I have a data frame of buildings, each with a spatial (point) geometry attached. I also have a second data set of, e.g., forests, stored similarly as polygons.
Total_df:
| Id | Variables | SFC Point object | Zip codes |
|---|---|---|---|
| 1 | 10 | POINT (543611.8 6389285) | 2324 |
| 2 | 15 | POINT (513611.8 6349285) | 2324 |
| 3 | 12 | POINT (533611.8 6359285) | 2329 |

Total_df has about 2 million observations.
forest_distance:
| Id | Variables | SFC Polygon object |
|---|---|---|
| 1 | 10 | POLYGON Z ((455302.7 6252026 9.09, 455292.6 6252034 9.09, 455274.8 6252036 9.9, 455246 6252113 14.25, 455286.1 6252124 14.15, 455293.5 6252126 14.13, 455317.8 6252068 14.13, 455331.5 6252073 14.13, 455345.5 6252044 14.78, 455302.7 6252026 9.09)) |

forest_distance is stored in a list (list.dfs in the code below), where the original forest_distance has been divided into 10 equal parts.
I have figured out how to calculate the distances, and I have also split up Total_df so that the calculation runs on smaller subsets defined by zip code.
Now, to speed up the calculation, I want to parallelise it and also subdivide forest_distance into smaller parts.
I imagine it would be quicker if each parallel session handled one part of the subdivided forest_distance.
Also, is it possible to print from the different sessions to see the progress?

    registerDoParallel(cores = 6)
    # Use foreach to loop over list.dfs in parallel
    foreach(d = 1:length(list.dfs), .packages = "sf", .combine = 'c') %dopar% {
      # Get the data frame at position 'd' in the list
      df <- list.dfs[[d]]
      # Open a list to store combined inner results
      grand_list <- list()
      # Initialize an empty list to store the results of the inner loop
      inner_results <- list()
      # zip_code
      zipcode <- sort(unique(Total_df$zipcode))
      # Use a regular for loop to iterate over zipcode
      for(i in zipcode) {
        cat(i, "\n")
        start_time <- Sys.time()
        # Subset the data
        subset_df <- Total_df[Total_df$zipcode == i, ]
        if(nrow(subset_df) > 0) {
          # Calculate distances
          distances <- sf::st_distance(subset_df, df)
          # Define the 'miin' function, or replace it with an appropriate function
          miin <- function(x) min(x, na.rm = TRUE)
          # Calculate minimum distances
          min_distances <- apply(distances, 1, miin)
          # Store minimum distances in a new column
          subset_df$min_distances <- min_distances
        }
        end_time <- Sys.time()
        print(paste("Time for municipality Forest", i, ": ", end_time - start_time))
        # Store the updated subset_df in the inner_results list
        inner_results[[i]] <- subset_df
      }
      # Combine the results of the inner loop using do.call
      grand_list[[d]] <- do.call(rbind, inner_results)
    }

It ran for many hours and I had to stop it, but it had not saved any results along the way.
This is untried (and the prints that you do may not show up), but a rewrite along these lines might work:
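A minimal sketch of what such a rewrite could look like, not a tested solution for the real data. It assumes that list.dfs holds the forest parts and that the zip code column is called zipcode, as in the question's code; the toy points and squares, out_dir, and the part_XX.rds file names are made up for illustration. Each task handles one forest part, loops over zip codes to keep the distance matrices small, writes its result to disk so that a stopped run is not lost, and returns one minimum per building. .combine = pmin then keeps the smallest value across all parts.

```r
library(sf)
library(foreach)
library(doParallel)

# Toy stand-ins for the real data (assumed shapes, not the asker's files)
Total_df <- st_as_sf(
  data.frame(Id = 1:3, zipcode = c(2324, 2324, 2329),
             x = c(0, 10, 20), y = c(0, 0, 0)),
  coords = c("x", "y"), crs = 25832
)
make_square <- function(x0, y0) {
  st_polygon(list(rbind(c(x0, y0), c(x0 + 1, y0), c(x0 + 1, y0 + 1),
                        c(x0, y0 + 1), c(x0, y0))))
}
forest <- st_sf(Id = 1:2,
                geometry = st_sfc(make_square(0, 5), make_square(20, 3),
                                  crs = 25832))
list.dfs <- list(forest[1, ], forest[2, ])  # list of forest parts

out_dir <- tempdir()  # hypothetical output folder for intermediate results

registerDoParallel(cores = 2)

# One task per forest part; each task returns one value per building
min_dist <- foreach(d = seq_along(list.dfs), .packages = "sf",
                    .combine = pmin) %dopar% {
  forest_part <- list.dfs[[d]]
  part_min <- numeric(nrow(Total_df))
  # Loop over zip codes so each distance matrix stays small
  for (z in unique(Total_df$zipcode)) {
    idx <- which(Total_df$zipcode == z)
    dists <- st_distance(Total_df[idx, ], forest_part)
    part_min[idx] <- as.numeric(apply(dists, 1, min))
  }
  # Save this part's result so an interrupted run keeps what is finished
  saveRDS(part_min, file.path(out_dir, sprintf("part_%02d.rds", d)))
  part_min
}

stopImplicitCluster()
Total_df$min_distances <- as.numeric(min_dist)
```

With the real data, out_dir would point to a permanent folder, and the saved parts could be read back with readRDS and combined with pmin if the run is interrupted.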
Tip: Debug your code with %do% instead of %dopar%, and run only the first two values:
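For instance, a debug run could look roughly like this. It is only a sketch: list.dfs here is a hypothetical stand-in made of plain integer vectors, and sum() is a placeholder for the real distance work.

```r
library(foreach)

# Hypothetical stand-in for the list of forest parts
list.dfs <- list(a = 1:3, b = 4:6, c = 7:9)

# %do% runs sequentially in the current session, so cat()/print() output
# and error messages appear directly in the console
res <- foreach(d = seq_along(list.dfs)[1:2], .combine = c) %do% {
  cat("processing part", d, "\n")
  sum(list.dfs[[d]])  # placeholder for the real distance calculation
}
```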
Fill this with debug statements etc. to your heart's content. When it works, remove the [1:2] and change %do% back to %dopar%.
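As for seeing progress from the parallel sessions: one option that does not depend on console output being forwarded is to let each worker append a line to a log file. The sketch below assumes a hypothetical log_file path and uses d^2 as a placeholder for the real work. (Alternatively, makeCluster() has an outfile argument, and outfile = "" leaves worker output unredirected, but whether it shows up depends on how R is being run.)

```r
library(foreach)
library(doParallel)

log_file <- file.path(tempdir(), "progress.log")  # hypothetical location
if (file.exists(log_file)) file.remove(log_file)

cl <- makeCluster(2)
registerDoParallel(cl)

res <- foreach(d = 1:3, .combine = c) %dopar% {
  # Each worker appends one line when it finishes its part
  msg <- paste(format(Sys.time()), "finished part", d)
  cat(msg, "\n", sep = "", file = log_file, append = TRUE)
  d^2  # placeholder for the real distance calculation
}

stopCluster(cl)
progress <- readLines(log_file)
```

While the loop is running, the file can be opened or followed from another session to see which parts are done.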