I am working with the raster and glcm packages to compute Haralick texture features on satellite imagery. I have successfully run the glcm() function on a single core and am now trying to run it in parallel. Here is the code I'm using:
# tiles is a list of raster extents; r is the input raster
library(raster)
library(glcm)
library(doParallel)  # also loads foreach

registerDoParallel(7)
# rmin/rmax are the global min/max of r, computed beforehand
out_raster <- foreach(i = 1:length(tiles), .combine = merge,
                      .packages = c("raster", "glcm")) %dopar%
  glcm(crop(r, tiles[[i]]), n_grey = 16, window = c(17, 17), shift = c(1, 1),
       min_x = rmin, max_x = rmax)
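For context, here is a minimal sketch of how a `tiles` list of extents could be built by splitting the raster's extent into horizontal strips. The `make_tiles` helper is a hypothetical illustration, not the function from the original post.

```r
library(raster)

# Hypothetical helper: split the extent of raster r into n horizontal strips.
# (Illustrative only -- not the original post's tiling code.)
make_tiles <- function(r, n) {
  e <- extent(r)
  ys <- seq(e@ymin, e@ymax, length.out = n + 1)
  lapply(seq_len(n), function(i) extent(e@xmin, e@xmax, ys[i], ys[i + 1]))
}

r <- raster(nrows = 100, ncols = 100)  # toy raster for illustration
tiles <- make_tiles(r, 7)
```

Note that for a moving-window statistic like a 17x17 GLCM, tiles generally need to overlap by about half the window size so the merged result has no seam artifacts; the sketch above omits that for brevity.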
When I examine the temp files that are created, it appears each step of the merge creates a temp file, which takes a lot of hard-drive space. [Screenshots: the overall image (2 GB), and two of the temp files, from merge step 1 and merge step 2.]
Because the glcm output for each tile is 3 GB, creating a temp file for each stepwise merge operation produces roughly 160 GB of temporary raster files. Is there a more space-efficient way to run this in parallel?
I managed to save hard-drive space by using GDAL and building VRTs. Below is the code I wrote, run on the example data from the glcm package. The steps were: 1) create VRT files of the tiles; 2) run the glcm function in parallel on each VRT tile (see the glcm_parallel function); 3) merge the tiles into a VRT and write the output raster using gdalwarp. The VRT files are very small, and the only temp files are those created by the glcm function itself, which should help a lot with large rasters.
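The mosaicking steps can be sketched with the gdalUtils wrappers around the GDAL command-line tools. The file names (`tile_tifs`, `out_tifs`, `mosaic.vrt`, `texture_mosaic.tif`) are illustrative assumptions, not the paths from the original glcm_parallel helpers, and running this requires GDAL to be installed.

```r
library(gdalUtils)  # wrappers around the GDAL binaries

# Illustrative input/output paths -- assumptions, not from the original post
tile_tifs <- sprintf("tile_%02d.tif", 1:7)      # the cropped input tiles
out_tifs  <- sprintf("glcm_out_%02d.tif", 1:7)  # the per-tile glcm results

# 1) Wrap each tile in a tiny .vrt file instead of copying pixels
vrt_tiles <- sub("\\.tif$", ".vrt", tile_tifs)
mapply(gdalbuildvrt, gdalfile = tile_tifs, output.vrt = vrt_tiles)

# 2) Run glcm() in parallel on each VRT tile, writing each result to its
#    own GeoTIFF (as in the foreach loop above), so only glcm's own temp
#    files are ever created.

# 3) Mosaic the per-tile outputs into a single VRT, then materialise one
#    output raster with gdalwarp
gdalbuildvrt(gdalfile = out_tifs, output.vrt = "mosaic.vrt")
gdalwarp("mosaic.vrt", "texture_mosaic.tif")
```

Because a VRT is just a small XML description pointing at the underlying pixels, steps 1 and 3 cost essentially no disk space; only the final gdalwarp writes a full-size raster.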
The two helper functions are linked here and here: