Fastest (best) way to extract raster values from >150 very large images for 4,000,000 unique objects (pixel locations)


I am trying to extract values for 4,000,000 change objects (groups of pixels of various sizes and shapes, spanning 39 years) from 150+ perfectly aligned data layers of a very large array (15812 x 67797). The image area can't be easily divided into tiles because of the unique shapes of the individual objects. I have the change objects in both raster and vector format, and each change object has a unique ID.

I have two working versions that are acceptably fast over smaller areas but are painfully slow at larger scales.

The first version uses rasterstats.zonal_stats() to extract the information.
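
For reference, the first version boils down to something like this (file names are placeholders and the stats list is just an example):

from rasterstats import zonal_stats

# <objects.shp> holds the change-object polygons for one year;
# <image.tif> is one of the 150+ aligned data layers
stats = zonal_stats("<objects.shp>", "<image.tif>", stats=["mean", "min", "max", "count"])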

In the second approach, I create a dictionary of the objects with the object ID as the key and the flat-array indices of that object's pixels (from numpy.where()) as the value. I have a dictionary for each year (~100,000 objects per year-dictionary). I pickle these yearly dictionaries for repeated use throughout the rest of the process, then use numpy.take() to extract the values from the various images at different stages.
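
The pickling step itself is simple, roughly like this (file names are illustrative):

import pickle

# save a year's index dictionary for reuse later in the process
with open(f"idx_{year}.pkl", "wb") as f:
    pickle.dump(IDX_ht, f)

# ...and reload it before each extraction pass
with open(f"idx_{year}.pkl", "rb") as f:
    IDX_ht = pickle.load(f)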

Here's a stripped-down version of the second approach (without pickling):

import rasterio as rio
import numpy as np

for yIDX, year in enumerate(TARGET_YEARS):
    IDX_ht = {}
    with rio.open(<filename>) as C:
        yids_in = C.read(yIDX + 1).reshape(-1)      # one band per year, flattened to 1-D
    yids_unique = np.unique(yids_in[yids_in > 0])   # object IDs present this year
    for yid in yids_unique:                         # this for loop is the bottleneck
        IDX_ht[yid] = np.where(yids_in == yid)[0]   # flat indices of this object's pixels

...

for img in images:
    resultHt = {}
    with rio.open(img) as D:
        data_in = D.read(1).reshape(-1)             # flatten to match the stored indices
    for obj_id in IDX_ht:
        tgt_idx = IDX_ht[obj_id]
        resultHt[obj_id] = np.take(data_in, tgt_idx)   # pull this object's pixel values

...

(I use np.put() to build my final results using the same IDX_ht object.)
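
That step is essentially the following (a minimal sketch; computed stands in for whatever per-object values I've derived upstream):

result = np.zeros_like(data_in)
for obj_id, new_vals in computed.items():
    np.put(result, IDX_ht[obj_id], new_vals)   # write back at the object's flat indices
result = result.reshape(15812, 67797)          # restore the original 2-D shape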

Both approaches work but are slow over larger areas. I have big machines for this processing, so resource contention is not an issue, and I have implemented multiprocessing to try to speed things up (a rough sketch of how I distribute the per-year work follows below). The bottleneck in the second approach is building the dictionary of array index values. Does anyone have suggestions for a more efficient approach? Thanks for the support.
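
For reference, the multiprocessing version distributes the per-year dictionary building roughly like this (simplified; build_year_index is a stand-in for my actual worker):

from multiprocessing import Pool

def build_year_index(args):
    # same logic as the per-year loop shown above, run in a worker process
    yIDX, year = args
    with rio.open(<filename>) as C:
        yids_in = C.read(yIDX + 1).reshape(-1)
    IDX_ht = {yid: np.where(yids_in == yid)[0]
              for yid in np.unique(yids_in[yids_in > 0])}
    return year, IDX_ht

if __name__ == "__main__":   # guard needed on spawn-based platforms
    with Pool() as pool:
        yearly_indexes = dict(pool.map(build_year_index, list(enumerate(TARGET_YEARS))))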
