Speeding up Dask array compute time (convert to numpy array)


I want to extract Sentinel-1-RTC satellite data and use it as input for either a Keras CNN or an SKLearn model (this is for the ongoing EY Open Data Science Challenge 2023). Loading the pixel data directly takes a long time, so I opted to load the VV and VH bands lazily as Dask arrays instead. Here is sample code for a single coordinate point:

import pystac_client
import planetary_computer as pc
from odc.stac import stac_load

# Centre coordinate (lat, lon) and a small bounding box around it
latlong = (10.323727047081501, 105.2516346045924)

box_size_deg = 0.002

min_lon = float(latlong[1]) - box_size_deg/2
min_lat = float(latlong[0]) - box_size_deg/2
max_lon = float(latlong[1]) + box_size_deg/2
max_lat = float(latlong[0]) + box_size_deg/2

bbox = (min_lon, min_lat, max_lon, max_lat)
time_slice = "2022-01-01/2022-12-31"
scale = 10/111320.0  # 10 m pixel size expressed in degrees

# Search the Planetary Computer STAC catalogue for Sentinel-1-RTC scenes
catalog = pystac_client.Client.open(
        "https://planetarycomputer.microsoft.com/api/stac/v1")

search = catalog.search(
        collections=["sentinel-1-rtc"], bbox=bbox, datetime=time_slice)

items = search.get_all_items()

# Lazily load the two polarisation bands as Dask-backed arrays
assets = ["vv", "vh"]
test = stac_load(items, patch_url=pc.sign, bbox=bbox, bands=assets,
                 chunks={}, crs="EPSG:4326", resolution=scale)

print(test)

The output is as follows:

<xarray.Dataset>
Dimensions:      (latitude: 23, longitude: 23, time: 2)
Coordinates:
  * latitude     (latitude) float64 10.32 10.32 10.32 ... 10.32 10.32 10.32
  * longitude    (longitude) float64 105.3 105.3 105.3 ... 105.3 105.3 105.3
    spatial_ref  int32 4326
  * time         (time) datetime64[ns] 2022-01-09T22:46:06.347730 2022-01-10T...
Data variables:
    vh           (time, latitude, longitude) float32 dask.array<chunksize=(1, 23, 23), meta=np.ndarray>
    vv           (time, latitude, longitude) float32 dask.array<chunksize=(1, 23, 23), meta=np.ndarray>

The Dask arrays for the "vh" and "vv" variables are only about 118 kiB.
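That figure can be read straight off the lazy arrays' metadata, without computing anything:

# Size is derived from shape and dtype alone; no data is downloaded
print(test.vv.data.nbytes / 1024, "kiB")
print(test.vh.data.nbytes / 1024, "kiB")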

I would like to convert the Dask arrays to NumPy arrays using test.compute(), but this takes more than 40 seconds on my local machine. Since I have 600 coordinate points to process, that is not practical. The task graph for test.vv.data is shown below:

Task graph (zoom in for details): [task graph image]
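From what I can tell, the tasks in this graph are mostly small HTTP reads of the remote assets, so I suspect the time is dominated by network latency rather than data volume. One idea would be to run the compute under a local distributed client so those reads overlap; a rough, untested sketch (the worker and thread counts are guesses, not tuned values):

import dask.distributed

# Untested sketch: a local cluster lets the per-scene HTTP reads run
# concurrently instead of one after another. Worker counts are guesses.
client = dask.distributed.Client(n_workers=4, threads_per_worker=2)
test_np = test.compute()  # xarray hands the graph to the active client
client.close()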

How can I speed up the conversion from Dask to NumPy arrays?

I have tried rechunking the Dask array, but it does not reduce the time taken. I am also open to suggestions for using the Dask arrays directly as input to my model; a sketch of one idea follows.
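For example, would it make sense to build all 600 per-point datasets lazily first and trigger a single compute, so the scheduler can overlap the network fetches across points? A rough sketch of what I have in mind, where load_point is a hypothetical wrapper around the search-and-load code above:

import dask
import pystac_client
import planetary_computer as pc
from odc.stac import stac_load

def load_point(latlong, box_size_deg=0.002, scale=10/111320.0):
    # Hypothetical helper: same search + lazy load as above, for one point
    min_lon = latlong[1] - box_size_deg/2
    min_lat = latlong[0] - box_size_deg/2
    max_lon = latlong[1] + box_size_deg/2
    max_lat = latlong[0] + box_size_deg/2
    bbox = (min_lon, min_lat, max_lon, max_lat)
    catalog = pystac_client.Client.open(
        "https://planetarycomputer.microsoft.com/api/stac/v1")
    search = catalog.search(collections=["sentinel-1-rtc"], bbox=bbox,
                            datetime="2022-01-01/2022-12-31")
    return stac_load(search.get_all_items(), patch_url=pc.sign, bbox=bbox,
                     bands=["vv", "vh"], chunks={}, crs="EPSG:4326",
                     resolution=scale)

points = [(10.323727047081501, 105.2516346045924)]  # my 600 coordinates
lazy = [load_point(p) for p in points]      # cheap: nothing is downloaded yet
computed = dask.compute(*lazy)              # one graph, so fetches can overlap
arrays = [ds.vv.values for ds in computed]  # plain NumPy arrays for the model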
