I want to extract Sentinel-1 RTC satellite data and use it as input to either a Keras CNN or a scikit-learn model (this is for the ongoing EY Open Data Science Challenge 2023). Loading the pixel data directly takes a long time, so I opted to load the VV and VH bands lazily as Dask arrays. The following is sample code for a single coordinate point:
import pystac_client
import planetary_computer as pc
from odc.stac import stac_load
# Define a small (~0.002 deg) bounding box around the coordinate point
latlong = (10.323727047081501, 105.2516346045924)
box_size_deg = 0.002
min_lon = float(latlong[1]) - box_size_deg / 2
min_lat = float(latlong[0]) - box_size_deg / 2
max_lon = float(latlong[1]) + box_size_deg / 2
max_lat = float(latlong[0]) + box_size_deg / 2
bbox = (min_lon, min_lat, max_lon, max_lat)
time_slice = "2022-01-01/2022-12-31"
scale = 10 / 111320.0  # ~10 m pixel size expressed in degrees
catalog = pystac_client.Client.open(
"https://planetarycomputer.microsoft.com/api/stac/v1")
search = catalog.search(
    collections=["sentinel-1-rtc"], bbox=bbox, datetime=time_slice)
items = search.get_all_items()
# chunks={} keeps the bands as lazy Dask arrays instead of loading pixels
test = stac_load(items, patch_url=pc.sign, bbox=bbox, bands=["vv", "vh"],
                 chunks={}, crs="EPSG:4326", resolution=scale)
print(test)
The output is as follows:
<xarray.Dataset>
Dimensions: (latitude: 23, longitude: 23, time: 2)
Coordinates:
* latitude (latitude) float64 10.32 10.32 10.32 ... 10.32 10.32 10.32
* longitude (longitude) float64 105.3 105.3 105.3 ... 105.3 105.3 105.3
spatial_ref int32 4326
* time (time) datetime64[ns] 2022-01-09T22:46:06.347730 2022-01-10T...
Data variables:
vh (time, latitude, longitude) float32 dask.array<chunksize=(1, 23, 23), meta=np.ndarray>
vv (time, latitude, longitude) float32 dask.array<chunksize=(1, 23, 23), meta=np.ndarray>
The Dask arrays for the "vh" and "vv" variables are each only about 118 KiB.
I would like to convert the Dask arrays to NumPy arrays using test.compute(), but this takes more than 40 seconds on my local machine. I have 600 coordinate points to process, so this is not ideal. The task graph for the Dask array test.vv.data is shown below:
[Task graph image omitted; zoom in for details]
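For context, my per-point loop looks roughly like this (a minimal sketch; load_point is a hypothetical wrapper around the search and stac_load code above, and latlongs stands in for my list of 600 coordinates):
import time
import numpy as np

features = []
for latlong in latlongs:  # hypothetical list of ~600 coordinate points
    ds = load_point(latlong)  # search + stac_load as in the snippet above
    t0 = time.perf_counter()
    computed = ds.compute()  # materialise the Dask arrays; this is the >40 s step
    print(f"compute() took {time.perf_counter() - t0:.1f} s")
    # Stack VV and VH into a (time, lat, lon, 2) block as model input
    features.append(np.stack([computed.vv.values, computed.vh.values], axis=-1))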
How can I speed up the conversion from the Dask arrays to NumPy arrays?
I have tried rechunking the Dask array, but it does not reduce the time taken. I am also open to suggestions for using the Dask arrays directly as input to my model.
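For reference, the rechunking attempt mentioned above looked roughly like this (a sketch; the single-chunk layout is just one of the chunkings I tried):
# Merge each variable into a single chunk before computing;
# this reshapes the task graph but did not reduce the compute() time.
rechunked = test.chunk({"time": -1, "latitude": -1, "longitude": -1})
arr = rechunked.compute()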