I am reading a ZARR file from a s3 bucket with xarray. I got to successfully filter by time and latitude/longitude:
def read_zarr(self, dataset: str, region: Region) -> Any:
# Read ZARR from s3 bucket
fs = s3fs.S3FileSystem(key="KEY", secret="SECRET")
mapper = fs.get_mapper(f"{self.S3_PATH}{dataset}")
zarr_ds = xr.open_zarr(mapper, decode_times=True)
# Filter by time
time_period = pd.date_range("2013-01-01", "2023-01-31")
zarr_ds = zarr_ds.sel(time=time_period)
# Filter by latitude/longitude
region_gdf = region.geo_data_frame
latitude_slice = slice(region_gdf.bounds.miny[0], region_gdf.bounds.maxy[0])
longitude_slice = slice(region_gdf.bounds.minx[0], region_gdf.bounds.maxx[0])
return zarr_ds.sel(latitude=latitude_slice, longitude=longitude_slice)
The problem is that this returns a rectangle of data (actually a cuboid, if we consider the time dimension). For geographical regions that are long and thin, this will represent a huge waste, as I will first download years of data, to then discard most of it. Example with California:
I would like to intersect the ZARR coordinates with the region ones. How can I achieve it?

One option is to use
rioxarrayand then clip to your region, very similar to this thread.You will need to set the coordinate system. Assuming your data are given as standard lat/lon, this is epsg 4326 (see opening sentences here). You may need to use
region_gdf.to_crs('epsg:4326')to make sure that the coordinate systems are the same (orset_crsif it doesn't have a coordinate reference system already).Once this formatting is done, you can use the
rio.clipmethod and setdrop=Trueto only keep the points in the region.New lines below - it might not work exactly as written because I can't test it, but I think it should get you close.