Combining Xarray datasets is too slow


I am working with satellite data that contains measurements over different locations at different times. The data is NetCDF, loaded as xarray Datasets. I'd like a "union" of the measurements over time, so that I get the full spatial coverage for a given time window. I understand that xarray.merge can help here, but none of the values for the compat argument seem to fit my case: measurements from different times are likely to overlap spatially, and where they do I want the latest value. If I have misunderstood something here, please enlighten me.

The way I tried to do it was to use xarray.Dataset.combine_first, which merges two datasets as a union while keeping the values of the first dataset wherever both have data. I apply this repeatedly across all the time steps I have. The function is:

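import xarray as xr

def combine_in_time(ds: xr.Dataset,
                    start: str,
                    end: str,
                    varname: str) -> xr.DataArray:
    """Return a DataArray with values combined over time.

    Where time steps overlap spatially, the latest value wins.
    """
    ds = ds.sortby('time')
    times = ds.time.sel(time=slice(start, end))
    # Newest first: combine_first keeps the values of the array it is called
    # on, so the latest measurement takes priority over older ones.
    da_list = [ds[varname].sel(time=time) for time in times.values[::-1]]

    if len(da_list) == 0:
        raise ValueError(f"no time steps between {start} and {end}")

    # Fill the gaps of the running result with each older time step in turn.
    combined = da_list[0]
    for da in da_list[1:]:
        combined = combined.combine_first(da)
    return combined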
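
For reference, the same pairwise combination can be written as a fold with functools.reduce instead of explicit recursion. This is only a sketch on made-up toy data: the helper name combine_first_fold, the variable name "measurement" and the dimension "x" are hypothetical, and missing observations are assumed to be stored as NaN. It still calls combine_first once per time step, so it is not expected to be faster by itself; it mainly makes the priority order explicit.

```python
from functools import reduce

import numpy as np
import xarray as xr


def combine_first_fold(slices):
    """Fold a list of DataArrays with combine_first; earlier entries win."""
    if not slices:
        raise ValueError("no time steps to combine")
    return reduce(lambda acc, nxt: acc.combine_first(nxt), slices)


# Hypothetical toy data: 3 time steps over 4 locations, NaN = not observed.
da = xr.DataArray(
    np.array([[1.0, np.nan, np.nan, np.nan],
              [10.0, 20.0, np.nan, np.nan],
              [np.nan, 200.0, 300.0, np.nan]]),
    dims=("time", "x"),
    coords={
        "time": np.array(["2020-01-01", "2020-01-02", "2020-01-03"],
                         dtype="datetime64[ns]"),
        "x": [0, 1, 2, 3],
    },
    name="measurement",
)

# Newest first, so the latest observation takes priority.
# drop=True removes the scalar time coordinate from each slice.
slices = [da.isel(time=i, drop=True)
          for i in reversed(range(da.sizes["time"]))]
latest = combine_first_fold(slices)
print(latest.values)  # [ 10. 200. 300.  nan]
```

Reversing the list is what turns "keep the first dataset's values" into "keep the latest values".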

This works, but very slowly. My guess is that it has to combine the data many times, once for each additional time coordinate. What can be done to speed this up? Is there a way to merge the data from all time coordinates in one go? It also seems like xarray.Dataset.reduce might help in my case, but I am not sure how to implement the function it needs...
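
To make the reduce idea concrete, here is a minimal sketch of what such a function could look like. It assumes that all time steps already share one grid in a single (time, ...) array, that the variable is floating point with NaN marking "not observed", and that the data fits in memory as NumPy arrays. The names last_valid and combine_latest and the toy variable "measurement" are hypothetical. The reduction picks, for every location, the last non-NaN value along time in one vectorized pass rather than combining arrays pair by pair.

```python
import numpy as np
import xarray as xr


def last_valid(values, axis, **kwargs):
    """Return the last non-NaN value along `axis` (NaN if none is valid)."""
    if isinstance(axis, tuple):  # defensive: accept a 1-tuple axis as well
        (axis,) = axis
    valid = ~np.isnan(values)
    # Flip along time so argmax finds the first valid entry from the end.
    steps_from_end = np.argmax(np.flip(valid, axis=axis), axis=axis)
    idx = values.shape[axis] - 1 - steps_from_end
    picked = np.take_along_axis(values, np.expand_dims(idx, axis), axis=axis)
    picked = np.squeeze(picked, axis=axis)
    return np.where(valid.any(axis=axis), picked, np.nan)


def combine_latest(ds, start, end, varname):
    """Collapse the time window [start, end] to the latest valid values."""
    da = ds[varname].sortby("time").sel(time=slice(start, end))
    if da.sizes["time"] == 0:
        raise ValueError(f"no time steps between {start} and {end}")
    return da.reduce(last_valid, dim="time")


# Hypothetical toy dataset; the time steps are deliberately out of order.
ds = xr.Dataset(
    {"measurement": (("time", "x"),
                     np.array([[10.0, 20.0, np.nan, np.nan],
                               [np.nan, 200.0, 300.0, np.nan],
                               [1.0, np.nan, np.nan, np.nan]]))},
    coords={
        "time": np.array(["2020-01-02", "2020-01-03", "2020-01-01"],
                         dtype="datetime64[ns]"),
        "x": [0, 1, 2, 3],
    },
)

result = combine_latest(ds, "2020-01-01", "2020-01-03", "measurement")
print(result.values)  # [ 10. 200. 300.  nan]
```

Whether this is actually faster on real data depends on array sizes and chunking, so it is worth timing against the combine_first version.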

Any help on this question would be greatly appreciated :)
