I have a large multidimensional xarray.DataArray. It has 4 dimensions, one of which is time.
Time is measured in seconds, and for many of these times all values (across all other dimensions) are zero. Since this data array quickly gets very large, I was hoping to avoid storing all the zeros (to make it sparse). Below is how the DataArray is built:
dataarray = xr.DataArray(data=data, coords=[time, coord1, coord2, coord3], dims=['time', 'coord1', 'coord2', 'coord3'])
data is a NumPy array, initialized as:
data = np.zeros((len(time), len(coord1), len(coord2), len(coord3)))
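Put together, a small self-contained version of this setup might look like the sketch below. The coordinate values, the sizes, and the single non-zero entry are made up purely for illustration; the real arrays are much larger.

```python
import numpy as np
import xarray as xr

# Hypothetical coordinates, chosen only to keep the example small
time = np.arange(10)      # seconds
coord1 = np.arange(3)
coord2 = np.arange(4)
coord3 = np.arange(5)

# Everything starts out as zero ...
data = np.zeros((len(time), len(coord1), len(coord2), len(coord3)))
# ... and only a few timesteps ever receive non-zero values
data[4, 0, 0, 0] = 1.0

dataarray = xr.DataArray(
    data=data,
    coords=[time, coord1, coord2, coord3],
    dims=['time', 'coord1', 'coord2', 'coord3'],
)
```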
I have found a solution that removes all timesteps containing only zeros (it seems to work in my preliminary tests). It does at least reduce the memory size of the DataArray by a factor of 10. However, it is so slow that it is not workable, because it would have to run many times:
times_to_drop = [timestamp for timestamp in dataarray.time.values[2:(len(dataarray.time.values)-1)] if not np.any(dataarray.sel(time=timestamp).values)]
dataarray = dataarray.drop_sel(time=times_to_drop)
I deliberately keep the first two timesteps and the last one, so that I can use them to infer the timestep, start time, and end time.
My question: can this be done faster (a lot faster), either by improving my own solution or by using a completely different one? I am building on existing software, so I would rather consider solutions that build on this xarray.DataArray implementation than complete overhauls.
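For reference, here is a minimal sketch of one vectorized direction, not a benchmarked solution. It assumes the array fits in memory and that "zero" means exactly equal to 0. The helper name drop_all_zero_times and its keep_first / keep_last parameters are hypothetical. Instead of calling sel once per timestamp, it reduces over the non-time dimensions in a single pass and selects with a boolean mask:

```python
import numpy as np
import xarray as xr

def drop_all_zero_times(da, keep_first=2, keep_last=1):
    """Hypothetical helper: drop timesteps whose values are all zero.

    The first `keep_first` and last `keep_last` timesteps are always kept,
    mirroring the slice [2:len-1] used in the list comprehension.
    """
    other_dims = [d for d in da.dims if d != 'time']
    # True for every timestep that has at least one non-zero value
    keep = (da != 0).any(dim=other_dims).values.copy()
    keep[:keep_first] = True
    keep[len(keep) - keep_last:] = True
    # Boolean indexing along the time dimension
    return da.isel(time=keep)

# Small made-up example: 6 timesteps, only timestep 3 is non-zero
example = np.zeros((6, 2, 2, 2))
example[3, 0, 1, 0] = 5.0
da = xr.DataArray(
    example,
    coords=[np.arange(6), np.arange(2), np.arange(2), np.arange(2)],
    dims=['time', 'coord1', 'coord2', 'coord3'],
)
reduced = drop_all_zero_times(da)
print(reduced.time.values)  # [0 1 3 5]
```

The mask is computed once for the whole array, so the cost is a single reduction over the data rather than one sel call per timestamp.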