I am analyzing ERA5 climate data spanning from 1950 to 2023, covering 73 years. The dataset has a daily temporal resolution and a spatial resolution of 0.25° (1440 x 720).
Each year's data is stored in its own NetCDF file containing approximately 365 daily layers.
Attempting to combine these into a single stack results in around 27000 layers, which is overwhelming for my computer, and even the workplace cluster struggles with this volume.
As I'm new to handling large-scale datasets, I'm seeking advice on best practices and effective solutions for managing and processing this type of data.
Note: I have a preference for working with dataframes over NetCDF files as I will perform time series analysis.
Note 2: different types of analysis will be performed in the future. The scope is very wide, from simple statistics to index computations, all at the grid-cell level.
Note 3: I have heard of processing the data in smaller chunks. Is there a good package to do so in R?
You have global coverage of ERA5 data over 73 years, giving you

1440 * 720 * 73 * 365.2425 = 27.6 billion

data points for each variable (like temperature, or precipitation). No computer system will process that without chunking.

First of all, forget about working with `data.frame`s on the whole data set. First you need to process it down to a reasonable size, and only then can you use a `data.frame` for further analysis (if you really need it).

Secondly, ERA5 data uses the `gregorian` or `standard` calendar. That means you can use the standard `Date` class, for instance to make factors with which to process the data. However, since ERA5 data conforms to the CF Metadata Conventions, you are better off with package CFtime, which gives you better control over the time series.
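As a minimal illustration with the base `Date` class (one yearly file of daily values is assumed here; CFtime provides similar factor-building facilities with better handling of the CF time coordinate):

```r
# A sketch: a factor that groups the ~365 daily layers of one yearly file
# into 12 months, built from the standard Date class
dates <- seq(as.Date("1950-01-01"), as.Date("1950-12-31"), by = "day")
fac <- factor(format(dates, "%Y-%m"))
levels(fac)   # "1950-01" "1950-02" ... "1950-12"
```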
Sample by location

An obvious option to break down the data set is to sample the time series by location, pixel by pixel. This is a bad choice, however, because your data is spread over 73 files and the storage organization inside each file makes this a very slow option: the values for any single pixel are scattered all over each file. An I/O nightmare, don't even consider it.
Temporal slice
Your best bet is to process each file individually and then assemble the yearly output into a data object over the entire time series. If you go from daily data to monthly statistics, you'll have about 900 million data points per statistic (e.g. monthly maximum T, monthly minimum T). Still very significant, but doable on a good computer. That would look somewhat like this:
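(A sketch using the ncdf4 package; the file names and the variable name `t2m` are assumptions, so adapt them to your own files.)

```r
library(ncdf4)

years <- 1950:2023

monthly_max <- lapply(years, function(yr) {
  # Read one year of daily data; file and variable names are hypothetical
  nc <- nc_open(sprintf("era5_t2m_%d.nc", yr))
  data <- ncvar_get(nc, "t2m")                  # 1440 x 720 x ~365 array
  nc_close(nc)

  # A factor with one level per month of this year, from the Date class
  dates <- seq(as.Date(sprintf("%d-01-01", yr)), by = "day",
               length.out = dim(data)[3])
  fac <- format(dates, "%Y-%m")

  # Collapse the daily layers to monthly maxima, cell by cell;
  # aperm() restores the (lon, lat, time) dimension order
  aperm(apply(data, 1:2, tapply, fac, max), c(2, 3, 1))
})

# Bind the yearly results into one 1440 x 720 x (12 * number of years) array
monthly_max <- array(unlist(monthly_max),
                     dim = c(1440, 720, 12 * length(years)))
```

The result has 12 layers per year per statistic, which is small enough to summarise further or to convert to a `data.frame` for your time-series analysis.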
That still gives you a very large object, but at least it'll be manageable. If you want different statistics you can write your own functions and plug them into the same call:

`aperm(apply(data, 1:2, tapply, fac, <<<here>>>), c(2, 3, 1))`
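For instance, a threshold-based index could be plugged in like this (the `hot_days` function and the 25 °C threshold are made up for illustration, and the values are assumed to already be in °C):

```r
# A custom statistic: number of days per month above a threshold.
# Note that ERA5 temperatures are typically stored in Kelvin, so convert
# the data or the threshold before using something like this.
hot_days <- function(x) sum(x > 25)

monthly_hot_days <- aperm(apply(data, 1:2, tapply, fac, hot_days), c(2, 3, 1))
```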