Solutions for handling a large-scale climate dataset in R

I am analyzing ERA5 climate data spanning from 1950 to 2023, covering 73 years. The dataset has a daily temporal resolution and a spatial resolution of 0.25° (1440 x 720).

Each year's data is stored in a separate NetCDF file containing approximately 365 layers (one per day).

Attempting to combine these into a single stack results in around 27000 layers, which is overwhelming for my computer, and even the workplace cluster struggles with this volume.

As I'm new to handling large-scale datasets, I'm seeking advice on best practices and effective solutions for managing and processing this type of data.

Note: I have a preference for working with dataframes over NetCDF files as I will perform time series analysis.

Note 2: different types of analysis will be performed in the future, ranging from simple statistics to index computations, all at the grid-cell level.

Note 3: I have heard of processing the data in smaller chunks. Is there a good package to do so in R?

There is 1 answer below

Patrick

You have global coverage of ERA5 data over 73 years, giving you 1440 * 720 * 73 * 365.2425 = 27.6 billion data points for each variable (like temperature, or precipitation). No computer system will process that without chunking.
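For perspective, the back-of-the-envelope arithmetic in R:

# Number of values for one variable over the full period
1440 * 720 * 73 * 365.2425 / 1e9   # ~27.6 billion values
# Stored as 8-byte doubles that is roughly 220 GB if held in memory at once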

First of all, forget about working with data.frames on the whole data set. You need to process it down to a reasonable size first; only then can you use a data.frame for further analysis (if you really need one).

Secondly, ERA5 data uses the Gregorian or standard calendar. That means you can use the standard Date class, for instance to make factors with which to process the data. However, since ERA5 data conforms to the CF Metadata Conventions, you are better off with the CFtime package, which gives you better control over the time series.
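For illustration, a minimal base-R sketch of the Date route (one example year, dates made up); the CFtime approach used below derives the same kind of grouping directly from the file's own time axis:

# Minimal sketch, base R only: with a standard (Gregorian) calendar a
# year-month grouping factor can be built straight from Date objects
days <- seq(as.Date("1950-01-01"), as.Date("1950-12-31"), by = "day")
fac  <- factor(format(days, "%Y-%m"))   # 12 levels: "1950-01" ... "1950-12"
table(fac)                              # number of days in each month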

Sample by location

An obvious option to break down the data set is to sample the time series by location, pixel by pixel. This is a bad choice, however: your data is spread over 73 files, and the storage layout inside each file means that the time series of a single pixel is scattered all over the file. It is an I/O nightmare; don't even consider it.

Temporal slice

Your best bet is to process each file individually and then assemble the yearly output into a data object over the entire time series. If you go from daily data to monthly statistics, you'll have about 900 million data points per statistic (e.g. monthly maximum T, monthly minimum T). Still very significant, but doable on a good computer. That would look somewhat like this:

# Attach necessary packages
library(ncdf4)
library(CFtime)
library(abind)

# Get a list of your 73 ERA5 files
lf <- list.files("~/your/path/to/ERA5", "\\.nc$", full.names = TRUE)

# Open the files sequentially and process
mon <- lapply(lf, function(fn) {
  # Open the file and create a CFtime object
  nc <- nc_open(fn)
  cf <- CFtime(nc$dim$time$units, nc$dim$time$calendar, nc$dim$time$vals)

  # Create a factor to make monthly statistics
  fac <- CFfactor(cf, "month")

  # Read the data and reduce it to a monthly statistic, here the monthly maximum T
  data <- ncvar_get(nc, "t2m")
  data <- aperm(apply(data, 1:2, tapply, fac, max), c(2, 3, 1))

  # Label the dimensions: longitude, latitude, year-month
  dimnames(data) <- list(as.vector(nc$dim$longitude$vals),
                         as.vector(nc$dim$latitude$vals),
                         levels(fac))
  nc_close(nc)
  data
})

# `mon` is a list with 73 elements, each with 12 layers of monthly max T
# Now put them all into 1 object
all_mon <- abind(mon, along = 3)

That still gives you a very large object, but at least it'll be manageable. If you want different statistics, you can write your own function and plug it in at the <<<here>>> position: aperm(apply(data, 1:2, tapply, fac, <<<here>>>), c(2, 3, 1)).
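A minimal sketch of both ideas, assuming the objects from the code above (data, fac, all_mon); the helper name monthly_stat and the spatial subset are made up for illustration:

# Hypothetical helper: swap any per-month statistic (max, min, mean, your own
# function) into the <<<here>>> slot of the pattern above
monthly_stat <- function(data, fac, fun, ...) {
  aperm(apply(data, 1:2, tapply, fac, fun, ...), c(2, 3, 1))
}
# Inside the lapply() above you would then call, for instance:
# data <- monthly_stat(data, fac, mean)

# If you really want a data.frame for time series analysis, melt the reduced
# array into long format. The full array is ~900 million rows, so subset it
# (a region, or a sample of grid cells) before converting.
sub <- all_mon[1:10, 1:10, ]                  # small spatial subset, for example
df  <- as.data.frame(as.table(sub), responseName = "tmax")
names(df)[1:3] <- c("lon", "lat", "month")
df$lon <- as.numeric(as.character(df$lon))
df$lat <- as.numeric(as.character(df$lat))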