I need to summarize a huge CSV file (nrow =1102300). This is daily climate data from various climatic models.

First, I want to sum up all the columns with "historical" names. My target is the maximum values filtered by Date (in Year Year (i.e. 1950, 1951, etc ) for all the unique "lat" and "lon."

All help will be appreciated.

The dataframe is like this:

df = read.csv(text = '"lat","lon","Date","pr_CMCC.ESM2_historical","pr_GFDL.ESM4_historical_ssp126","pr_BCC.CSM2.MR_historical_ssp126","pr_INM.CM4.8_historical_ssp126","pr_FGOALS.g3_historical_ssp126","pr_TaiESM1_historical_ssp126","pr_NorESM2.MM_historical_ssp126","pr_CanESM5_historical_ssp126","pr_KIOST.ESM_historical_ssp126","pr_NorESM2.LM_historical_ssp126","pr_INM.CM5.0_historical_ssp126"
46.29166646,-62.62500314,1/1/1950 12:00,1.7243347,6.10E-05,6.10E-05,2.5483093,1.7853699,6.10E-05,1.846405,6.10E-05,1.4954529,1.4496765,3.769043
46.29166646,-62.62500314,1/2/1950 12:00,6.10E-05,6.10E-05,6.10E-05,9.24704,6.10E-05,12.741333,6.10E-05,6.424103,0.56463623,6.10E-05,1.1139832
46.29166646,-62.62500314,1/3/1950 12:00,6.10E-05,6.10E-05,6.10E-05,6.10E-05,6.10E-05,1.052948,6.10E-05,1.1445007,6.10E-05,6.10E-05,6.10E-05
46.29166646,-62.62500314,1/4/1950 12:00,7.965271,6.10E-05,6.10E-05,6.5919495,1.9684753,6.10E-05,6.10E-05,1.4191589,6.10E-05,0.70196533,3.9368896',header = TRUE)

I want my final output dataframe arrainged like this:

lat | lon | Value

Where Value = Maximum value for each year.

2

There are 2 best solutions below

1
r2evans On BEST ANSWER

Perhaps this:

df |>
  mutate(
    Date = as.POSIXct(Date, format = "%m/%d/%Y %H:%M"),
    Year = format(Date, format = "%Y")
  ) |>
  summarize(
    .by = c("lon", "lat", "Year"),
    Value = max(rowSums(pick(matches("historical"))))
  )
#       lon      lat Year    Value
# 1 -62.625 46.29167 1950 30.09146

This assumes that your lon and lat are relatively invariant, though even with that sometimes grouping by floating point numbers may be problematic (see Why are these numbers not equal? and https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f). I suggest that if the rest of your data has any variance in lon and lat for each year, then there are likely different things you should be doing to aggregate your data.


The use of .by= requires dplyr_1.1.0 or newer; if you have an older version, change from summarize(.by=c(..), stuff) to group_by(..) |> summarize(stuff) |> ungroup().

0
Jon Spring On
library(tidyverse) # presumes tidyverse 2.0+ so includes dplyr 1.1.0+ for ".by"
                   # and lubridate included with tidyverse load
df |>
  pivot_longer(contains("historical")) |>
  mutate(yr = year(mdy_hm(Date))) |>
  summarize(value = max(value), .by = c(lat, lon, yr))

Result

# A tibble: 1 × 4
    lat   lon    yr value
  <dbl> <dbl> <dbl> <dbl>
1  46.3 -62.6  1950  12.7