I want to summarize a huge data frame in R in such a way that I only need unique "lat", "lon", "Date (Year)" and "Maxium Value"

29 Views Asked by AdeB At 28 March 2024 at 23:24

I need to summarize a huge CSV file (nrow =1102300). This is daily climate data from various climatic models.

First, I want to sum up all the columns with "historical" names. My target is the maximum values filtered by Date (in Year Year (i.e. 1950, 1951, etc ) for all the unique "lat" and "lon."

All help will be appreciated.

The dataframe is like this:

df = read.csv(text = '"lat","lon","Date","pr_CMCC.ESM2_historical","pr_GFDL.ESM4_historical_ssp126","pr_BCC.CSM2.MR_historical_ssp126","pr_INM.CM4.8_historical_ssp126","pr_FGOALS.g3_historical_ssp126","pr_TaiESM1_historical_ssp126","pr_NorESM2.MM_historical_ssp126","pr_CanESM5_historical_ssp126","pr_KIOST.ESM_historical_ssp126","pr_NorESM2.LM_historical_ssp126","pr_INM.CM5.0_historical_ssp126"
46.29166646,-62.62500314,1/1/1950 12:00,1.7243347,6.10E-05,6.10E-05,2.5483093,1.7853699,6.10E-05,1.846405,6.10E-05,1.4954529,1.4496765,3.769043
46.29166646,-62.62500314,1/2/1950 12:00,6.10E-05,6.10E-05,6.10E-05,9.24704,6.10E-05,12.741333,6.10E-05,6.424103,0.56463623,6.10E-05,1.1139832
46.29166646,-62.62500314,1/3/1950 12:00,6.10E-05,6.10E-05,6.10E-05,6.10E-05,6.10E-05,1.052948,6.10E-05,1.1445007,6.10E-05,6.10E-05,6.10E-05
46.29166646,-62.62500314,1/4/1950 12:00,7.965271,6.10E-05,6.10E-05,6.5919495,1.9684753,6.10E-05,6.10E-05,1.4191589,6.10E-05,0.70196533,3.9368896',header = TRUE)

I want my final output dataframe arrainged like this:

lat | lon | Value

Where Value = Maximum value for each year.

Original Q&A

There are 2 best solutions below

r2evans On 28 March 2024 at 23:35 BEST ANSWER

Perhaps this:

df |>
  mutate(
    Date = as.POSIXct(Date, format = "%m/%d/%Y %H:%M"),
    Year = format(Date, format = "%Y")
  ) |>
  summarize(
    .by = c("lon", "lat", "Year"),
    Value = max(rowSums(pick(matches("historical"))))
  )
#       lon      lat Year    Value
# 1 -62.625 46.29167 1950 30.09146

This assumes that your lon and lat are relatively invariant, though even with that sometimes grouping by floating point numbers may be problematic (see Why are these numbers not equal? and https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f). I suggest that if the rest of your data has any variance in lon and lat for each year, then there are likely different things you should be doing to aggregate your data.

The use of .by= requires dplyr_1.1.0 or newer; if you have an older version, change from summarize(.by=c(..), stuff) to group_by(..) |> summarize(stuff) |> ungroup().

Jon Spring On 29 March 2024 at 00:08

library(tidyverse) # presumes tidyverse 2.0+ so includes dplyr 1.1.0+ for ".by"
                   # and lubridate included with tidyverse load
df |>
  pivot_longer(contains("historical")) |>
  mutate(yr = year(mdy_hm(Date))) |>
  summarize(value = max(value), .by = c(lat, lon, yr))

Result

# A tibble: 1 × 4
    lat   lon    yr value
  <dbl> <dbl> <dbl> <dbl>
1  46.3 -62.6  1950  12.7

I want to summarize a huge data frame in R in such a way that I only need unique "lat", "lon", "Date (Year)" and "Maxium Value"

There are 2 best solutions below

Related Questions in R

Related Questions in DATAFRAME

Related Questions in DPLYR

Related Questions in AGGREGATE

Related Questions in DATA-MANIPULATION

Trending Questions

Popular # Hahtags

Popular Questions