Can you collapse duplicate rows in R, keeping values based on a selection of columns?

52 Views Asked by At

I have a very large dataset with duplicate values in one column (UNIQUEID). I want to group by that column, selecting the values from rows LAT, LONG from the latest quarter (identified where rows Q4 2023, Q3 2023, Q2 2023=1) while also collapsing across those last three quarterly rows.

For example, my data looks like this:

   UNIQUEID   LAT LONG   Q4 2023   Q3 2023   Q2 2023
1: 8F5         51   -8      1       NA         NA
2: W7T         53   -6      1       NA         NA
3: 9F2         51   -8      NA      1          NA
4: 8F5         51   -9      NA      1          1
5: 9F2         58   -8      NA      1          1
6: W7T         53   -6      NA      1          NA

And the result I am looking for is:

   UNIQUEID   LAT LONG   Q4 2023   Q3 2023   Q2 2023
1: 8F5         51   -8      1       1          1
2: W7T         53   -6      1       1         NA
3: 9F2         51   -8      NA      1          1

The best I've done so far is by summarizing across everything:

df <- df %>% 
  group_by(UNIQUEID) %>% 
  summarise(across(everything(), ~ paste(unique(.x[!is.na(.x)]), collapse = ", ")))

But I am not sure if I can guarantee that taking the first value in the LAT/LONG cells will get me the data from latest quarter. I think it should be, but I want to be certain.

Alternatively, I could rerun the original table so that instead of 1 and NA, I use a descending value per quarter (ex. Q4 2023 = 20 or NA; Q3 2023 = 19 or NA)? But then I am still unsure of how to take the max of a list of columns.

I don't want to unpivot those quarter columns because the dataset now is approx. 3 millions rows and I am trying to keep it manageable.

DATA

structure(list(UNIQUEID = c("8F5", "W7T", "9F2", "8F5", "9F2", 
"W7T"), LAT = c(51L, 53L, 51L, 51L, 58L, 53L), LONG = c(-8L, 
-6L, -8L, -9L, -8L, -6L), Q4.2023 = c(1L, 1L, NA, NA, NA, NA), 
    Q3.2023 = c(NA, NA, 1L, 1L, 1L, 1L), Q2.2023 = c(NA, NA, 
    NA, 1L, 1L, NA)), class = "data.frame", row.names = c(NA, 
-6L))
2

There are 2 best solutions below

0
Jon Spring On BEST ANSWER
df |>
  group_by(UNIQUEID) |>
  tidyr::fill(Q4.2023:Q2.2023,  .direction = "downup") |>
  ungroup() |>
  slice(1, .by = UNIQUEID) 


# A tibble: 3 × 6
  UNIQUEID   LAT  LONG Q4.2023 Q3.2023 Q2.2023
  <chr>    <int> <int>   <int>   <int>   <int>
1 8F5         51    -8       1       1       1
2 W7T         53    -6       1       1      NA
3 9F2         51    -8      NA       1       1
0
Bastián Olea Herrera On

In order to correctly filter your data to get UNIQUEIDs for the latest quarter only, we have to transform your data a bit, because you have multiple columns that each contains two different datapoints (the quarter columns). This can complicate things. So first we have to transform your multiple quarter columns into two different columns: quarter and year. Then we can arrange our data according to quarter and year, create a new column to figure out which IDs are the latest, and finally filter from that new column to only leave latest rows in your dataframe.

library(dplyr)
library(tidyr)
library(stringr)

# we need to transform our data to tidy format, so that each row is equal to a single observation. This way we can filter quarters and years more easily, as each will be its own single column, instead of having multiple columns for quarter+year:
df2 <- df |> 
  pivot_longer(cols = starts_with("Q"), names_to = "quarter_year", values_to = "value") |> #transform from multiple columns for year+quarter into a single column for year+quarter that we can modify further
  filter(!is.na(value)) |> #leave only rows that correspond with a quarter
  mutate(quarter = str_extract(quarter_year, "Q\\d+"), #extract the number just after the Q character
         quarter = str_remove(quarter, "Q"), #remove the Q character
         quarter = as.integer(quarter)) |> 
  mutate(year = str_extract(quarter_year, "\\d{4}")) |> #extract numbers of four digits, i.e., years
  arrange(UNIQUEID, desc(year), desc(quarter))


# then, we create a new column that will be TRUE if the observation from each UNIQUEID is the latest, and FALSE for all others. This way we make sure that we have the latest value for each ID group:
df3 <- df2 |> 
  group_by(UNIQUEID) |> 
  mutate(latest = if_else(year == max(year, na.rm=T) & quarter == max(quarter, na.rm=T), TRUE, FALSE))

# finally, we filter:
df3 |> 
  group_by(UNIQUEID) |> 
  filter(latest == TRUE)