I have a very large dataset with duplicate values in one column (UNIQUEID). I want to group by that column, selecting the values from rows LAT, LONG from the latest quarter (identified where rows Q4 2023, Q3 2023, Q2 2023=1) while also collapsing across those last three quarterly rows.
For example, my data looks like this:
UNIQUEID LAT LONG Q4 2023 Q3 2023 Q2 2023
1: 8F5 51 -8 1 NA NA
2: W7T 53 -6 1 NA NA
3: 9F2 51 -8 NA 1 NA
4: 8F5 51 -9 NA 1 1
5: 9F2 58 -8 NA 1 1
6: W7T 53 -6 NA 1 NA
And the result I am looking for is:
UNIQUEID LAT LONG Q4 2023 Q3 2023 Q2 2023
1: 8F5 51 -8 1 1 1
2: W7T 53 -6 1 1 NA
3: 9F2 51 -8 NA 1 1
The best I've done so far is by summarizing across everything:
df <- df %>%
group_by(UNIQUEID) %>%
summarise(across(everything(), ~ paste(unique(.x[!is.na(.x)]), collapse = ", ")))
But I am not sure if I can guarantee that taking the first value in the LAT/LONG cells will get me the data from latest quarter. I think it should be, but I want to be certain.
Alternatively, I could rerun the original table so that instead of 1 and NA, I use a descending value per quarter (ex. Q4 2023 = 20 or NA; Q3 2023 = 19 or NA)? But then I am still unsure of how to take the max of a list of columns.
I don't want to unpivot those quarter columns because the dataset now is approx. 3 millions rows and I am trying to keep it manageable.
DATA
structure(list(UNIQUEID = c("8F5", "W7T", "9F2", "8F5", "9F2",
"W7T"), LAT = c(51L, 53L, 51L, 51L, 58L, 53L), LONG = c(-8L,
-6L, -8L, -9L, -8L, -6L), Q4.2023 = c(1L, 1L, NA, NA, NA, NA),
Q3.2023 = c(NA, NA, 1L, 1L, 1L, 1L), Q2.2023 = c(NA, NA,
NA, 1L, 1L, NA)), class = "data.frame", row.names = c(NA,
-6L))