How to remove or drop duplicates based on two column (ID and quarter)? R Studio

59 Views Asked by At

My dataframe2 looks like this

dataframe2 = structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), 
                            Quarterly = c(2006.1, 2006.1, 2006.1, 2006.2, 2006.3, 2006.4, 2006.1, 2006.2, 2006.3, 
                                          2006.3, 2006.4, 2006.1, 2006.1, 2006.1, 2006.2, 2006.2, 2006.3, 2006.4), 
                            Status = c("Employed", "Employed", "Employed", "Employed", "Null", "Employed", "Employed",
                                       "Employed", "Employed", "Employed", "Employed", "Null", "Null", "Employed", 
                                       "Employed", "Employed", "Employed", "Employed")), 
                       class = "data.frame", row.names = c(NA, -18L))
ID   Quarterly    Status 
1    2006.1     Employed 
1    2006.1     Employed 
1    2006.1     Employed 
1    2006.2     Employed 
1    2006.3     Null 
1    2006.4     Employed 
2    2006.1     Employed 
2    2006.2     Employed 
2    2006.3     Employed 
2    2006.3     Employed 
2    2006.4     Employed 
3    2006.1     Null 
3    2006.1     Null 
3    2006.1     Employed 
3    2006.2     Employed 
3    2006.2     Employed 
3    2006.3     Employed 
3    2006.4     Employed 

I'm keen for it to look like this, so that there is only one observation per ID

ID   Quarterly    Status 
1    2006.1.    Employed 
1    2006.2     Employed 
1    2006.3     Null 
1    2006.4     Employed 
2    2006.1     Employed 
2    2006.2     Employed 
2    2006.3     Employed 
2    2006.4     Employed 
3    2006.1     Null 
3    2006.2     Employed 
3    2006.3     Employed 
3    2006.4     Employed 

I've tried a couple of options that I had found on this website, and none have worked as hoped. I did: (a) group_by(ID,Quarterly) %>% filer(n()>1) and (b) group_by(ID,Quarterly) %>% distinct(ID, keep.all = TRUE). Option (b) just removed all the Quarterly values except 2023.1 (the latest my dates go to).

2

There are 2 best solutions below

0
thomas On BEST ANSWER

The way to do it with dplyr is using summarise

result <- dataframe2  |> summarise(.by = c("ID", "Quarterly"), first(Status))

For simplicity, I took the function first to summarize the group over ID and Quaterly

0
AnilGoyal On

You may use dplyr::slice_head for the purpose of selecting one row (first in this case) per group passed in as by argument.

library(dplyr)

dataframe2 %>% 
  slice_head(n = 1, by = c(ID, Quarterly))
#>    ID Quarterly   Status
#> 1   1    2006.1 Employed
#> 2   1    2006.2 Employed
#> 3   1    2006.3     Null
#> 4   1    2006.4 Employed
#> 5   2    2006.1 Employed
#> 6   2    2006.2 Employed
#> 7   2    2006.3 Employed
#> 8   2    2006.4 Employed
#> 9   3    2006.1     Null
#> 10  3    2006.2 Employed
#> 11  3    2006.3 Employed
#> 12  3    2006.4 Employed

Created on 2024-03-05 with reprex v2.0.2