How to delete repeated rows for specific values?


This is my first question here, so I apologize in advance if I'm doing it wrong. I have a dataset with 20,000 observations and a dummy variable (0, 1). I want to delete rows with repeated values, but only for the value 1. That is, if I have repeated 0s I want to keep them all, but if I have repeated 1s I want to keep only the first one. I also want to do this by group. Is it possible?

This is how my data looks now:

In this excerpt, I would like to keep all data from 1920 until 1922, drop the rows from 1923 to 1929, and keep the remaining observations.
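A reproducible version of this excerpt (the same data used in the Data section of the answers below):

```r
df <- data.frame(
  country = rep("Argentina", 16),
  year    = 1920:1935,
  event   = c(0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L)
)
```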

This is what I have tried so far, but it drops all observations after the first 1, including rows with a value of 0.

df %>%
  arrange(country, year) %>% 
  group_by(country) %>% 
  slice(if(1 %in% event) seq(match(1, event)) else row_number()) %>% 
  ungroup()

Thank you!


There are 3 best solutions below

jpsmith

In base R you could use a modified sequence(rle(...)) approach to identify consecutive instances of event, with some additional logic to meet your specific needs:

df[!(df$event == 1 & 
       sequence(rle(as.character(df$event))$lengths) > 1),]

#     country year event
# 1 Argentina 1920     0
# 2 Argentina 1921     0
# 3 Argentina 1922     1
# 4 Argentina 1930     0
# 5 Argentina 1931     0
# 6 Argentina 1932     0
# 7 Argentina 1933     0
# 8 Argentina 1934     0
# 9 Argentina 1935     0

# or the dplyr approach
library(dplyr)
df %>% filter(!(event == 1 & sequence(rle(as.character(event))$lengths) > 1))
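To see why this works, here are the intermediate values on a short toy vector (an illustration added for clarity, not part of the original answer):

```r
event <- c(0, 0, 1, 1, 1, 0)

# run-length encoding: runs of "0" (length 2), "1" (length 3), "0" (length 1)
runs <- rle(as.character(event))
runs$lengths
# [1] 2 3 1

# position of each row within its run
pos <- sequence(runs$lengths)
pos
# [1] 1 2 1 2 3 1

# drop rows where event == 1 AND the within-run position exceeds 1,
# i.e. every repeated 1 after the first of its run
event[!(event == 1 & pos > 1)]
# [1] 0 0 1 0
```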

If you wanted to do it by country, you'd likely want to use the dplyr approach with .by = country:

library(dplyr)
df %>%
  filter(!(event == 1 & sequence(rle(as.character(event))$lengths) > 1),
         .by = country)

#     country year event
# 1 Argentina 1920     0
# 2 Argentina 1921     0
# 3 Argentina 1922     1
# 4 Argentina 1930     0
# 5 Argentina 1931     0
# 6 Argentina 1932     0
# 7 Argentina 1933     0
# 8 Argentina 1934     0
# 9 Argentina 1935     0
r2evans

We can use run-length encoding via dplyr::consecutive_id(country, event). This assumes the data is sorted at least by year; consecutive_id increments an id every time the combination of variables passed to it (country, event) changes. It is similar to group_by, with the key difference that it honors splits induced by intervening rows, whereas group_by groups strictly by value, regardless of row order.
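To illustrate the difference on a toy vector (an illustration added for clarity, not part of the original answer):

```r
library(dplyr)

x <- c(0, 0, 1, 1, 0)
consecutive_id(x)
# [1] 1 1 2 2 3
# the trailing 0 starts a *new* run (id 3), whereas grouping by x alone
# would pool all three 0s into one group, ignoring the 1s between them
```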

df %>%
  arrange(year) %>%
  group_by(country, event_rle = consecutive_id(country, event)) %>%
  filter(first(event) != 1 | row_number() == 1L) %>%
  ungroup()
# # A tibble: 9 × 4
#   country    year event event_rle
#   <chr>     <int> <int>     <int>
# 1 Argentina  1920     0         1
# 2 Argentina  1921     0         1
# 3 Argentina  1922     1         2
# 4 Argentina  1930     0         3
# 5 Argentina  1931     0         3
# 6 Argentina  1932     0         3
# 7 Argentina  1933     0         3
# 8 Argentina  1934     0         3
# 9 Argentina  1935     0         3

I kept event_rle in the output for demonstration; it is no longer needed and can be dropped with select(-event_rle).


Data

df <- structure(list(
  country = rep("Argentina", 16),
  year = 1920:1935,
  event = c(0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L)
), class = "data.frame", row.names = c(NA, -16L))
TarJae

Here is another dplyr approach (data taken from @r2evans, many thanks): keep the first 1 with filter, bind the 0 rows back on with bind_rows, and arrange by year. :-)

library(dplyr)

df %>% 
  filter(event == 1 & row_number() == 1, .by=event) %>% 
  bind_rows(df %>% 
              filter(event == 0)) %>% 
  arrange(year)

 country year event
1 Argentina 1920     0
2 Argentina 1921     0
3 Argentina 1922     1
4 Argentina 1930     0
5 Argentina 1931     0
6 Argentina 1932     0
7 Argentina 1933     0
8 Argentina 1934     0
9 Argentina 1935     0