R - how to find out people whose event 1 disappeared after event 2 happened


I have a dataset in which each person has multiple visits, and Event1 and Event2 were recorded at each visit (1 = yes, 0 = no). Not everyone experienced Event1 or Event2. How do I pick out the people whose Event1 disappeared at the same visit as, or after, their Event2 occurred (Event2 only needs to occur once)?

ID <-  c(1,1,1,1,1,2,2,2,3,3,4,4,4,4,5,5,5,6,6,6,6,6,6)
Visit <- c(1,2,3,4,5,1,2,3,1,2,1,2,3,4,1,2,3,1,2,3,4,5,6)
Event1 <- c(0,1,0,0,0,1,0,0,1,1,0,1,1,0,0,0,0,0,0,0,1,0,0) 
Event2 <- c(0,1,1,1,1,0,1,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1)
df <- data.frame(ID, Visit, Event1, Event2)

In this case, for ID = 1, 2 and 4, Event1 went away at the same time as or after Event2 happened, so I want to keep these 3 people and all of their observations.

r2evans On

base R

We can use ave() for this grouped operation.

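ind <- with(df, ave(seq_len(nrow(df)), ID, FUN = function(i) {
  # i holds the row numbers of one ID; rows are assumed ordered by Visit
  any(Event1[i] > 0) &&
    # Event2 has occurred by this visit, and Event1 is absent at the next one
    any(cumsum(Event2[i] > 0) & c(!(Event1[i][-1] > 0), TRUE))
})) > 0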
ind
#  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
df[ind,]
#    ID Visit Event1 Event2
# 1   1     1      0      0
# 2   1     2      1      1
# 3   1     3      0      1
# 4   1     4      0      1
# 5   1     5      0      1
# 6   2     1      1      0
# 7   2     2      0      1
# 8   2     3      0      0
# 11  4     1      0      0
# 12  4     2      1      0
# 13  4     3      1      1
# 14  4     4      0      1

Most of the pain here is the inner logic (which is shared with the dplyr solution below), but one possibly-annoying nuance to keep in mind with ave is that the return value is always the same class as the first argument. That is, even if FUN returns logical, since the first argument is integer, the return value is cast/coerced to integer.
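The coercion is easy to see on a toy vector. This is only a minimal sketch: `x` and `g` are made-up names for illustration and are not part of the question's data.

```r
x <- c(1L, 2L, 3L, 4L)        # integer "data" (hypothetical)
g <- c("a", "a", "b", "b")    # grouping vector (hypothetical)

# FUN returns a single logical per group ...
res <- ave(x, g, FUN = function(v) any(v > 3L))

# ... but the result keeps the class of x
class(res)
# [1] "integer"

# comparing against 0 turns it back into a logical index
res > 0
# [1] FALSE FALSE  TRUE  TRUE
```

That is why the ave call above ends in `> 0` before being used to subset the frame.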

Normally ave has some form of "data" as its first argument, but since we need to reference two vectors (two columns in the frame), we need to instead use the row number so that we can reference more than one column inside the FUNction.
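As a rough illustration of the row-number pattern in isolation, here is a sketch on a small made-up frame; `d`, `g`, `a` and `b` are hypothetical names, and the condition is deliberately simpler than the one needed for the question.

```r
d <- data.frame(
  g = c(1, 1, 2, 2),   # group id (hypothetical)
  a = c(0, 1, 0, 0),
  b = c(1, 1, 0, 1)
)

# Pass row numbers as the "data"; inside FUN, i is the set of row
# numbers for one group, so any column can be indexed with it.
both <- ave(seq_len(nrow(d)), d$g, FUN = function(i) {
  any(d$a[i] > 0) && any(d$b[i] > 0)
}) > 0

both
# [1]  TRUE  TRUE FALSE FALSE
```

Group 1 has a positive value in both columns, group 2 never has one in `a`, so only the rows of group 1 are flagged.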

dplyr

This perhaps reads a little more easily.

library(dplyr)
df |>
  arrange(ID, Visit) |>
  filter(
    any(Event1 > 0) &&
      any(cumany(Event2 > 0) & lead(!Event1 > 0, default=TRUE)),
    .by = ID)
#    ID Visit Event1 Event2
# 1   1     1      0      0
# 2   1     2      1      1
# 3   1     3      0      1
# 4   1     4      0      1
# 5   1     5      0      1
# 6   2     1      1      0
# 7   2     2      0      1
# 8   2     3      0      0
# 9   4     1      0      0
# 10  4     2      1      0
# 11  4     3      1      1
# 12  4     4      0      1

Note that .by= requires dplyr 1.1.0 or newer; with an older version, use group_by() instead:

df |>
  arrange(ID, Visit) |>
  group_by(ID) |>
  filter(
    any(Event1 > 0) &&
      any(cumany(Event2 > 0) & lead(!Event1 > 0, default=TRUE))
  ) |>
  ungroup()

(same results)
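To see what the condition shared by both solutions is doing, it can help to evaluate its pieces by hand for a single person. The sketch below uses the ID 4 rows from the question, in base R; the names `e1`, `e2`, `seen2` and `next_clear` are introduced here purely for illustration.

```r
e1 <- c(0, 1, 1, 0)   # Event1 for ID 4, ordered by Visit
e2 <- c(0, 0, 1, 1)   # Event2 for ID 4, ordered by Visit

# TRUE from the first visit where Event2 occurs onward
seen2 <- cumsum(e2 > 0) > 0
seen2
# [1] FALSE FALSE  TRUE  TRUE

# TRUE where Event1 is absent at the following visit; the last visit
# has no successor, so it is treated as clear (the TRUE default)
next_clear <- c(!(e1[-1] > 0), TRUE)
next_clear
# [1] FALSE FALSE  TRUE  TRUE

# The person is kept if Event1 ever occurred and both conditions
# line up on at least one visit
any(e1 > 0) && any(seen2 & next_clear)
# [1] TRUE
```

Here `seen2` plays the role of `cumany(Event2 > 0)` and `next_clear` the role of `lead(!Event1 > 0, default=TRUE)` in the dplyr version.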


Data

df <- structure(list(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5, 5), Visit = c(1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 1, 2, 3, 4, 1, 2, 3), Event1 = c(0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0), Event2 = c(0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1)), class = "data.frame", row.names = c(NA, -17L))