I want to flag the first date in every window of at least 31 days for each id in my data.
Data:
library(tidyverse)
library(lubridate)
library(tibbletime)
D1 <- tibble(id = c(12,12,12,12,12,12,10,10,10,10),
index_date=c("2019-01-01","2019-01-07","2019-01-21","2019-02-02",
"2019-02-09","2019-03-06","2019-01-05","2019-02-01","2019-02-02","2019-02-08"))
D1
# A tibble: 10 x 2
id index_date
<dbl> <chr>
1 12 2019-01-01
2 12 2019-01-07
3 12 2019-01-21
4 12 2019-02-02
5 12 2019-02-09
6 12 2019-03-06
7 10 2019-01-05
8 10 2019-02-01
9 10 2019-02-02
10 10 2019-02-08
The desired rows to flag are rows 1, 4, 6, 7, and 10; these rows represent either the first index_date for a given id or the first index_date after a 31-day skip period from the previously flagged index_date for that given id.
Code:
temp <- D1 %>%
mutate(index_date = ymd(index_date)) %>%
arrange(id, index_date) %>%
as_tbl_time(index_date) %>%
group_by(id) %>%
mutate(keyed_to_index_date =
collapse_index(index_date, period = '31 d', side = "start"),
keep = index_date == keyed_to_index_date)
temp %>% arrange(desc(id), index_date)
Result:
id index_date keyed_to_index_date keep
<dbl> <date> <date> <lgl>
1 12 2019-01-01 2019-01-01 TRUE
2 12 2019-01-07 2019-01-01 FALSE
3 12 2019-01-21 2019-01-01 FALSE
4 12 2019-02-02 2019-02-02 TRUE
5 12 2019-02-09 2019-02-02 FALSE
6 12 2019-03-06 2019-03-06 TRUE
7 10 2019-01-05 2019-01-05 TRUE
8 10 2019-02-01 2019-02-01 TRUE
9 10 2019-02-02 2019-02-01 FALSE
10 10 2019-02-08 2019-02-01 FALSE
Why does this code flag row 8 (which has an index_date less than 31 days after the previously flagged index_date for that id) and not row 10, and how do I fix this problem?
UPDATE:
Adding the option start_date = first(index_date) to collapse_index(), as suggested by @mnaR99, successfully flagged the correct rows in the original example. However, when I applied the same principle to new data, I ran into a problem:
Data:
D2 <- tibble(id = c("A","A","A","B","B","B","B","B","C","C","C"),
index_date = c("2019-03-04","2019-03-05","2019-03-06",
"2019-03-01","2019-03-02","2019-03-04","2019-03-05","2019-03-06",
"2019-03-03","2019-03-04","2019-03-05"))
D2
id index_date
<chr> <chr>
1 A 2019-03-04
2 A 2019-03-05
3 A 2019-03-06
4 B 2019-03-01
5 B 2019-03-02
6 B 2019-03-04
7 B 2019-03-05
8 B 2019-03-06
9 C 2019-03-03
10 C 2019-03-04
11 C 2019-03-05
I now want to apply a 2-day window in the same manner as I previously applied a 31-day window (that is, consecutive calendar days should not both be flagged). The desired rows to flag are Rows 1, 3, 4, 6, 8, 9, and 11, because these rows are either the first `index_date` for a particular `id` or the first after a two-day skip.
Code:
t3 <- D2 %>%
mutate(index_date = ymd(index_date)) %>%
arrange(id, index_date) %>%
as_tbl_time(index_date) %>%
group_by(id) %>%
mutate(keyed_to_index_date =
collapse_index(index_date,
period = '2 d',
side = "start",
start_date = first(index_date)),
keep = index_date == keyed_to_index_date) %>%
arrange(id, index_date)
Result:
> t3
# A time tibble: 11 x 4
# Index: index_date
# Groups: id [3]
id index_date keyed_to_index_date keep
<chr> <date> <date> <lgl>
1 A 2019-03-04 2019-03-04 TRUE
2 A 2019-03-05 2019-03-04 FALSE
3 A 2019-03-06 2019-03-06 TRUE
4 B 2019-03-01 2019-03-01 TRUE
5 B 2019-03-02 2019-03-01 FALSE
6 B 2019-03-04 2019-03-04 TRUE
7 B 2019-03-05 2019-03-05 TRUE
8 B 2019-03-06 2019-03-05 FALSE
9 C 2019-03-03 2019-03-03 TRUE
10 C 2019-03-04 2019-03-03 FALSE
11 C 2019-03-05 2019-03-05 TRUE
Row 7 is incorrectly flagged as TRUE, and Row 8 is incorrectly flagged as FALSE.
When I apply the purrr solution suggested by @tmfmnk, I get the correct result.
Code:
t4 <-
D2 %>%
group_by(id) %>%
mutate(index_date = ymd(index_date),
keep = row_number() == 1 |
accumulate(c(0, diff(index_date)), ~ if_else(.x >= 2,
.y,
.x + .y)
) >= 2
)
Result:
> t4
# A tibble: 11 x 3
# Groups: id [3]
id index_date keep
<chr> <date> <lgl>
1 A 2019-03-04 TRUE
2 A 2019-03-05 FALSE
3 A 2019-03-06 TRUE
4 B 2019-03-01 TRUE
5 B 2019-03-02 FALSE
6 B 2019-03-04 TRUE
7 B 2019-03-05 FALSE
8 B 2019-03-06 TRUE
9 C 2019-03-03 TRUE
10 C 2019-03-04 FALSE
11 C 2019-03-05 TRUE
What is wrong with the tibbletime approach in this example?
One option utilizing
dplyr,lubridateandpurrrcould be: