R - windowing data backwards from latest data in non-overlapping (not-rolling) periods and counting within periods

59 Views Asked by At

The problem I am looking to get insight on is regarding counting totals and subtotals in R in non-overlapping regular periods going backward in time.

Generating this example data frame,

library(lubridate)

set.seed(1)

dates <- ymd(parse_date("2023-12-31") - days(0:199))

colors <- c('red', 'blue')
sample_colors <- sample(colors, 200, replace = TRUE)
objects <- c('pen', 'marker', 'brush')
sample_objects <- sample(objects, 200, replace = TRUE)

test_df <- data.frame(dates, sample_colors, sample_objects)

I end up with something like this:

> head(test_df)
       dates sample_colors sample_objects
1 2023-06-15           red            pen
2 2023-06-16           red         marker
3 2023-06-17          blue            pen
4 2023-06-18           red            pen
5 2023-06-19          blue            pen
6 2023-06-20          blue          brush

Say I want to count the number of the least common color-object grouping every 60 days (e.g. blue brush, red pen, etc.) going backwards. I'll start without the requirement to go backwards and turn to the slider package. I would proceed with

period_count <- function(data) {
  summarise(data, 
            earliest_day_of_period = min(floor_date(dates, "day")),
            latest_day_of_period = max(floor_date(dates, "day")),
            sub_total = n(),
            .by = c(sample_colors, sample_objects)
  ) %>%
    summarise(
      earliest_day_of_period = first(earliest_day_of_period),
      latest_day_of_period = first(latest_day_of_period),
      day_span = first(latest_day_of_period) - first(earliest_day_of_period),
      min_object_n = min(sub_total)
    )
}

test_df <- test_df %>% arrange(dates)

test_df_period_counts <- slide_period_dfr(.x = test_df,
                                         .i = test_df$dates,
                                         .period = "day",
                                         .f = period_count,
                                         .every = 60,
                                         .complete = TRUE
                                         )

(I know across() is being deprecated by the way, so apologies for the outdated usage. Just haven't gotten experienced with the reframe() syntax yet.)

The slide_period_dfr operation gets me this,

> test_df_period_counts
  earliest_day_of_period latest_day_of_period day_span min_object_n
1             2023-08-14           2023-09-15  32 days            4
2             2023-09-20           2023-11-10  51 days            6
3             2023-11-19           2024-01-10  52 days            4
4             2024-01-18           2024-02-24  37 days            6

which leaves me confused, as no period appears 60 days long, and there are two shorter partial periods bookending the middle, longer ones. Perhaps the way that I am fetching my first and final days of each period are incorrect.

But, assuming the dates presented above are correct, what I want to do is to instead start complete period formation from the last day going backward, such that the partial period/window is in the beginning rather than at the end. I want a full 60 day period going backward from "2024-02-29", and so forth, until a full period is no longer possible, and then it should be discarded. Also, even though I have set the .complete argument in slide_period_dfr to TRUE, operations are still happening on partial-sized periods (which is all of them!). If anyone could let me know what I am doing wrong to compel calculations to still occur for these partial windows, please let me know.

So, in summary, is it possible to do exact period creation with the slider package in the backward direction, or will I need to go with a different package and implementation method? Thank you for your help.

2

There are 2 best solutions below

4
SamR On BEST ANSWER

I think there are two things going on here:

  1. You need to set an origin date. From the docs, This is generally used to define the anchor time to count from, which is relevant when the every value is ⁠> 1⁠
  2. Your function period_count() function has some errors. If you group_by(sample_colors, sample_objects) when calculating the date span for entire period, you'll get different dates for each group, and the apparent length of the period will be calculated from the min and max dates for the object with the fewest appearances, rather than all the objects in the period.

But we can simplify the period_count() function somewhat:

period_count  <- function(dat) {
    dat |>
        add_count(sample_colors, sample_objects, name = "sub_total") |>
        summarise(
            earliest_day_of_period = min(dates),
            latest_day_of_period = max(dates),
            day_span = latest_day_of_period - earliest_day_of_period,
            min_object_n = min(sub_total)
        ) 
}

Together with supplying the .origin this produces non-overlapping periods starting from the end:

slider::slide_period_dfr(
    test_df,
    .i = test_df$dates,
    .period = "day",
    .f = period_count,
    .every = 60,
    .complete = TRUE,
    .origin = max(test_df$dates) +1
)

#   earliest_day_of_period latest_day_of_period day_span min_object_n
# 1             2023-08-15           2023-09-03  19 days            1
# 2             2023-09-04           2023-11-02  59 days            5
# 3             2023-11-03           2024-01-01  59 days            6
# 4             2024-01-02           2024-03-01  59 days            6
1
zephryl On

Here’s an alternative approach that doesn’t use slider:

library(dplyr)
library(tidyr)

test_df %>%
  mutate(period = cut(
    dates,
    seq(max(dates), min(dates), by = "-60 days"),
    right = TRUE
  )) %>%
  drop_na(period) %>%
  mutate(
    earliest_day_of_period = min(dates),
    latest_day_of_period = max(dates),
    day_span = latest_day_of_period - earliest_day_of_period + 1,
    .by = period
  ) %>%
  summarize(
    min_object_n = min(table(sample_colors, sample_objects)),
    .by = c(earliest_day_of_period, latest_day_of_period, day_span)
  )
#>   earliest_day_of_period latest_day_of_period day_span min_object_n
#> 1             2023-11-02           2023-12-31  60 days            6
#> 2             2023-09-03           2023-11-01  60 days            6
#> 3             2023-07-05           2023-09-02  60 days            5

Created on 2024-03-01 with reprex v2.0.2