The problem I am looking to get insight on is regarding counting totals and subtotals in R in non-overlapping regular periods going backward in time.
Generating this example data frame,
library(lubridate)
set.seed(1)
dates <- ymd(parse_date("2023-12-31") - days(0:199))
colors <- c('red', 'blue')
sample_colors <- sample(colors, 200, replace = TRUE)
objects <- c('pen', 'marker', 'brush')
sample_objects <- sample(objects, 200, replace = TRUE)
test_df <- data.frame(dates, sample_colors, sample_objects)
I end up with something like this:
> head(test_df)
dates sample_colors sample_objects
1 2023-06-15 red pen
2 2023-06-16 red marker
3 2023-06-17 blue pen
4 2023-06-18 red pen
5 2023-06-19 blue pen
6 2023-06-20 blue brush
Say I want to count the number of the least common color-object grouping every 60 days (e.g. blue brush, red pen, etc.) going backwards. I'll start without the requirement to go backwards and turn to the slider package. I would proceed with
period_count <- function(data) {
summarise(data,
earliest_day_of_period = min(floor_date(dates, "day")),
latest_day_of_period = max(floor_date(dates, "day")),
sub_total = n(),
.by = c(sample_colors, sample_objects)
) %>%
summarise(
earliest_day_of_period = first(earliest_day_of_period),
latest_day_of_period = first(latest_day_of_period),
day_span = first(latest_day_of_period) - first(earliest_day_of_period),
min_object_n = min(sub_total)
)
}
test_df <- test_df %>% arrange(dates)
test_df_period_counts <- slide_period_dfr(.x = test_df,
.i = test_df$dates,
.period = "day",
.f = period_count,
.every = 60,
.complete = TRUE
)
(I know across() is being deprecated by the way, so apologies for the outdated usage. Just haven't gotten experienced with the reframe() syntax yet.)
The slide_period_dfr operation gets me this,
> test_df_period_counts
earliest_day_of_period latest_day_of_period day_span min_object_n
1 2023-08-14 2023-09-15 32 days 4
2 2023-09-20 2023-11-10 51 days 6
3 2023-11-19 2024-01-10 52 days 4
4 2024-01-18 2024-02-24 37 days 6
which leaves me confused, as no period appears 60 days long, and there are two shorter partial periods bookending the middle, longer ones. Perhaps the way that I am fetching my first and final days of each period are incorrect.
But, assuming the dates presented above are correct, what I want to do is to instead start complete period formation from the last day going backward, such that the partial period/window is in the beginning rather than at the end. I want a full 60 day period going backward from "2024-02-29", and so forth, until a full period is no longer possible, and then it should be discarded. Also, even though I have set the .complete argument in slide_period_dfr to TRUE, operations are still happening on partial-sized periods (which is all of them!). If anyone could let me know what I am doing wrong to compel calculations to still occur for these partial windows, please let me know.
So, in summary, is it possible to do exact period creation with the slider package in the backward direction, or will I need to go with a different package and implementation method? Thank you for your help.
I think there are two things going on here:
everyvalue is > 1period_count()function has some errors. If yougroup_by(sample_colors, sample_objects)when calculating the date span for entire period, you'll get different dates for each group, and the apparent length of the period will be calculated from the min and max dates for the object with the fewest appearances, rather than all the objects in the period.But we can simplify the
period_count()function somewhat:Together with supplying the
.originthis produces non-overlapping periods starting from the end: