I just ran into some weird behavior of dplyr where summarize kept referring to objects from a previous group.
Here is a simple reproducible example to illustrate the surprising behavior:
library(dplyr, warn.conflicts = FALSE)

tibble(x = rep(letters[1:3], times = 4),
       y = rnorm(12)) %>%
  group_by(x) %>%
  summarize(z1 = sum(y),
            z2 = {
              attr(y, "test") <- "test"
              sum(y)
            })
#> # A tibble: 3 × 3
#>   x         z1    z2
#>   <chr>  <dbl> <dbl>
#> 1 a      0.602 0.602
#> 2 b      1.22  0.602
#> 3 c     -0.310 0.602
Created on 2022-10-31 by the reprex package (v2.0.1)
I expected z1 and z2 to be identical. I don't understand why setting an attribute on the vector y means that, in later iterations, the reference to the "correct" elements of y is shadowed.
The problem can easily be fixed by using sum(.data$y) in the last line, but I would like to understand the scoping rules within the non-standard evaluation of summarize. Any pointers to helpful documentation, or explanations of why the current behavior makes sense in the tidyverse non-standard evaluation framework, are appreciated.
I am using R 4.1.1 with dplyr 1.0.7.
This is a problem related to scoping. If you write to the variable y inside summarize, then the first grouping of your data's y variable is copied into a local variable called y that is distinct from the y in your data frame. Because it is a local variable, it is found on the search path before the y in the passed data frame. Since the same environment is used for subsequent groups' calculations inside summarize, this local variable persists for each group.
We can see this if we do:
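The copying step can be sketched in plain base R, outside of summarize. This is only an illustration of how a replacement call such as attr(y, "test") <- "test" creates a local variable; it is not a reproduction of dplyr's internals, and the function name f and the objects has_local_y and outer_untouched are made up for the example.

```r
# An outer `y`, standing in for the column that the data frame provides.
y <- 1:3

# Hypothetical helper: it modifies `y` through a replacement function.
f <- function() {
  # R evaluates this roughly as: y <- `attr<-`(y, "test", value = "test"),
  # so the result is assigned to a NEW local `y` in this function's frame.
  attr(y, "test") <- "test"
  # Report whether a local variable called `y` now exists in this frame.
  "y" %in% ls()
}

has_local_y <- f()

# The outer `y` itself is left without attributes.
outer_untouched <- is.null(attributes(y))
```

Here has_local_y is TRUE and outer_untouched is TRUE: the assignment created a separate local y and left the original alone. Inside summarize, the analogous local y stays behind in the shared evaluation environment, which is why later groups pick it up.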
As long as we remove the local copy of the y variable from the local frame, this doesn't happen:
Or better still, don't write to a local variable with the same name as a variable in your data frame:
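Both fixes can be sketched as follows. This is a minimal illustration, not the code from the original answer: it assumes a deterministic y = 1:12 instead of rnorm(12) so the sums are easy to check by hand, and the names df, total, new_y, res_rm and res_rename are made up for the example.

```r
library(dplyr, warn.conflicts = FALSE)

# Deterministic stand-in for the data in the question (assumption: 1:12, not rnorm).
df <- tibble(x = rep(letters[1:3], times = 4),
             y = 1:12)

# Fix 1: remove the local copy of `y` before the expression returns,
# so the next group looks up `y` in the data again.
res_rm <- df %>%
  group_by(x) %>%
  summarize(z1 = sum(y),
            z2 = {
              attr(y, "test") <- "test"
              total <- sum(y)
              rm(y)
              total
            })

# Fix 2: never assign to a name that is also a column; work on a copy instead.
res_rename <- df %>%
  group_by(x) %>%
  summarize(z1 = sum(y),
            z2 = {
              new_y <- y
              attr(new_y, "test") <- "test"
              sum(new_y)
            })
```

In both results z1 and z2 should agree for every group. The second variant is the safer habit, since it does not depend on remembering to clean up after each group.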
Created on 2022-10-31 with reprex v2.0.2