I am currently working through R for Data Science and I am confused about when to use pick() to deal with data masking.
For example, why does this work?
library(tidyverse)  # dplyr verbs, plus ggplot2 for the diamonds data

grouped_mean <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>
    summarize(mean({{ mean_var }}))
}

diamonds %>%
  grouped_mean(clarity, depth)
But this does not?
library(nycflights13)  # provides the flights data

count_missing <- function(df, group_vars, x_var) {
  df |>
    group_by({{ group_vars }}) |>  # this fails: group_by() uses data masking, not tidy selection
    summarize(
      n_miss = sum(is.na({{ x_var }})),
      .groups = "drop"
    )
}

flights |>
  count_missing(c(year, month, day), dep_time)
But then this will.
count_missing <- function(df, group_vars, x_var) {
  df |>
    group_by(pick({{ group_vars }})) |>  # here is the difference
    summarize(
      n_miss = sum(is.na({{ x_var }})),
      .groups = "drop"
    )
}

flights |>
  count_missing(c(year, month, day), dep_time)
I am just trying to understand when to use pick(), {{ }}, or both.
No need for wrapper functions, this already fails:
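The failing call itself is not shown here; presumably it is group_by() applied directly to c(year, month, day). A minimal sketch of that, assuming a small made-up data frame (df is hypothetical and stands in for flights so the example runs without nycflights13):

```r
library(dplyr)

# Hypothetical stand-in for flights: four rows, same column names
df <- data.frame(
  year     = c(2013, 2013, 2013, 2013),
  month    = c(1, 1, 2, 2),
  day      = c(1, 1, 1, 2),
  dep_time = c(517, NA, 533, NA)
)

# Under data masking, c(year, month, day) is evaluated as an ordinary
# vector of length 3 * nrow(df), so it cannot serve as a grouping column.
# try() captures the resulting error instead of stopping the script.
res <- try(group_by(df, c(year, month, day)), silent = TRUE)

# TRUE if group_by() signalled an error
inherits(res, "try-error")
```

No {{ }} is involved, which suggests the problem lies in how group_by() evaluates its arguments rather than in the wrapper function.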
See in the doc ?group_by how the ... argument is documented: you don't see tidy-select as you do with select() and other selecting verbs (among them across() and pick()). group_by() works more like mutate(): you're not giving it column names to select, but values that determine a group, so c(year, month, day) is not a tidy selection here but a vector 3 times longer than your df, hence the error message.

At a lower level, if you want to understand it, pick() will create a data frame from those columns, and group_by() will know how to handle it.

The behavior of group_by() is confusing, but who needs group_by() anyway? The .by arg uses tidy selection.
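Both routes can be sketched side by side. This assumes dplyr 1.1.0 or later (needed for pick() and .by); the data frame df and the name count_missing_by are made up for illustration:

```r
library(dplyr)

# Hypothetical stand-in for flights: four rows, same column names
df <- data.frame(
  year     = c(2013, 2013, 2013, 2013),
  month    = c(1, 1, 2, 2),
  day      = c(1, 1, 1, 2),
  dep_time = c(517, NA, 533, NA)
)

# Route 1: pick() turns the tidy selection into a data frame of the
# selected columns, which group_by() then uses as grouping columns.
by_pick <- df |>
  group_by(pick(c(year, month, day))) |>
  summarize(n_miss = sum(is.na(dep_time)), .groups = "drop")

# Route 2: .by accepts a tidy selection directly, so embracing the
# argument with {{ }} is enough and pick() is not needed.
count_missing_by <- function(df, group_vars, x_var) {
  df |>
    summarize(
      n_miss = sum(is.na({{ x_var }})),
      .by = {{ group_vars }}
    )
}

by_arg <- count_missing_by(df, c(year, month, day), dep_time)
```

One difference to keep in mind: group_by() followed by summarize() sorts the result by the group keys, while .by keeps groups in order of first appearance, so the two can return rows in a different order on real data.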