Data masking and when to use the pick() function


I am currently working through R for Data Science and I am confused about when to use the pick() function to deal with data masking.

For example, why does this work?

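library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Embrace each argument so it is evaluated as a column of df
grouped_mean <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>
    summarize(mean({{ mean_var }}))
}

diamonds |>
  grouped_mean(clarity, depth)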

But this does not?

library(nycflights13)  # provides the flights dataset

count_missing <- function(df, group_vars, x_var) {
  df |>
    group_by({{ group_vars }}) |>  # this will not work because group_by() uses data masking
    summarize(
    n_miss = sum(is.na({{ x_var }})),
    .groups = "drop"
  )
}

flights |> 
  count_missing(c(year, month, day), dep_time)

But then this will.

count_missing <- function(df, group_vars, x_var) {
  df |> 
    group_by(pick({{ group_vars }})) |>  # here is the difference: pick() wraps the selection
    summarize(
      n_miss = sum(is.na({{ x_var }})),
      .groups = "drop"
    )
} 

flights |> 
  count_missing(c(year, month, day), dep_time)

I am just trying to understand the concept of when to use pick() vs {{}} or both.

Answer by moodymudskipper

You don't need a wrapper function to see the problem; this already fails:

library(dplyr)
library(nycflights13)

flights |> 
  group_by(c(year, month, day)) |> 
  summarize(
    n_miss = sum(is.na(dep_time)),
    .groups = "drop"
  )
#> Error in `group_by()`:
#> ℹ In argument: `c(year, month, day)`.
#> Caused by error:
#> ! `c(year, month, day)` must be size 336776 or 1, not 1010328.

Look at how the ... argument is documented in ?group_by: it is not described as tidy-select, unlike the arguments of select() and other selecting verbs (across() and pick() among them).

group_by() works more like mutate(): you are not giving it column names to select, but values that define the groups. So c(year, month, day) is not a tidy selection here; it is an ordinary vector three times as long as your data frame (3 × 336776 = 1010328), hence the error message.
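
To make the mutate()-like behavior concrete, here is a minimal sketch on a made-up four-row tibble (the names df, late, by_late and stacked_len are invented for illustration, not taken from the question). It suggests that group_by() accepts any expression yielding one value per row, and that c() just concatenates the columns:

```r
library(dplyr)

# Hypothetical toy data, small enough to check by eye
df <- tibble(
  year  = c(2013, 2013, 2014, 2014),
  month = c(1, 2, 1, 2),
  value = c(10, NA, 30, NA)
)

# group_by() evaluates its arguments like mutate(): each expression
# must produce one value per row (or a single value)
by_late <- df |>
  group_by(late = year > 2013) |>
  summarize(n_miss = sum(is.na(value)), .groups = "drop")

# Under data masking, c(year, month) is plain concatenation, so the
# result has 2 * nrow(df) elements instead of one per row
stacked_len <- length(with(df, c(year, month)))
```

Here by_late has one row per value of late, while stacked_len is twice the number of rows, which is the same size mismatch the error message reports.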

At a lower level, if you want to understand it: pick() builds a data frame from those columns, and group_by() knows how to handle a data frame.

flights |> 
  group_by(pick(c(year, month, day))) |> 
  summarize(
    n_miss = sum(is.na(dep_time)),
    .groups = "drop"
  )
#> # A tibble: 365 × 4
#>     year month   day n_miss
#>    <int> <int> <int>  <int>
#>  1  2013     1     1      4
#>  2  2013     1     2      8
#>  3  2013     1     3     10
#>  4  2013     1     4      6
#>  5  2013     1     5      3
#>  6  2013     1     6      1
#>  7  2013     1     7      3
#>  8  2013     1     8      4
#>  9  2013     1     9      5
#> 10  2013     1    10      3
#> # ℹ 355 more rows
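
If it helps, you can inspect what pick() returns by storing it in a column. This is only a sketch on a hypothetical toy tibble (df, key, with_key and grouped are illustrative names), and it assumes dplyr 1.1.0 or later, where pick() is available:

```r
library(dplyr)

# Hypothetical toy data
df <- tibble(a = c(1, 1, 2), b = c("x", "x", "y"), v = c(NA, 5, NA))

# Inside a data-masking verb, pick() returns a tibble of the selected
# columns; naming the result keeps it as a single data-frame column
with_key <- df |> mutate(key = pick(c(a, b)))

# Left unnamed, the data frame is unpacked into its columns,
# so group_by() ends up grouping by a and b
grouped <- df |> group_by(pick(c(a, b)))
```

with_key$key is itself a data frame with columns a and b, and group_vars(grouped) should list a and b as the grouping variables.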

The behavior of group_by() is confusing, but who needs group_by() anyway? The .by argument (dplyr 1.1.0 and later) uses tidy selection:

flights |> 
  summarize(
    .by = c(year, month, day),
    n_miss = sum(is.na(dep_time))
  )
#> # A tibble: 365 × 4
#>     year month   day n_miss
#>    <int> <int> <int>  <int>
#>  1  2013     1     1      4
#>  2  2013     1     2      8
#>  3  2013     1     3     10
#>  4  2013     1     4      6
#>  5  2013     1     5      3
#>  6  2013     1     6      1
#>  7  2013     1     7      3
#>  8  2013     1     8      4
#>  9  2013     1     9      5
#> 10  2013     1    10      3
#> # ℹ 355 more rows
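
Coming back to the wrapper from the question, a rough rule of thumb: embrace with {{ }} whenever you forward a function argument, and add pick() only when a tidy selection has to go into a data-masking argument such as group_by(). Since .by is already tidy-select, a wrapper built on it should need no pick(). A sketch, assuming dplyr 1.1.0 or later (count_missing_by and toy are hypothetical names, not from the book):

```r
library(dplyr)

count_missing_by <- function(df, group_vars, x_var) {
  df |>
    summarize(
      n_miss = sum(is.na({{ x_var }})),  # data-masking argument: embrace
      .by = {{ group_vars }}             # tidy-select argument: embrace, no pick()
    )
}

# Hypothetical toy data
toy <- tibble(
  g1 = c(1, 1, 2, 2),
  g2 = c("a", "a", "b", "b"),
  x  = c(NA, 1, NA, NA)
)

res <- count_missing_by(toy, c(g1, g2), x)
```

With the real data, the call would be count_missing_by(flights, c(year, month, day), dep_time).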