`group_by` changes column index for usage in `across` in dplyr

46 Views Asked by At

Probably this is an intended behaviour, but I had not noticed it before. If you have several columns you want to change at once with dplyr's across. And you want to refer to the colums by their index - this index changes if you group the dataframe by a variable.

What I mean is the following:

lets say we have this dataframe

df = data.frame(
  group = c("a", "a", "b"),
  val1 = 1:3,
  val2 = 2:4,
  val3 = 3:5
) 

df_grouped = df %>% 
  group_by(group)

And then we want to change columns 2 to 4 (val1 to val3) we can do this:

df %>% 
  mutate(across(2:4, ~"changed"))

and the result is:

  group    val1    val2    val3
1     a changed changed changed
2     a changed changed changed
3     b changed changed changed

however, when I do the same on the grouped dataframe I get this:

Error in `mutate()`:
ℹ In argument: `across(2:4, ~"changed")`.
Caused by error in `across()`:
! Can't subset columns past the end.
ℹ Location 4 doesn't exist.
ℹ There are only 3 columns.

So I have to do this

df_grouped %>% 
  mutate(across(1:3, ~"changed"))

So as I can tell it just takes the grouping-column out. Is there any way to prevent that?

2

There are 2 best solutions below

0
r2evans On

What you're seeing is the consequence of using integers for column indices. Though that method works on ungrouped data, it's not mentioned in the ?group_by

     ...: In ‘group_by()’, variables or computations to group by.
          Computations are always done on the ungrouped data frame. To
          perform computations on the grouped data, you need to use a
          separate ‘mutate()’ step before the ‘group_by()’.
          Computations are not allowed in ‘nest_by()’. In ‘ungroup()’,
          variables to remove from the grouping.

or in the vignettes. The references to using group_by use variables or computations to do what you need.

The reason is because the grouping is assumed to be on mutable columns of the frame, and in df_group, the group column is immutable because it's the current group. As a demonstration, we can know what is the apparent number of columns in data with:

df |>
  reframe(nc = ncol(pick(everything())))
#   nc
# 1  4
df_grouped |>
  reframe(nc = ncol(pick(everything())))
# # A tibble: 2 × 2
#   group    nc
#   <chr> <int>
# 1 a         3
# 2 b         3

Because of this, your use of 2:4 is beyond the number of columns.

Ways to deal with this:

  1. use variable names, as in

    mutate(df_grouped, across(val1:val3, ~"changed"))
    
  2. ungroup before you mutate, which works here because your operation is fairly benign, but realize that you're losing the "by-group" logic of whatever you're really doing:

    ungroup(df_grouped) |>
      mutate(across(2:4, ~"changed"))
    
  3. A small bit of a hack, but if you really don't want to code in the column names, you can do

    df_grouped %>%
      mutate(across(all_of(names(.)[2:4]), ~"changed"))
    
0
G. Grothendieck On

One workaround is to duplicate the grouping column and then remove it later. This also works with |> .

df %>%
  group_by(group2 = group) %>%
  mutate(across(2:4, ~"changed")) %>%
  ungroup %>%
  select(-group2)

giving

# A tibble: 3 × 4
  group val1    val2    val3   
  <chr> <chr>   <chr>   <chr>  
1 a     changed changed changed
2 a     changed changed changed
3 b     changed changed changed