Intersect a vector with a dataframe column that has multiple values in tidyverse

55 Views Asked by At

Suppose I have these variables in R:

vals <- c("b", "c")
foo <- data.frame(x=c("a|b", "b|c", "c|d", "e|f|g"))

I'd like another column in foo that has the number of items from vals, e.g.

> foo2
      x y
1   a|b 1
2   b|c 2
3   c|d 1
4 e|f|g 0

1 because "a|b" has "b", 2 because "b|c" has "b" and "c", etc.

How do I do that with tidyverse functions?

I can split x, but the intersection isn't working. A couple of failed attempts:

library(dplyr)
library(magrittr)

> foo2 <- foo %>% mutate(x1=str_split(x, "\\|"), y=intersect(vals, x1))
Error in `mutate()`:
ℹ In argument: `y = intersect(vals, x1)`.
Caused by error:
! `y` must be size 4 or 1, not 0.
> foo2 <- foo %>% mutate(x1=str_split(x, "\\|"), y=intersect(vals, x1[[1]]))
> foo2
      x      x1 y
1   a|b    a, b b
2   b|c    b, c b
3   c|d    c, d b
4 e|f|g e, f, g b
4

There are 4 best solutions below

0
Gregor Thomas On BEST ANSWER

You need to map (or lapply) your intersect to apply it separately to each row:

library(purrr)

foo |>
  mutate(
    xsplit = strsplit(x, split = "|", fixed = TRUE),
    intersect = map(xsplit, intersect, vals),
    y = lengths(intersect)
  )
#       x  xsplit intersect y
# 1   a|b    a, b         b 1
# 2   b|c    b, c      b, c 2
# 3   c|d    c, d         c 1
# 4 e|f|g e, f, g           0
0
SamR On

Counts

As we have different numbers of values in different rows of x (e.g. 2 in the first few rows, then 3), we can use tidyr::separate_longer_delim() to put these into long form, compare with y with the %in% operator and then sum by group (which is just the row number).

library(dplyr)
library(tidyr)
foo  <- foo  |>
    mutate(id = row_number())

counts  <- foo  |> 
    separate_longer_delim(x, delim = "|")  |>
    mutate(in_vals = x %in% vals)  |>
    group_by(id)  |>
    summarise(y = sum(in_vals))

left_join(foo, counts, by = "id")  |>
    select(-id)    
#       x y
# 1   a|b 1
# 2   b|c 2
# 3   c|d 1
# 4 e|f|g 0

Intersect

I wasn't sure this was part of the question, rather than an intermediate step, but if it is then you can create it with paste():

counts2  <- foo  |> 
    separate_longer_delim(x, delim = "|")  |>
    mutate(in_vals = x %in% vals)  |>
    group_by(id)  |>
    summarise(
        y = sum(in_vals),
        xy_intersect = paste(x[in_vals], collapse = ", ")
    )


left_join(foo, counts2, by = "id")  |>
    select(-id)    

#       x y xy_intersect
# 1   a|b 1            b
# 2   b|c 2         b, c
# 3   c|d 1            c
# 4 e|f|g 0             

Alternatively if you want the intersection to be a list rather than a character vector, which I think would be generally preferable, you can create a list column with xy_intersect = list(x[in_vals]).

1
Onyambu On

You could directly use str_count:

foo %>% 
  mutate(y = str_count(x, str_c(vals, collapse = '|')))

      x y
1   a|b 1
2   b|c 2
3   c|d 1
4 e|f|g 0

If the words are more than one character long, use boundaries: ie

mutate(foo, y = str_count(x, str_c("\\b", vals, "\\b", collapse = "|")))

in base R use:

transform(foo, y = rowSums(sapply(vals, grepl, x)))
      x y
1   a|b 1
2   b|c 2
3   c|d 1
4 e|f|g 0
0
Mark On

Another way!

library(tidyverse)

mutate(foo, x = str_split(x, "\\|", simplify = TRUE) %>%
        {. %in% vals} %>%
        sum(), .by = x)