Use grepl() to match two elements of a string with separators with two column values in a dataframe simultaneously

80 Views Asked by At

I have a dataframe as shown below and the desired output is shown below.


df <- data.frame(
  col1 = c("abc_1_102", "abc_1_103", "xyz_1_104")
)


selection <- data.frame(col1 =c("abc", "xyz"),col2 =c("102", "106"))

Desired output

       col1      col9
1 abc_1_102 SELECT
2 abc_1_103 NOTSELECT
3 xyz_1_104 NOTSELECT

How can we achieve this using grepl() function in R

What I have tried?

df$col2 <- ifelse(grepl(paste("^", selection$col1, "$", collapse = "|"), df$col1)&
                          grepl(paste("^", selection$col2, "$", collapse = "|"), df$col1),
                        "SELECT", "NOTSELECT")


print(df)

       col1      col2
1 abc_1_102 NOTSELECT
2 abc_1_103 NOTSELECT
3 xyz_1_104 NOTSELECT

Here, the result is incorrect as in the selection the values in column1 and column2 match the first and third element of value in row 1 of df.

4

There are 4 best solutions below

0
Mark On BEST ANSWER

A more complete example would be preferable, but from what I understand, you want to see if there are any of the rows of selection (with a separator of _1_) in df, with the column being "SELECT" when TRUE and "NOTSELECT" otherwise.

df$col9 <- ifelse(grepl(paste(selection$col1, selection$col2, sep = ".*", collapse = "|"), df$col1), "SELECT", "NOTSELECT")

Notes:

  1. If the separator is important to you, then you can replace .* with _1_
  2. Instead of "SELECT" and "NOTSELECT", use TRUE and FALSE. It's shorter, simpler, harder to mess up, and all round a better way of doing things
3
r2evans On

Assuming that _ is a safe separator of the multiple fields in df$col1, we can do this:

ifelse(
  grepl(paste(sprintf("(^|_)%s_.*_%s(_|$)", selection$col1, selection$col2),
              collapse = "|"), df$col1),
        "SELECT", "NOTSELECT")
# [1] "SELECT"    "NOTSELECT" "NOTSELECT"

The (^|_) matches either a preceding _ (assuming there is something to the left of that underscore) or the beginning of string. Similarly, (_|$) for end-of-string. I'd normally use boundaries (e.g., \\b), but underscore does not match that, so no boundaries are found.

0
Till On

You could separate() df$col1 into two columns and then use these to left_join() the selection dataframe.

library(tidyverse)

selection$select <- "SELECT"

res <-
  df |>
  separate(
    col1,
    into = c("char", "num"),
    sep = "_1_",
    remove = FALSE
  ) |>
  left_join(selection, join_by(char == col1, num == col2)) |>
  replace_na(list(select = "NOTSELECT"))

res
#>        col1 char num    select
#> 1 abc_1_102  abc 102    SELECT
#> 2 abc_1_103  abc 103 NOTSELECT
#> 3 xyz_1_104  xyz 104 NOTSELECT

# To make the result exactly as specified:
res |>
  select(col1, col9 = select)
#>        col1      col9
#> 1 abc_1_102    SELECT
#> 2 abc_1_103 NOTSELECT
#> 3 xyz_1_104 NOTSELECT
0
ThomasIsCoding On

You can try interaction with sep = ".*" to build the patterns in grepl, e.g.,

transform(
    df,
    col9 = paste0(
        ifelse(grepl(
            paste0(interaction(selection, sep = ".*"), collapse = "|"),
            col1
        ), "", "NOT"),
        "SELECT"
    )
)

which gives

       col1      col9
1 abc_1_102    SELECT
2 abc_1_103 NOTSELECT
3 xyz_1_104 NOTSELECT