Get difference between column strings in R dataframe

758 Views Asked by heuristic At 24 May 2022 at 17:39

I have a basic question in R:

Consider a data frame where each column represents the set of nucleotide mutations in one of two samples, 'major' and 'minor':

library(tibble)

major <- c("T2A,C26T,G652A")
minor <- c("T2A,C26T,G652A,C725T")

df <- data.frame(major,minor)
tibble(df)

#A tibble: 1 x 2
  major          minor               
  <chr>          <chr>               
1 T2A,C26T,G652A T2A,C26T,G652A,C725T

And I want to identify the mutations present in 'minor' that aren't in 'major'.

I know that if the 'major' and 'minor' mutations were stored as vectors, I could use setdiff to get this difference. However, the data I received is stored as one long string with the mutations separated by commas, and I don't know how to transform this string column into a vector column in the data frame to get the difference (I tried without success).

Using setdiff directly on the columns:

setdiff(df$minor, df$major)
# I got
[1] "T2A,C26T,G652A,C725T"

The expected result was:

C725T

Could anyone help me?

Best,


There are 2 best solutions below

Gregor Thomas On 24 May 2022 at 18:05 BEST ANSWER

This works on a multi-row data frame, doing comparisons by row:

library(dplyr)
major <- c("T2A,C26T,G652A", "world")
minor <- c("T2A,C26T,G652A,C725T", "hello,world")

df <- data.frame(major,minor)

df %>%
  mutate(
    across(c(major, minor), strsplit, split = ",")
  ) %>%
  mutate(
    diff = mapply(setdiff, minor, major)
  )
#              major                   minor  diff
# 1 T2A, C26T, G652A T2A, C26T, G652A, C725T C725T
# 2            world            hello, world hello

Note that it does modify the major and minor columns, turning them into list columns containing character vectors within each row. You can use the .names argument to across if you need to keep the originals.
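As a rough sketch of the .names suggestion, the variant below keeps the original string columns and writes the split vectors to new list columns. The suffix "_split" and the choice to collapse each difference back into a comma-separated string are assumptions made for illustration, not part of the answer above.

```r
library(dplyr)

df <- data.frame(
  major = c("T2A,C26T,G652A", "world"),
  minor = c("T2A,C26T,G652A,C725T", "hello,world"),
  stringsAsFactors = FALSE
)

result <- df %>%
  mutate(
    # Keep major/minor as they are; store the split vectors in
    # new list columns named major_split and minor_split.
    across(
      c(major, minor),
      function(x) strsplit(x, split = ",", fixed = TRUE),
      .names = "{.col}_split"
    ),
    # Row-wise set difference, collapsed into one string per row so that
    # rows with several (or zero) differences still fit a character column.
    diff = mapply(
      function(mi, ma) paste(setdiff(mi, ma), collapse = ","),
      minor_split, major_split,
      USE.NAMES = FALSE
    )
  )

result$diff
```

Collapsing with paste means a row with no difference yields an empty string; if you would rather keep the differences as vectors, drop the paste call and pass SIMPLIFY = FALSE to mapply to get a list column.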

Angel F. Escalante On 24 May 2022 at 17:44

The easiest way to do this is to define major and minor as character vectors:

major <- c("T2A", "C26T", "G652A")

and

minor <- c("T2A", "C26T", "G652A", "C725T")

then

# Compare the vectors directly; tibble(major, minor) would fail here
# because the two vectors have different lengths (3 vs 4).
setdiff(minor, major)
#> "C725T"

If you cannot define major and minor as character vectors up front, you can use the stringr package to split the strings:

library(stringr)
library(tibble)

major <- c("T2A,C26T,G652A")
minor <- c("T2A,C26T,G652A,C725T")

df <- tibble(
  major = str_split(major, pattern = ",", simplify = TRUE), 
  minor = str_split(minor, pattern = ",", simplify = TRUE)
)

setdiff(df$minor, df$major)
#> "C725T"
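The stringr example above handles a single row. For a data frame with several rows, the same idea can be sketched in base R by splitting each column with strsplit and pairing the rows with Map. The two-row data and the diff column name are borrowed from the other answer for illustration, and collapsing the result back into a string is an assumption about the desired output.

```r
df <- data.frame(
  major = c("T2A,C26T,G652A", "world"),
  minor = c("T2A,C26T,G652A,C725T", "hello,world"),
  stringsAsFactors = FALSE
)

# Split every row's string into a character vector (one list element per row).
major_list <- strsplit(df$major, ",", fixed = TRUE)
minor_list <- strsplit(df$minor, ",", fixed = TRUE)

# Pair the rows up and take the set difference minor \ major for each pair.
diff_list <- Map(setdiff, minor_list, major_list)

# Collapse each difference into a single comma-separated string per row.
df$diff <- vapply(diff_list, paste, character(1), collapse = ",")

df$diff
```

If the raw strings may contain spaces after the commas, trimming each element with trimws before calling setdiff would avoid spurious differences.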
