I have a basic question about R.
Suppose I have a data frame in which each column holds the set of nucleotide mutations for one of two samples, 'major' and 'minor':
major <- c("T2A,C26T,G652A")
minor <- c("T2A,C26T,G652A,C725T")
df <- data.frame(major,minor)
tibble(df)
#A tibble: 1 x 2
major minor
<chr> <chr>
1 T2A,C26T,G652A T2A,C26T,G652A,C725T
And I want to identify the mutations present in 'minor' that aren't in 'major'.
I know that if the 'major' and 'minor' mutations were stored as vectors, I could use setdiff to get this difference. However, the data I received is stored as a single long string with the mutations separated by commas, and I don't know how to transform this string column into a column of vectors in the data frame to get the difference (I tried without success).
Using setdiff directly on the columns:
setdiff(df$minor, df$major)
# I got
[1] "T2A C26T G652A C725T"
The expected result was:
C725T
Could anyone help me?
Best,
This works on a multi-row data frame, doing comparisons by row:
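A minimal sketch of that row-wise approach, assuming dplyr is installed. The second row of the example data is hypothetical, added only to show the multi-row behaviour, and the column name diff is an illustrative choice:

```r
library(dplyr)

# Example data; the second row is made up for illustration.
df <- data.frame(
  major = c("T2A,C26T,G652A", "A10G"),
  minor = c("T2A,C26T,G652A,C725T", "A10G,C50T,G77A"),
  stringsAsFactors = FALSE
)

result <- df %>%
  mutate(
    # Split each comma-separated string into a character vector,
    # so major and minor become list columns.
    across(c(major, minor), ~ strsplit(.x, ",", fixed = TRUE)),
    # Row by row: mutations present in minor but not in major.
    diff = Map(setdiff, minor, major)
  )

result$diff
```

Here result$diff is a list column with one character vector per row, so the first row should contain only C725T.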
Note that it does modify the major and minor columns, turning them into list columns containing character vectors within each row. You can use the .names argument to across() if you need to keep the originals.
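A sketch of the variant that keeps the original strings, assuming dplyr is installed. The _split suffix and the diff column name are illustrative choices, not anything required by dplyr:

```r
library(dplyr)

df <- data.frame(
  major = "T2A,C26T,G652A",
  minor = "T2A,C26T,G652A,C725T",
  stringsAsFactors = FALSE
)

kept <- df %>%
  mutate(
    # .names writes the split vectors to new columns (major_split,
    # minor_split) and leaves major and minor untouched.
    across(c(major, minor), ~ strsplit(.x, ",", fixed = TRUE),
           .names = "{.col}_split"),
    # Row-wise set difference, collapsed back into one string per row.
    diff = vapply(Map(setdiff, minor_split, major_split),
                  paste, character(1), collapse = ",")
  )

kept[, c("major", "minor", "diff")]
```

Collapsing with paste makes diff an ordinary character column, which is easier to print or export than a list column; drop that step if you want the vectors themselves.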