Extract the comma-separated value matching a pattern and append it as a new column in R

I have the following dataframe:

df=read.table(text="A     
hwqnewqn,ENS.1kmsdf,jmewhqjwjenj
jjqweq3w,eqwejnqwe,ENS.gkhkgfdlsl.jkmwejre
ENSAAAAAAAAAAAA,bbbbbbbb,cccccccc", header=TRUE)

Let's say I need a new column B containing only the comma-separated string that matches ENS.

So the end result should be:

df=read.table(text="B     
ENS.1kmsdf
ENS.gkhkgfdlsl.jkmwejre
ENSAAAAAAAAAAAA", header=TRUE)   

Is there any approach for that?

There are 3 best solutions below

jpsmith On

In base R, you could split each string on commas with strsplit, then use vapply with grepl to keep the piece that contains "ENS":

df$B <- vapply(strsplit(df$A, ","), \(x) x[grepl("ENS", x)], character(1))

#                                            A                       B
# 1           hwqnewqn,ENS.1kmsdf,jmewhqjwjenj              ENS.1kmsdf
# 2 jjqweq3w,eqwejnqwe,ENS.gkhkgfdlsl.jkmwejre ENS.gkhkgfdlsl.jkmwejre
# 3          ENSAAAAAAAAAAAA,bbbbbbbb,cccccccc         ENSAAAAAAAAAAAA
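
Note that vapply expects exactly one matching piece per row, so a row with no "ENS" piece (or with several) would raise an error. Here is a minimal sketch of a more tolerant variant, assuming you want the first match and NA when there is none. The helper name first_ens and the fourth row are hypothetical, added only for illustration:

```r
df <- data.frame(A = c("hwqnewqn,ENS.1kmsdf,jmewhqjwjenj",
                       "jjqweq3w,eqwejnqwe,ENS.gkhkgfdlsl.jkmwejre",
                       "ENSAAAAAAAAAAAA,bbbbbbbb,cccccccc",
                       "no,match,here"),   # hypothetical row without an ENS piece
                 stringsAsFactors = FALSE)

# Hypothetical helper: return the first piece containing "ENS", or NA if none does
first_ens <- function(fields) {
  hits <- fields[grepl("ENS", fields, fixed = TRUE)]
  if (length(hits) > 0) hits[1] else NA_character_
}

# Split on literal commas, then apply the helper to each row's pieces
df$B <- vapply(strsplit(df$A, ",", fixed = TRUE), first_ens, character(1))
```

The first three rows should come out as in the output above, and the extra row gets NA instead of an error.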
B. Christian Kamgang On

You can use gsub:

df$B <- gsub(".*(ENS[^,]*).*", "\\1", df$A)

#                                            A                       B
# 1           hwqnewqn,ENS.1kmsdf,jmewhqjwjenj              ENS.1kmsdf
# 2 jjqweq3w,eqwejnqwe,ENS.gkhkgfdlsl.jkmwejre ENS.gkhkgfdlsl.jkmwejre
# 3          ENSAAAAAAAAAAAA,bbbbbbbb,cccccccc         ENSAAAAAAAAAAAA
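
One caveat: when a string contains no "ENS" at all, gsub finds nothing to replace and returns the whole input unchanged. If you would rather get NA in that case, a possible sketch uses regexpr with regmatches instead. The vector x and its fourth element are assumptions made up for the example:

```r
x <- c("hwqnewqn,ENS.1kmsdf,jmewhqjwjenj",
       "jjqweq3w,eqwejnqwe,ENS.gkhkgfdlsl.jkmwejre",
       "ENSAAAAAAAAAAAA,bbbbbbbb,cccccccc",
       "no,match,here")   # hypothetical element without ENS

# regexpr gives the position of the first match per element, or -1 if none
m <- regexpr("ENS[^,]*", x)
hit <- as.vector(m) != -1L

# Start from all-NA and fill in only the elements that matched;
# regmatches() returns just the matched substrings, in order
B <- rep(NA_character_, length(x))
B[hit] <- regmatches(x, m)
```

Unlike the gsub pattern, whose leading greedy .* captures the last "ENS" in a string, regexpr picks the first one. With a single "ENS" piece per row, as in the question, both give the same result.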
r2evans On

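Your data looks like CSV, so assuming you are reading it in from a file, you can skip the first line (with the lone A) and parse the rest as comma-separated fields. Note that this example selects the second field of each row by position, not by matching ENS: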

read.table(text="A     
hwqnewqn,ENS.1kmsdf,jmewhqjwjenj
jjqweq3w,eqwejnqwe,ENS.gkhkgfdlsl.jkmwejre
ENSAAAAAAAAAAAA,bbbbbbbb,cccccccc", header=FALSE, sep=",", skip=1)[,2,drop=FALSE] |>
  setNames("B")
#            B
# 1 ENS.1kmsdf
# 2  eqwejnqwe
# 3   bbbbbbbb

Or, if you've already read it in, you can use read.csv to parse the remaining text:

read.csv(text = paste(df$A, collapse="\n"), header = FALSE)[,2,drop=FALSE] |>
  setNames("B")
#            B
# 1 ENS.1kmsdf
# 2  eqwejnqwe
# 3   bbbbbbbb

One reason to prefer read.csv or the like is quoted fields: a simpler string-split or regex might split on a comma inside the quotes and break the field apart.
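
To get the ENS field itself rather than a fixed column, the parsed fields can be searched row by row. This is only a sketch under a few assumptions: every row has the same number of fields, the first matching field is the one wanted, and the fourth (quoted) row is made up to show the case where a plain split on commas would go wrong:

```r
A <- c("hwqnewqn,ENS.1kmsdf,jmewhqjwjenj",
       "jjqweq3w,eqwejnqwe,ENS.gkhkgfdlsl.jkmwejre",
       "ENSAAAAAAAAAAAA,bbbbbbbb,cccccccc",
       "\"ENS.with,comma\",dddd,eeee")   # hypothetical quoted field with a comma inside

# Let read.csv handle the quoting; keep every column as character
fields <- read.csv(text = paste(A, collapse = "\n"), header = FALSE,
                   colClasses = "character")

# For each row, keep the first field containing "ENS" (NA if there is none)
B <- vapply(seq_len(nrow(fields)), function(i) {
  row <- unlist(fields[i, ], use.names = FALSE)
  row[grepl("ENS", row, fixed = TRUE)][1]
}, character(1))
```

Here the quoted field stays in one piece, so the last element of B is "ENS.with,comma", whereas strsplit on "," would have cut it in two.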