I have a column which is filled with strings containing multiple dots. I want to split this column into two containing the two substrings before and after the first dot.
I.e.
comb num
UWEA.n.49.sp 3
KYFZ.n.89.kr 5
...
Into
a b num
UWEA n.49.sp 3
KYFZ n.89.kr 5
...
I'm using the separate function from tidyr but cannot get the regexp correct. I'm trying to use the regex style from this answer:
foo %>%
  separate(comb, into=c('a', 'b'),
           sep="([^.]+)\\.(.*)")
The idea is that column a is determined by the first capture group ([^.]+), which contains at least one non-dot character, followed by the first dot, and then the second capture group (.*) matches whatever remains.
However this doesn't seem to match anything:
a b num
3
5
Here's my dummy dataset:
library(dplyr)
library(tidyr)
foo <- data.frame(comb=replicate(10,
                    paste(paste(sample(LETTERS, 4), collapse=''),
                          sample(c('p', 'n'), 1),
                          sample(1:100, 1),
                          paste(sample(letters, 2), collapse=''),
                          sep='.')
                    ),
                  num = sample(1:10, 10, replace=TRUE))
I think @aosmith's answer is great and definitely less clunky than a regex solution involving lookarounds. But since you're intent on using regex, that is possible too.
The trick here is the regex itself. It uses what is known as lookaround. Basically, for the sep parameter you are looking for a dot (.) that sits between an uppercase letter and a lowercase letter (i.e. UWEA.n). It means: match a dot preceded by a capital letter and followed by a lowercase letter.
This allows the separate function to split the comb column only on the dots that are between A and n or between Z and n, in your case. The later dots (before the number and before the final two letters) are not preceded by a capital letter, so they are left alone.
I hope this helps.
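For reference, here is a minimal sketch of the lookaround approach described above. The exact pattern is my assumption, reconstructed from the description (a dot preceded by a capital letter and followed by a lowercase letter), and the two-row foo is just the sample rows from the question rather than the random dummy dataset:

```r
library(tidyr)

# Two sample rows taken from the question (not the random dummy data)
foo <- data.frame(comb = c("UWEA.n.49.sp", "KYFZ.n.89.kr"),
                  num  = c(3, 5))

# Assumed pattern: a literal dot with an uppercase letter right before it
# (lookbehind) and a lowercase letter right after it (lookahead).
# Only the first dot in each string satisfies both conditions.
out <- separate(foo, comb, into = c("a", "b"),
                sep = "(?<=[A-Z])\\.(?=[a-z])")

print(out)
```

With these two rows, a should come out as UWEA and KYFZ and b as n.49.sp and n.89.kr. Note that sep describes the separator only, which is why the pattern in the question (which consumes the whole string) leaves nothing for a and b. If you would rather avoid lookarounds, separate also takes an extra argument, so sep = "\\." together with extra = "merge" should give the same split on the first dot.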