I am trying to get strsplit and stri_extract_all_regex to behave consistently. The way strsplit matches is the desired behavior (in practice, the pattern is a long string of alternatives that is generated dynamically by the code).
For example
strsplit("1234567","(23)|(234)")
[[1]]
[1] "1" "567"
vs
stri_extract_all_regex("1234567","(23)|(234)")
[[1]]
[1] "23"
The desired output from extraction is
[[1]]
[1] "234"
Regex engines: PCRE vs ERE vs ICU
Base R uses Extended Regular Expressions (ERE) by default, or Perl Compatible Regular Expressions (PCRE) if specified with perl = TRUE. Conversely, the stringi docs state that it uses the ICU regex engine.
As noted in Why the order matters in this regex with alternation, the order matters since that is the order in which the regex engine will try to match. This applies to the .NET regex engine in that question, and it also applies to PCRE and ICU, but not ERE, which is why you get a different result with strsplit() (unless you set perl = TRUE).
An answer to that question makes the same point: the engine tries the alternatives in order and takes the first one that succeeds.
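To make the engine difference concrete, here is a small base R sketch using the string and pattern from the question. The variable names are just for illustration.

```r
x <- "1234567"
pattern <- "(23)|(234)"

# Default engine: the longest alternative wins, so "234" is removed
default_split <- strsplit(x, pattern)[[1]]
print(default_split)
# [1] "1"   "567"

# PCRE: the first alternative that matches wins, so only "23" is removed
pcre_split <- strsplit(x, pattern, perl = TRUE)[[1]]
print(pcre_split)
# [1] "1"    "4567"
```

The PCRE result lines up with what stri_extract_all_regex() returned in the question: both stop at "23" because it is listed first.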
So we need to switch the pattern so the preferred (i.e. longest) match is first:
pattern <- "(234)|(23)"
Arranging your pattern so the longest match is first
As your pattern is dynamically generated, you can't just hardcode the order. However, you can write a function to sort it by the longest pattern first.
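A minimal sketch of such a function is below. The name sort_alternatives is hypothetical, and it uses the number of characters in each alternative as a stand-in for match length, which is only reliable when the alternatives are literal strings like the ones in the question.

```r
# Hypothetical helper: reorder the alternatives of a pattern, longest first.
# Assumes "|" appears only as a top-level separator between alternatives.
sort_alternatives <- function(pattern) {
  # Split on a literal "|" (fixed = TRUE, so it is not treated as a regex)
  alternatives <- strsplit(pattern, "|", fixed = TRUE)[[1]]
  # Sort by descending length; ties keep their original relative order
  ordered <- alternatives[order(-nchar(alternatives))]
  paste(ordered, collapse = "|")
}

sort_alternatives("(23)|(234)")
# [1] "(234)|(23)"
```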
This assumes the only operator is "|", but it can be extended if there are others.
This will now work with any of the regex engines:
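As a quick check, the sketch below runs the longest-first pattern through all three engines. The stringi call is guarded so the snippet still runs if the package is not installed; the variable names are illustrative.

```r
x <- "1234567"
pattern <- "(234)|(23)"  # longest alternative first

# Default engine and PCRE now agree
ere_split  <- strsplit(x, pattern)[[1]]
pcre_split <- strsplit(x, pattern, perl = TRUE)[[1]]
print(ere_split)
# [1] "1"   "567"
print(pcre_split)
# [1] "1"   "567"

# ICU via stringi, run only if the package is available
if (requireNamespace("stringi", quietly = TRUE)) {
  icu_match <- stringi::stri_extract_all_regex(x, pattern)[[1]]
  print(icu_match)
  # [1] "234"
}
```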