I'm trying to determine which parts of a string match a specific named capture group, using stringi and R (and thus ICU regex). However, if the named capture group is the first child of an unnamed capture group, the name is lost in the output.
Here is a contrived example; the real pattern is much more complex:
library(stringi)
stri_locate_all_regex("ab", "((?<letterone>[a-z])(?<lettertwo>[a-z]))", capture_groups = TRUE)
#> [[1]]
#> start end
#> [1,] 1 2
#> attr(,"capture_groups")
#> attr(,"capture_groups")[[1]]
#> start end
#> [1,] 1 2
#>
#> attr(,"capture_groups")[[2]]
#> start end
#> [1,] 1 1
#>
#> attr(,"capture_groups")$lettertwo
#> start end
#> [1,] 2 2
Capture group 2 appears to correspond to the named capture group letterone (it matches the first letter only), but its name is lost in the output.
If it's not the first item in a capture group, it returns the expected output, even if the first item is a no-op, e.g. a{0}.
stri_locate_all_regex("ab", "(a{0}(?<letterone>[a-z])(?<lettertwo>[a-z]))", capture_groups = TRUE)
#> [[1]]
#> start end
#> [1,] 1 2
#> attr(,"capture_groups")
#> attr(,"capture_groups")[[1]]
#> start end
#> [1,] 1 2
#>
#> attr(,"capture_groups")$letterone
#> start end
#> [1,] 1 1
#>
#> attr(,"capture_groups")$lettertwo
#> start end
#> [1,] 2 2
Is there a way to extract named capture groups regardless of their position? And is this a known phenomenon I just don't know about, or a bug?
Named capture group support is tracked in gagolews/stringi issue 153 and should work as expected, assuming stringi 1.7 or more recent (Q3 2021).
As a possible workaround, I would try to avoid nesting named capture groups within unnamed groups if you are observing this kind of behavior.
For example, write the pattern as (?<letterone>[a-z])(?<lettertwo>[a-z]), with no enclosing parentheses.
This directly creates two named capture groups without nesting them within an unnamed group.
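As a minimal sketch of that un-nested variant, reusing the contrived input from the question (the variable names loc and groups are only illustrative, and I have not verified the names returned on every stringi version):

```r
library(stringi)

# Same input as in the question, but without the enclosing unnamed group.
loc <- stri_locate_all_regex(
  "ab",
  "(?<letterone>[a-z])(?<lettertwo>[a-z])",
  capture_groups = TRUE
)

# Each capture group is one element of the "capture_groups" attribute
# attached to the location matrix of the first (and only) input string.
groups <- attr(loc[[1]], "capture_groups")

# Inspect which names, if any, survived.
names(groups)
```

If both names come back, the groups can then be indexed directly, e.g. groups$letterone.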
However, if nesting is required for your actual, more complex use case, this may not be feasible.
If the names still cannot be relied on in your version of stringi (which delegates regex matching to ICU), you can instead use unnamed capture groups in your regex and post-process the results to map group numbers to names.
This involves keeping a separate mapping of group numbers to names and applying this mapping after the match is made.
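A rough sketch of that post-processing, again on the contrived input. The group name "pair" for the outer group and the hand-maintained ordering of capture_group_names are assumptions made for illustration; groups are numbered by the position of their opening parenthesis.

```r
library(stringi)

# Unnamed groups only: group 1 is the pair, groups 2 and 3 are the single letters.
pattern <- "(([a-z])([a-z]))"

# Hypothetical mapping, kept by hand, in order of each group's opening parenthesis.
capture_group_names <- c("pair", "letterone", "lettertwo")

loc <- stri_locate_all_regex("ab", pattern, capture_groups = TRUE)
groups <- attr(loc[[1]], "capture_groups")

# Guard against the pattern and the name vector drifting apart.
stopifnot(length(groups) == length(capture_group_names))

# Attach the names after the match has been made.
named_results <- setNames(groups, capture_group_names)
named_results$letterone
```

The length check is there because the mapping silently breaks if a group is added to the pattern without updating the name vector.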
This approach performs a regex match with unnamed capture groups, then extracts the capture groups and maps them to names using a separate capture_group_names vector.
The final result is stored in a named list, named_results, where each element of the list corresponds to a named capture group.
In your actual use case, you would have a much more complex pattern and a longer capture_group_names vector to match the structure of your regex.