Regex includes Lookahead strings in selection

77 Views Asked by Papa Analytica At 03 September 2021 at 09:18

I'm trying to extract the degree (Mild/Moderate/Severe) of a specific type of heart dysfunction (diastolic dysfunction) from a large number of echo reports.

Here is the link to the sample excel file with 2 of those echo reports.

The lines are usually expressed like this: "Mild LV diastolic dysfunction" or "Mild diastolic dysfunction". Here, "Mild" is what I want to extract.

I wrote the following pattern:

pattern <- regex(
    "(\\b\\w+\\b)(?= (lv )?(d(i|y)astolic|distolic) d(y|i)sfunction)",
    ignore_case = FALSE
)

Now, let's look at the results (remember I want the "Mild" part not the "LV" part):

str_view_all(df$echo, pattern)

As you can see, in strings like "Mild diastolic dysfunction" the pattern correctly selects "Mild", but in "Mild LV diastolic dysfunction" it selects "LV", even though I have put the lv inside the positive lookahead, (?= (lv )?...).

Does anyone know what I am doing wrong?

There are 2 best solutions below

Wiktor Stribiżew On 03 September 2021 at 09:25 BEST ANSWER

The problem is that \w+ matches any one or more word chars, and the lookahead does not consume the chars it matches (the regex index remains where it was).

So, LV gets matched by \w+ because diastolic dysfunction comes right after it, and ( lv)? is an optional group: the lookahead still succeeds when there is no space+lv between the \w+ match and diastolic dysfunction.
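To make this concrete, here is a small sketch that runs the pattern from the question against two sample lines. The sample strings and the helper name extract_all are made up for illustration, and base R with perl = TRUE is used instead of stringr so the snippet has no package dependencies (the two engines should behave the same way for this pattern).

```r
# Hypothetical sample lines; real echo reports will differ.
reports <- c("Mild diastolic dysfunction", "Mild LV diastolic dysfunction")

# The pattern from the question, unchanged.
original <- "(\\b\\w+\\b)(?= (lv )?(d(i|y)astolic|distolic) d(y|i)sfunction)"

# Illustrative helper: return every match of `pattern` in each string.
extract_all <- function(pattern, x, ignore_case) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE, ignore.case = ignore_case))
}

# Case-insensitive: the lookahead after "Mild" succeeds, but it consumes
# nothing, so the engine goes on and matches "LV" as well.
insensitive <- extract_all(original, reports, ignore_case = TRUE)

# Case-sensitive: "(lv )?" cannot match "LV ", so only "LV" is found
# in the second string.
sensitive <- extract_all(original, reports, ignore_case = FALSE)

print(insensitive)
print(sensitive)
```

In both cases "LV" ends up among the matches for the second line, which is why it has to be excluded explicitly.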

If you do not want to match LV, add a negative lookahead to restrict what \w+ matches:

\b(?!lv\b)\w+\b(?=(?:\s+lv)?\s+d(?:[iy]a|i)stolic d[yi]sfunction)

See the regex demo

Also, note that [iy] is a better way to write (i|y).

In R, you may define it as

pattern <- regex(
    "\\b(?!lv\\b)\\w+\\b(?=(?:\\s+lv)?\\s+d(?:[iy]a|i)stolic\\s+d[yi]sfunction)",
    # case-insensitive, so that "lv" in the pattern also covers "LV" in the reports
    ignore_case = TRUE
)
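As a rough check, the same regex can be applied to a few sample lines. This is only a sketch: the report strings are invented, base R with perl = TRUE stands in for stringr, and matching is done case-insensitively so that lv in the pattern also covers "LV". Lines without a match are kept as NA so the result stays aligned with the input, which is convenient when writing it back to a data frame column.

```r
# Hypothetical sample lines, including spelling variants and a non-matching line.
reports <- c(
  "Mild diastolic dysfunction",
  "Moderate LV diastolic dysfunction",
  "Severe LV dyastolic disfunction",
  "Normal LV systolic function"
)

# Same regex as above, written as a plain string for base R.
fixed <- "\\b(?!lv\\b)\\w+\\b(?=(?:\\s+lv)?\\s+d(?:[iy]a|i)stolic\\s+d[yi]sfunction)"

# regexpr() returns the first match per string (-1 where there is none).
m <- regexpr(fixed, reports, perl = TRUE, ignore.case = TRUE)
hit <- as.integer(m) != -1L

# Fill in matches, leaving NA for reports without a degree.
degree <- rep(NA_character_, length(reports))
degree[hit] <- regmatches(reports, m)

print(degree)
```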

The fourth bird On 03 September 2021 at 09:28

Using \w+ can also match LV and the lv part is optional.

Instead of a lookahead, you can also use a capture group.

\b(?!lv)(\w+)\b (?:lv )?(?:d[iy]astolic|distolic) d[iy]sfunction
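A minimal sketch of how the capture group could be read out in R, assuming case-insensitive matching and using base R's regexec() with perl = TRUE rather than stringr (with stringr, the second column of str_match() would play the same role). The sample strings are invented for illustration.

```r
# Hypothetical sample lines; real echo reports will differ.
reports <- c(
  "Mild diastolic dysfunction",
  "Mild LV diastolic dysfunction",
  "No dysfunction noted"
)

# The capture-group pattern from this answer, escaped for an R string.
capture <- "\\b(?!lv)(\\w+)\\b (?:lv )?(?:d[iy]astolic|distolic) d[iy]sfunction"

# regexec() reports the full match plus each capture group per string.
m <- regmatches(reports, regexec(capture, reports, perl = TRUE, ignore.case = TRUE))

# Element 1 is the whole match, element 2 is group 1 (the degree).
degree <- vapply(
  m,
  function(x) if (length(x) >= 2) x[[2]] else NA_character_,
  character(1)
)

print(degree)
```

Unlike the lookahead version, the full match here spans the whole phrase, so the degree has to be taken from the group rather than from the match itself.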

regex demo
