Regex includes Lookahead strings in selection


I'm trying to extract the degree (Mild/Moderate/Severe) of a specific type of heart dysfunction (diastolic dysfunction) from a large number of echo reports.

Here is the link to the sample excel file with 2 of those echo reports.

The lines are usually expressed like this: "Mild LV diastolic dysfunction" or "Mild diastolic dysfunction". Here, "Mild" is what I want to extract.

I wrote the following pattern:

pattern <- regex("(\\b\\w+\\b)(?= (lv )?(d(i|y)astolic|distolic) d(y|i)sfunction)",
                               ignore_case = FALSE)

Now, let's look at the results (remember I want the "Mild" part not the "LV" part):

str_view_all(df$echo, pattern)

As you can see, in strings like "Mild diastolic dysfunction" the pattern correctly selects "Mild", but for "Mild LV diastolic dysfunction" it selects "LV", even though I put the lv inside the positive lookahead, (?= (lv )?...).

Does anyone know what I am doing wrong?


There are 2 best solutions below

Wiktor Stribiżew (best answer)

The problem is that \w+ matches any one or more word chars, and the lookahead does not consume the chars it matches (the regex index remains where it was).

So, LV gets matched by \w+ because diastolic dysfunction comes right after it, and (lv )? is an optional group: the lookahead still succeeds when there is no lv plus space right before diastolic dysfunction.
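To see this behaviour in isolation, here is a minimal sketch that runs the pattern from the question on two made-up lines. The vector x is a hypothetical stand-in for df$echo, and str_extract is used instead of str_view_all so the result can be inspected as plain strings.

```r
library(stringr)

# Hypothetical sample lines standing in for the echo reports (df$echo)
x <- c("Mild diastolic dysfunction",
       "Mild LV diastolic dysfunction")

# The pattern from the question, unchanged
original <- regex(
    "(\\b\\w+\\b)(?= (lv )?(d(i|y)astolic|distolic) d(y|i)sfunction)",
    ignore_case = FALSE
)

# str_extract returns the first match in each string
matched <- str_extract(x, original)
print(matched)
# "Mild" "LV"
```

In the second string nothing stops \w+ from starting at LV, and the lookahead is satisfied there because the optional group is simply skipped.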

If you do not want to match LV, add a negative lookahead to restrict what \w+ matches:

\b(?!lv\b)\w+\b(?=(?:\s+lv)?\s+d(?:[iy]a|i)stolic d[yi]sfunction)


Also, note that [iy] is a better way to write (i|y).

In R, you may define it as

pattern <- regex(
    "\\b(?!lv\\b)\\w+\\b(?=(?:\\s+lv)?\\s+d(?:[iy]a|i)stolic\\s+d[yi]sfunction)",
    ignore_case = TRUE  # the pattern is lowercase, but the reports contain "LV"
)
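
As a quick check, the sketch below applies this pattern to a few made-up lines. The vector x is hypothetical, and the sketch assumes case-insensitive matching (ignore_case = TRUE), since the pattern is written in lowercase while the reports contain "LV" and "Mild".

```r
library(stringr)

# Hypothetical sample lines, including a misspelled variant
x <- c("Mild diastolic dysfunction",
       "Mild LV diastolic dysfunction",
       "Severe lv dyastolic disfunction")

# Negative lookahead keeps \w+ from matching "lv" itself;
# case-insensitive matching is an assumption of this sketch
fixed <- regex(
    "\\b(?!lv\\b)\\w+\\b(?=(?:\\s+lv)?\\s+d(?:[iy]a|i)stolic\\s+d[yi]sfunction)",
    ignore_case = TRUE
)

degree <- str_extract(x, fixed)
print(degree)
# "Mild" "Mild" "Severe"
```

With a case-sensitive match, (?!lv\b) would not reject an uppercase "LV", so the flag matters for data like the examples in the question.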

The fourth bird

\w+ can also match LV, because the lv part in the lookahead is optional.

Instead of a lookahead, you can also use a capture group.

\b(?!lv)(\w+)\b (?:lv )?(?:d[iy]astolic|distolic) d[iy]sfunction
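
In R, one way to read the capture group is str_match from stringr, which returns a matrix with the full match in the first column and group 1 in the second. This is only a sketch: the vector x is made up, and case-insensitive matching is assumed so that "LV" is recognised.

```r
library(stringr)

# Hypothetical sample lines; the last one should not match at all
x <- c("Mild diastolic dysfunction",
       "Mild LV diastolic dysfunction",
       "Normal LV systolic function")

# Same pattern as above, escaped for an R string
capture <- regex(
    "\\b(?!lv)(\\w+)\\b (?:lv )?(?:d[iy]astolic|distolic) d[iy]sfunction",
    ignore_case = TRUE
)

m <- str_match(x, capture)
degree <- m[, 2]  # column 2 holds capture group 1
print(degree)
# "Mild" "Mild" NA
```

Unlike the lookahead version, the full match here covers the whole phrase, so the degree has to be taken from the group rather than from the match itself.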
