I'm trying to extract the degree (Mild/Moderate/Severe) of a specific type of heart dysfunction (diastolic dysfunction) from a huge number of echo reports.
Here is the link to the sample excel file with 2 of those echo reports.
The lines are usually expressed like this: "Mild LV diastolic dysfunction" or "Mild diastolic dysfunction". Here, "Mild" is what I want to extract.
I wrote the following pattern:
pattern <- regex("(\\b\\w+\\b)(?= (lv )?(d(i|y)astolic|distolic) d(y|i)sfunction)",
ignore_case = FALSE)
Now, let's look at the results (remember I want the "Mild" part not the "LV" part):
str_view_all(df$echo, pattern)
As you can see, in strings like "Mild diastolic dysfunction" the pattern correctly selects "Mild", but when it comes to "Mild LV diastolic dysfunction", the pattern selects "LV" even though I have put the lv inside a positive lookahead (?= ( lv)?) construct.
Does anyone know what I am doing wrong?
The problem is that \w+ matches one or more word chars, and the lookahead does not consume the chars it matches (the regex index remains where it was).
So, LV gets matched with \w+ because diastolic dysfunction comes right after it, and ( lv)? is an optional group (there may be no space + lv right before diastolic dysfunction for \w+ to match).
If you do not want to match LV, add a negative lookahead, e.g. (?!lv\b) placed before \w+, to restrict what \w+ matches. See the regex demo.
Also, note that [iy] is a better way to write (i|y).
In R, you may define it as
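A minimal sketch of one such definition, assuming stringr is loaded and case-insensitive matching is wanted (so that lv in the pattern also covers LV). The echo vector is hypothetical sample data standing in for df$echo, and the exact pattern illustrates the negative-lookahead approach; it is not necessarily the one behind the regex demo:

```r
library(stringr)

# Hypothetical sample lines standing in for df$echo
echo <- c(
  "Mild LV diastolic dysfunction",
  "Mild diastolic dysfunction",
  "Severe lv dyastolic disfunction"
)

# \b(?!lv\b)\w+  : a whole word that is not "lv"
# (?=...)        : followed by an optional " lv", then the dysfunction phrase
# [iy] / [yi]    : character classes replace (i|y) / (y|i)
pattern <- regex(
  "\\b(?!lv\\b)\\w+(?=(?: lv)? (?:d[iy]astolic|distolic) d[yi]sfunction)",
  ignore_case = TRUE
)

degree <- str_extract(echo, pattern)
print(degree)
# [1] "Mild"   "Mild"   "Severe"
```

With ignore_case = FALSE the (?!lv\b) check would not exclude an upper-case LV, so the case-insensitive flag matters for this sketch.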