How can I Prioritize Overlapping Patterns in RegEx?


I've seen several similar questions, and even posted one myself, but this one is rather specific.

Suppose a single pattern contains two alternatives that can both match the same stretch of text. My luck always seems to lean towards the regex matching with the wrong one. (I am using the .NET Regex class in C#.)

I have two types of strings that I need to break down:

01 - First Value|02 - Second Value|Blank - Ignore

And:

A - First ValueblankB - Second ValueC - Third Value

So my desired result is to match each Code to its Meaning with a single pattern string:

Code,Meaning
01,First Value
02,Second Value
Blank,Ignore
A,First Value
blank,
B,Second Value
C,Third Value

I have tried several patterns but can never seem to get it quite right. The closest I have been able to get is:

(([A-Z0-9]{1,4})[ \-–]{1,3}|([Bb]lank)[ \-–]{0,3})(([A-Z][a-z]+[.,;| ]?)+)

My breakdown:

  • [A-Z0-9]{1,4}[ \-–]{1,3} --> this matches the code: 1 to 4 uppercase letters or digits, followed by 1 to 3 characters that are each a space, a hyphen, or the en dash from HTML.

or

  • [Bb]lank[ \-–]{0,3} --> "blank" or "Blank" followed by 0 to 3 characters that are each a space, a hyphen, or the en dash from HTML.

then

  • (([A-Z][a-z]+[.,;| ]?)+) --> should match one or more capitalized words, each optionally followed by a single separator or space, so "First Value" and "Second Value" should each be matched as a whole.

The problem is that the final group also matches the "blank" in "Valueblank" in the second input string. I want to prioritize things so that "[Bb]lank" is always matched as part of the first group and NEVER as part of the second group.
I tried putting a (?![Bb]lank) negative lookahead in the final group, but it never seems to work. Any help would be appreciated.
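
To make the failure concrete, here is a minimal reproduction. It is sketched in Python's re module rather than .NET, purely for convenience: both are backtracking engines and, as far as I know, treat every construct in this pattern the same way. The names PATTERN and pairs are hypothetical helpers, not part of any library.

```python
import re

# The pattern from the question, unchanged (split over two lines for readability).
PATTERN = re.compile(
    r"(([A-Z0-9]{1,4})[ \-–]{1,3}|([Bb]lank)[ \-–]{0,3})"
    r"(([A-Z][a-z]+[.,;| ]?)+)"
)

def pairs(text):
    """Return (code, meaning) tuples; the code lands in group 2 or group 3."""
    return [(m.group(2) or m.group(3), m.group(4)) for m in PATTERN.finditer(text)]

result = pairs("A - First ValueblankB - Second ValueC - Third Value")
print(result)
# [('A', 'First Valueblank'), ('B', 'Second Value'), ('C', 'Third Value')]
```

The [a-z]+ after the capital "V" consumes "alueblank" in one go, so by the time the engine looks for the next code, "blank" has already been swallowed by the meaning group.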

Thanks

Jaeden "Sifo Dyas" al'Raec Ruiner


There are 2 best solutions below

Phil Young On BEST ANSWER

How about the following (regex101.com example):

/((?:[A-Z0-9]{1,4}|[Bb]lank)(?=\h[-–]\h)|[Bb]lank)(?:\h[-–]\h|\|)?(.*?)(?=[Bb]lank|\||[A-Z0-9]{1,4}\h[-–]\h|$)/gm

Explanation

[Bb]lank

All matches for "blank" check for a lower OR uppercase "B"

((?:[A-Z0-9]{1,4}|[Bb]lank)(?=\h[-–]\h)|[Bb]lank)

The 1st capturing group: match either the alphanumeric code or a "blank" code that is followed by " - " or " – " (positive lookahead), OR a bare "blank" code that will have no text in the 2nd capturing group.

(?:\h[-–]\h|\|)?

A separator of " - " OR " – " OR "|" which will occur zero or one times.

(.*?)

Lazily (non-greedily) match the text for the 2nd capturing group.

(?=[Bb]lank|\||[A-Z0-9]{1,4}\h[-–]\h|$)

Using a positive lookahead, look for a "blank" OR a "|" OR an alphanumeric code followed by " - " or " – " OR the end of the line (to catch the last item on the row). Whichever comes first marks where the 2nd capturing group should stop.
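
The pattern above can be checked against both input strings from the question. The sketch below uses Python's re module as a stand-in for .NET (an assumption of convenience). One adjustment is needed: \h (horizontal whitespace) is a PCRE escape which, as far as I know, neither Python's re nor .NET accepts, so it is replaced by [ \t] here. The names PATTERN and parse are hypothetical.

```python
import re

# Same structure as the regex above, with \h swapped for [ \t].
# re.MULTILINE plays the role of the /m flag, finditer the role of /g.
PATTERN = re.compile(
    r"((?:[A-Z0-9]{1,4}|[Bb]lank)(?=[ \t][-–][ \t])|[Bb]lank)"  # group 1: code
    r"(?:[ \t][-–][ \t]|\|)?"                                   # optional separator
    r"(.*?)"                                                    # group 2: meaning (lazy)
    r"(?=[Bb]lank|\||[A-Z0-9]{1,4}[ \t][-–][ \t]|$)",           # stop before next item
    re.MULTILINE,
)

def parse(text):
    """Return a list of (code, meaning) tuples found in text."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(text)]

piped = parse("01 - First Value|02 - Second Value|Blank - Ignore")
glued = parse("A - First ValueblankB - Second ValueC - Third Value")
print(piped)  # [('01', 'First Value'), ('02', 'Second Value'), ('Blank', 'Ignore')]
print(glued)  # [('A', 'First Value'), ('blank', ''), ('B', 'Second Value'), ('C', 'Third Value')]
```

The lazy (.*?) is what keeps "blank" out of the meaning: it grows one character at a time and stops the moment the lookahead sees "blank", a "|", or the next code.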

SoronelHaetir On

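A backtracking regex engine such as .NET's does not look for the longest match. At each starting position it tries the alternatives from left to right and keeps the first one that lets the overall pattern succeed. So if two alternatives can both match at the same position, the earlier alternative will be chosen, even if the later one would match more characters.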

For example, the following (silly) pattern will always match the first alternative in preference to the second: (.+)|foo
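
A short sketch of this alternation behaviour, written in Python's re module on the assumption that it orders alternatives the same way .NET does (both are backtracking engines). The variable names are only for illustration.

```python
import re

# (.+)|foo on "foo": the first alternative succeeds, so "foo" is never tried.
m = re.search(r"(.+)|foo", "foo")
first_alternative_used = m.group(1) is not None
print(first_alternative_used)  # True

# Order decides, not length: the leftmost alternative that works is kept.
short_first = re.search(r"a|ab", "ab").group(0)
long_first = re.search(r"ab|a", "ab").group(0)
print(short_first)  # a
print(long_first)   # ab
```

In practice this means the more specific alternative (here, the one for "blank") should usually be listed before the more general one.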

In your case, if you actually want to match two kinds of item, one starting with a number and one with a letter, why not do: ([0-9]+...)|([A-Za-z]....)

Match the two alternates as early as possible.