Conditional RegEx to match prefix (and or) suffix but not a word with neither


To save anyone from spending time on an alternative solution: I have to use regular expressions for this task.

I am trying to write a regular expression to match a base word that has the prefix "<" and/or the suffix ">", but NOT to match if the base word has neither prefix nor suffix.

This is not a simple case of matching either a "<" or a ">" as this character may change or be part of a group.

Example.

For this example, the group of base words is (base|text|word); in real life this list could be quite long.

Out of these candidates in the input text file...

text
<text
text>
<text>

...I want to match the following...

<text
text>
<text>

...but NOT match...

text

In plain English, my RegEx should match any of the base words prefixed with a "<" and/or suffixed with a ">", but should not match a base word that has neither prefix nor suffix.

As mentioned above it is not a case of matching a literal "<" or a ">" as these characters may be different or part of a group.

Out of all the attempts I have made I cannot get this to work without catching the base word if it appears alone without a prefix or suffix.

As I became increasingly flustered while working on this problem I failed to retain all my previous attempts. My efforts will be of little value to anyone here as they all failed and when I ran out of ideas I ended up guessing.

The following are some examples.

(text) = This will catch "text"

(\<)(text) = This will catch "<text"

(text)(\>) = This will catch "text>"

(\<)(text)(\>) = This will catch "<text>"

(\<|)(text)(|\>) = This is the closest, as it will catch "<text", "text>" and "<text>", but it will also catch "text".
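
To illustrate why the last attempt also catches the bare word, here is a minimal sketch in Python. Python's `re` module is only a stand-in (the engine behind the custom executable is unknown), the suffix is assumed to be a literal ">", and the backslashes are dropped because "<" and ">" need no escaping there. Both groups may match the empty string, so nothing forces a prefix or suffix to be present.

```python
import re

# The "closest" attempt: prefix and suffix are each allowed to be empty.
BOTH_OPTIONAL = re.compile(r"(<|)(text)(|>)")

candidates = ["text", "<text", "text>", "<text>"]

# fullmatch requires the whole candidate to be consumed by the pattern.
matched = [s for s in candidates if BOTH_OPTIONAL.fullmatch(s)]
print(matched)  # ['text', '<text', 'text>', '<text>'] (bare "text" slips through)
```

The pattern has no way to say "at least one of the two must be non-empty", which is the missing constraint.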

I have also experimented with look-ahead and look-behind, but I was not able to look behind and jump over the base word to see if there was a prefix.

The only workaround is to use two regexes. The first looks for (\<)(text) and the second looks for (text)(\>). However, this means running the search twice, which is inefficient, and I really want to solve this problem with a single expression.

I have been provided with a standalone custom executable (Windows) to run these regexes. I have no idea what RegEx engine it uses, but common RegEx commands seem to work OK.

Thank you and any help would be gratefully received.

Wiktor Stribiżew (best answer)

You can use

(<)?text(?(1)>?|>)

See the regex demo.

Details:

  • (<)? - Group 1 (optional): matches a < optionally
  • text - matches a text string
  • (?(1)>?|>) - a conditional construct: if Group 1 matched, an optional > char is matched; otherwise, a > must be matched.

If you need to use word boundaries, use them like in

(<)?\btext\b(?(1)>?|>)
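
As a rough check of the conditional pattern, here is a minimal sketch using Python's `re` module, which supports the `(?(1)yes|no)` construct. Python is only a stand-in, since the engine behind the asker's executable is unknown and may not support conditionals. The word list `WORDS` and the helper `find_marked` are hypothetical names introduced for illustration.

```python
import re

# Hypothetical base-word list; the real one could be much longer.
WORDS = ["base", "text", "word"]
alternation = "|".join(re.escape(w) for w in WORDS)

# Group 1 captures an optional "<". If it took part in the match,
# the trailing ">" is optional; otherwise the trailing ">" is required.
PATTERN = re.compile(rf"(<)?\b(?:{alternation})\b(?(1)>?|>)")

def find_marked(text):
    """Return every base word that carries a prefix and/or a suffix."""
    return [m.group(0) for m in PATTERN.finditer(text)]

print(find_marked("text <text text> <text> base word>"))
# ['<text', 'text>', '<text>', 'word>']
```

The bare "text" and "base" are skipped because, without a captured "<", the conditional takes the branch that demands a ">".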

Barmar

Use two alternatives:

<text|text>

This will match <text or text>. It will also match <text> because that string contains <text.

This assumes you're just testing whether the string contains a match, not that you're trying to return the matched part. In the latter case, add the other bracket optionally to one of the alternatives:

<text>?|text>

The first alternative matches <text or <text>, the second alternative matches text>.
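
A quick sketch of this second form, again using Python's `re` module as a stand-in for the unknown engine. It uses only alternation and an optional character, so it does not depend on conditional support.

```python
import re

# First alternative: "<text" with an optional ">"; second: "text>".
ALTERNATION = re.compile(r"<text>?|text>")

print(ALTERNATION.findall("text <text text> <text>"))
# ['<text', 'text>', '<text>']
```

The bare "text" at the start is not returned, since neither alternative can match without a bracket on at least one side.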

stackunderflow

My question has been answered.

This RegEx (\<)?text(?(1)\>?|\>) by Wiktor Stribiżew works perfectly.

Thank you all.