REGEX: Select KeyWord1 if KeyWord2 is in the same string


I am trying to capture KEYWORD1 with the .NET regex engine, but only when KeyWord2 is also present in the string. So far, the positive lookahead solution I am using:

(?=.*KeyWord2)KEYWORD1 (flags: m, i)


only captures KEYWORD1 when KeyWord2 appears somewhere after it in the string. How can I change the regex so that it captures every instance of KEYWORD1 in the string, regardless of whether KeyWord2 appears before it, after it, or both?
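The behaviour can be reproduced with a short sketch. Python's built-in re module is used here purely for illustration (the question targets .NET, but a plain lookahead behaves the same way in both), and the sample sentence is an assumption.

```python
import re

# Illustrative sample: one KEYWORD1 before KeyWord2 and two after it.
sample = "The KEYWORD1 before KeyWord2 then KEYWORD1 and KEYWORD1 again"

# The lookahead-only pattern from the question, with the i and m flags.
lookahead_only = re.compile(r"(?=.*KeyWord2)KEYWORD1", re.IGNORECASE | re.MULTILINE)

# Start offsets of the matches the lookahead-only pattern finds.
starts = [m.start() for m in lookahead_only.finditer(sample)]

# Total number of KEYWORD1 occurrences, for comparison.
total = len(re.findall(r"KEYWORD1", sample, re.IGNORECASE))

print(starts, total)  # only the first of the three occurrences is found
```

The two occurrences that follow KeyWord2 are skipped because, from their positions, the lookahead no longer finds KeyWord2 further to the right.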

I'd really appreciate some insight.

Thank You

AudioBubble On BEST ANSWER

You can use the regex below for your requirement:

\bKEYWORD1\b(?:(?<=\bKeyWord2\b.*?)|(?=.*?\bKeyWord2\b))

Explanation of the above Regular Expression:

gi - The flags: g - global (find every match); i - case-insensitive. In .NET there is no g flag: Regex.Matches returns every match, and RegexOptions.IgnoreCase makes the match case-insensitive.

\b - Represents a word boundary.

(?:) - Represents a non-capturing group.

(?=.*?\bKeyWord2\b) - Represents a positive lookahead. It succeeds for every KEYWORD1 that comes before a KeyWord2, reading from left to right.

| - Represents alternation: the match succeeds if either the lookbehind or the lookahead succeeds. (The two alternatives are wrapped in the non-capturing group.)

(?<=\bKeyWord2\b.*?) - Represents an infinite-width positive lookbehind (infinite because the lazy quantifier .*? has no fixed width). It succeeds for every KEYWORD1 that comes after a KeyWord2.

You can find the above regex demo here.

NOTE - For the record, these engines support infinite lookbehind:

As far as I know, they are the only ones.
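For engines without infinite lookbehind, the alternation above amounts to a simpler condition: KeyWord2 occurs somewhere on the same line, either before or after KEYWORD1. A minimal sketch of that two-step check follows, using Python's built-in re module (which rejects variable-width lookbehind) purely for illustration. It is a substitute technique, not the regex above, and the helper name find_keyword1 and the sample text are assumptions.

```python
import re

# Both keywords as whole words, case-insensitive (mirrors \b...\b with the i flag).
KEYWORD1 = re.compile(r"\bKEYWORD1\b", re.IGNORECASE)
KEYWORD2 = re.compile(r"\bKeyWord2\b", re.IGNORECASE)


def find_keyword1(text):
    """Return (line_number, start, end) for each KEYWORD1 on a line that also contains KeyWord2.

    Hypothetical helper: the line-by-line split approximates the fact that
    '.' does not cross line breaks in the single-regex version.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        # Step 1: skip lines that do not contain KeyWord2 at all.
        if not KEYWORD2.search(line):
            continue
        # Step 2: every KEYWORD1 on the line qualifies, before or after KeyWord2.
        for m in KEYWORD1.finditer(line):
            hits.append((line_no, m.start(), m.end()))
    return hits


sample = (
    "The KEYWORD1 before KeyWord2 then KEYWORD1 and KEYWORD1 again\n"
    "KEYWORD1 on a line without the second keyword"
)
result = find_keyword1(sample)
print(result)
```

In this sample, all three occurrences on the first line are reported and the occurrence on the second line is ignored, because that line does not contain KeyWord2.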

Cary Swoveland On

If one uses a regex engine that supports \G and \K, the following regular expression could be used.

^(?=.*\bKeyWord2\b)|\G.*?\K\bKEYWORD1\b

with the case-insensitive flag and, depending on requirements, the multiline flag set.

PCRE demo

With PCRE (PHP) and some other regex engines, the anchor \G matches at the end of the previous match. For the first match attempt, \G is equivalent to \A, matching at the start of the string. See this discussion for details.

\K resets the starting point of the reported match to the current position of the engine's internal string pointer. Any previously consumed characters are not included in the final match. In effect, \K causes the engine to "forget" everything matched up to that point. Details can be found here.

As shown at the link, there are four matches in the string

The KEYWORD1 before KeyWord2 then KEYWORD1 and KEYWORD1 again

They are an empty string at the beginning of the string and each of the three instances of KEYWORD1. In fact, for every string that matches, one of the matches will be an empty string at the beginning of the string. Empty matches must therefore be disregarded when making substitutions.