Regex to extract words except optional final word


I need a regex to extract building names from a list. I'm passing the text and a regex to a framework that does the parsing, so I really want to try to solve this with a regex, not code.

The building name is always all caps and preceded by "Building:" then followed by any of (a) a number, (b) the word "UNIT" in all caps, or (c) any mixed case word. Thus I want to get BUILDING ONE as the result from all of the following except the last row, which should return nothing:

Building: BUILDING ONE 15 [building name followed by unit number]
Optional preceding text Building: BUILDING ONE 15 [preceding text, then building name followed by unit number]
Building: BUILDING ONE UNIT 15 [building name followed by word UNIT and unit number]
Building: BUILDING ONE Floor 2 [building name followed by mixed case word]
Grounds: OPEN SPACE WEST Section 3 [not a building - return nothing]

I feel like I know this, but I'm having a brain block. The closest I have right now is ^.*Building:\s([A-Z+\s*]*).* which for the samples above returns

BUILDING ONE
BUILDING ONE
BUILDING ONE UNIT
BUILDING ONE F

The application doing the parsing is written in Python, but as mentioned above, I'm just passing in the regex and data.
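
For reference, the results above can be reproduced with a minimal Python sketch. It assumes the framework applies the pattern with something like re.search and returns group 1 with surrounding whitespace stripped; the real framework may behave differently, and the helper name attempt_result is hypothetical. The sample strings omit the bracketed annotations.

```python
import re

# The attempted pattern from the question. [A-Z+\s*] is a character class:
# it matches ONE character that is an uppercase letter, '+', '*' or
# whitespace, so it cannot distinguish whole words such as UNIT or Floor.
ATTEMPT = re.compile(r"^.*Building:\s([A-Z+\s*]*).*")

samples = [
    "Building: BUILDING ONE 15",
    "Optional preceding text Building: BUILDING ONE 15",
    "Building: BUILDING ONE UNIT 15",
    "Building: BUILDING ONE Floor 2",
    "Grounds: OPEN SPACE WEST Section 3",
]


def attempt_result(line):
    """Return group 1 of the attempted pattern (stripped), or None."""
    m = ATTEMPT.search(line)
    return m.group(1).strip() if m else None


for line in samples:
    print(repr(attempt_result(line)))
```

The class keeps consuming uppercase letters and spaces one character at a time, which is why it runs through UNIT and takes the F of Floor.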


There are 2 best solutions below

Nick On BEST ANSWER

You could use this regex:

(?<=Building: )[A-Z]+(?: (?!UNIT\b)[A-Z]+\b)*

This matches:

  • (?<=Building: ) : positive lookbehind for Building:
  • [A-Z]+ : an uppercase word
  • (?: (?!UNIT\b)[A-Z]+\b)* : zero or more uppercase words that are not UNIT
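
As a quick check, here is a minimal Python sketch that runs this pattern over the samples from the question (without the bracketed annotations). The helper name building_name is hypothetical, and how the framework actually applies the regex is an assumption; re.search is used here.

```python
import re

# Lookbehind version: the whole match is the building name.
BUILDING = re.compile(r"(?<=Building: )[A-Z]+(?: (?!UNIT\b)[A-Z]+\b)*")


def building_name(line):
    """Return the building name found in line, or None if there is none."""
    m = BUILDING.search(line)
    return m.group(0) if m else None


samples = [
    "Building: BUILDING ONE 15",
    "Optional preceding text Building: BUILDING ONE 15",
    "Building: BUILDING ONE UNIT 15",
    "Building: BUILDING ONE Floor 2",
    "Grounds: OPEN SPACE WEST Section 3",
]

for line in samples:
    print(repr(building_name(line)))
```

The trailing \b is what rejects the F of Floor: an uppercase run only counts as a word if it ends at a word boundary.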


If you're using a flavour of regex that doesn't support lookbehinds, you could use this similar regex, which captures the building name in group 1:

\bBuilding: ([A-Z]+(?: (?!UNIT\b)[A-Z]+\b)*)
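
If the data arrives as one block of text rather than line by line, the capture-group form can be sketched with re.findall, which returns the contents of group 1 for every match when the pattern has exactly one group. The variable names and the multi-line input are assumptions for illustration only.

```python
import re

# Capture-group version: the building name is in group 1.
BUILDING_GROUP = re.compile(r"\bBuilding: ([A-Z]+(?: (?!UNIT\b)[A-Z]+\b)*)")

text = "\n".join([
    "Building: BUILDING ONE 15",
    "Optional preceding text Building: BUILDING ONE 15",
    "Building: BUILDING ONE UNIT 15",
    "Building: BUILDING ONE Floor 2",
    "Grounds: OPEN SPACE WEST Section 3",
])

# With a single capture group, findall returns a list of group-1 strings.
names = BUILDING_GROUP.findall(text)
print(names)
```

Because the words are joined by a literal space rather than \s, a match cannot run across a line break into the next row.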


Cary Swoveland On

You can match the regular expression

\bBuilding: +((?:[A-Z]+ )*[A-Z]+)(?<!\bUNIT)(?= +(?:(?:UNIT +)?\d|(?![A-Z]+\b)[a-zA-Z]+ +\d))

The building name, if there is a match, is held in capture group 1.



If the regex engine supports \K (start the match at the current location and discard all previously-consumed characters), one can write

\bBuilding: +\K(?:[A-Z]+ )*[A-Z]+(?<!\bUNIT)(?= +(?:(?:UNIT +)?\d|(?![A-Z]+\b)[a-zA-Z]+ +\d))

in which case the building name will be matched (i.e., there is no capture group).
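
Python's built-in re module does not support \K, so a Python sketch would use the capture-group form. The following is a minimal illustration run against the question's samples (without the bracketed annotations); the helper name building_name is hypothetical and the use of re.search is an assumption about how the framework applies the pattern.

```python
import re

# Capture-group form of the expression; the name is held in group 1.
BUILDING = re.compile(
    r"\bBuilding: +((?:[A-Z]+ )*[A-Z]+)(?<!\bUNIT)"
    r"(?= +(?:(?:UNIT +)?\d|(?![A-Z]+\b)[a-zA-Z]+ +\d))"
)


def building_name(line):
    """Return capture group 1 (the building name), or None if no match."""
    m = BUILDING.search(line)
    return m.group(1) if m else None


samples = [
    "Building: BUILDING ONE 15",
    "Optional preceding text Building: BUILDING ONE 15",
    "Building: BUILDING ONE UNIT 15",
    "Building: BUILDING ONE Floor 2",
    "Grounds: OPEN SPACE WEST Section 3",
]

for line in samples:
    print(repr(building_name(line)))
```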



It appears that the rule that the building name can be followed by a mixed-case word cannot be fully implemented. Consider, for example, the text

"Building: BUILDING ONE FLOOR 2"

As "FLOOR" is not mixed case, "BUILDING ONE" cannot be identified as the building name. Instead, because the building name could be followed by a unit number, we would conclude that the building name is "BUILDING ONE FLOOR". The best we can do is to say that the building name can be followed by a word that contains a lowercase letter (followed by a digit), which is what I have done.
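
The limitation can be seen in a short Python sketch that applies the capture-group form of the expression to both spellings. The variable names are hypothetical and re.search stands in for whatever the framework actually calls.

```python
import re

BUILDING = re.compile(
    r"\bBuilding: +((?:[A-Z]+ )*[A-Z]+)(?<!\bUNIT)"
    r"(?= +(?:(?:UNIT +)?\d|(?![A-Z]+\b)[a-zA-Z]+ +\d))"
)

# An all-caps FLOOR is indistinguishable from part of the building name.
upper = BUILDING.search("Building: BUILDING ONE FLOOR 2").group(1)

# A word containing a lowercase letter ends the name as intended.
mixed = BUILDING.search("Building: BUILDING ONE Floor 2").group(1)

print(repr(upper))
print(repr(mixed))
```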


The regular expression can be broken down as follows.

\bBuilding:[ ]+   # match 'Building:' preceded by a word boundary
                  # and followed by one or more spaces
(                 # begin capture group 1
  (?:[A-Z]+[ ])   # match one or more uppercase letters followed by a space
                  # in a non-capture group
  *               # match the above non-capture group zero or more times
  [A-Z]+          # match one or more uppercase letters
)                 # end capture group 1
(?<!\bUNIT)       # negative lookbehind asserts that the current location cannot
                  # be preceded by 'UNIT' preceded by a word boundary
(?=               # begin positive lookahead 
  [ ]+            # match one or more spaces
  (?:             # begin non-capture group
    (?:UNIT[ ]+)? # optionally ('?') match 'UNIT' followed by one or more spaces
    \d            # match a digit
  |               # or
    (?![A-Z]+\b)  # negative lookahead asserts that the current location is
                  # not followed by a word made up only of uppercase letters
    [a-zA-Z]+[ ]+ # match one or more letters followed by one or more spaces
    \d            # match a digit
  )               # end non-capture group
)                 # end positive lookahead

In the above I put some space characters in a character class [ ] to make them visible.
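
A breakdown like this can also be compiled directly in Python using the re.VERBOSE flag, which ignores whitespace and # comments in the pattern. In that mode every literal space has to be written as [ ] (or escaped), otherwise it is dropped. The sketch below is one way to write it; the name BUILDING_VERBOSE and the shortened comments are illustrative only.

```python
import re

BUILDING_VERBOSE = re.compile(
    r"""
    \bBuilding:[ ]+      # 'Building:' at a word boundary, then spaces
    (                    # begin capture group 1
      (?:[A-Z]+[ ])*     # zero or more uppercase words, each plus a space
      [A-Z]+             # the last uppercase word of the name
    )                    # end capture group 1
    (?<!\bUNIT)          # the name must not end with the word UNIT
    (?=                  # lookahead: what must follow the name
      [ ]+               # one or more spaces
      (?:
        (?:UNIT[ ]+)?\d  # an optional UNIT, then a digit
      |
        (?![A-Z]+\b)     # next word is not entirely uppercase
        [a-zA-Z]+[ ]+\d  # a word, spaces, then a digit
      )
    )
    """,
    re.VERBOSE,
)

m = BUILDING_VERBOSE.search("Building: BUILDING ONE UNIT 15")
name = m.group(1) if m else None
print(repr(name))
```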