Python regex to match pattern if not in double quotes or equal to list of keywords


I have a regex pattern which represents a valid variable name in a language I'm trying to parse:

R'\b([A-Z][A-Z0-9_]{0,35})\b' (e.g. VAR_NAME, TABLE_READ, SOME_OTHER_VAR, etc.)

However, I don't want to capture this pattern if it's:

  1. From a list of keywords: OR, AND, IF, THEN, ELSE, READ_GENERIC_TABLE
  2. Preceded by a semicolon (;) because that starts an in-line comment in this language
  3. Immediately surrounded by double quotes (e.g. VAR_NAME is valid but "VAR_NAME" is not)

So far, using the third-party regex module (https://github.com/mrabarnett/mrab-regex v2023.8.8), I've been able to come up with the following:

R'(?!\bOR\b|\bAND\b|\bIF\b|\bTHEN\b|\bELSE\b|\bREAD_GENERIC_TABLE\b)(?<!;.*)\b([A-Z][A-Z0-9_]{0,35})\b'

But I can't quite figure out how to handle the double quotes part (#3). Would anyone be able to provide me with some direction?

I've been running against the following text as a test:

test_string = """ IF t = 0 OR t > POL_TERM_M THEN 0 ELSE IF mult(t+11,12) AND POLICY_YEAR <= TBL_VAL_INT_Y THEN READ_GENERIC_TABLE(TBL_VAL_INT, "Y", ENTRY_YEAR, MVA_OPT, "VAR_NAME", POLICY_YEAR(t)) ELSE VALINT_EL_PC(t-1) """

This should capture the set of variable names: POL_TERM_M, POLICY_YEAR, TBL_VAL_INT_Y, TBL_VAL_INT, ENTRY_YEAR, MVA_OPT and VALINT_EL_PC.

I tried the following expression but it then started capturing the keywords:

R'(?!\bOR\b|\bAND\b|\bIF\b|\bTHEN\b|\bELSE\b|\bREAD_GENERIC_TABLE\b)(?<!;.*)[^"]\b([A-Z][A-Z0-9_]{0,35})\b[^"]'
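A likely reason the attempt above starts capturing keywords: `[^"]` is not a zero-width check, it consumes a character. The keyword lookahead is therefore evaluated at that consumed character (usually a space) rather than at the first letter of the name, and the trailing `[^"]` also eats the separator that the next name would need. The following is a minimal sketch of that effect using the standard `re` module; the `(?<!;.*)` part is left out because `re` only accepts fixed-width lookbehinds, and the sample string is made up for illustration.

```python
import re

# The second attempt from the question, minus the (?<!;.*) part, which the
# stdlib re module rejects (lookbehind must be fixed-width there).
attempt = re.compile(
    r'(?!\bOR\b|\bAND\b|\bIF\b|\bTHEN\b|\bELSE\b|\bREAD_GENERIC_TABLE\b)'
    r'[^"]\b([A-Z][A-Z0-9_]{0,35})\b[^"]'
)

# Same idea with zero-width lookarounds for the quotes, and the keyword
# lookahead placed directly at the first letter of the name.
lookaround = re.compile(
    r'\b(?!OR\b|AND\b|IF\b|THEN\b|ELSE\b|READ_GENERIC_TABLE\b)'
    r'(?<!")([A-Z][A-Z0-9_]{0,35})\b(?!")'
)

sample = ' OR VAR_A '  # hypothetical input, not from the question

print(attempt.findall(sample))
print(lookaround.findall(sample))
```

On this sample the first pattern returns the keyword `OR` and misses `VAR_A`, because the space between them was consumed by the first match; the lookaround version returns only `VAR_A`.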


There is 1 best solution below

Arist12 On

You may need another negative lookbehind assertion to exclude the double-quoted names. I tested the code below with the standard re module, and it returns the expected set of names for your test string.

import re

test_string = """ IF t = 0 OR t > POL_TERM_M THEN 0 ELSE IF mult(t+11,12) AND POLICY_YEAR <= TBL_VAL_INT_Y THEN READ_GENERIC_TABLE(TBL_VAL_INT, "Y", ENTRY_YEAR, MVA_OPT, "VAR_NAME", POLICY_YEAR(t)) ELSE VALINT_EL_PC(t-1) """

pattern = r'\b(?!OR\b|AND\b|IF\b|THEN\b|ELSE\b|READ_GENERIC_TABLE\b)(?<!;)(?<!")([A-Z][A-Z0-9_]{0,35})\b'

matches = re.findall(pattern, test_string)
print(set(matches))
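Note that in the pattern above `(?<!;)` only rejects a name that directly follows a semicolon, and `(?<!")` only checks the character before the name. If comments can contain several words, or quoted strings can contain spaces, one possible alternative is to drop the comment first and let the regex consume whole quoted strings. The sketch below is only an illustration under stated assumptions: a `;` never appears inside a quoted string, strings do not span lines, and the helper name `variable_names` is hypothetical.

```python
import re

KEYWORDS = {"OR", "AND", "IF", "THEN", "ELSE", "READ_GENERIC_TABLE"}

# Either a complete double-quoted string (matched and discarded) or a
# candidate name (captured in group 1).
TOKEN = re.compile(r'"[^"\n]*"|\b([A-Z][A-Z0-9_]{0,35})\b')


def variable_names(source):
    """Return the set of variable names found in `source` (hypothetical helper)."""
    names = set()
    for line in source.splitlines():
        # Assumption: ';' starts a comment and never occurs inside a string.
        code = line.split(";", 1)[0]
        for match in TOKEN.finditer(code):
            name = match.group(1)  # None when a quoted string was matched
            if name and name not in KEYWORDS:
                names.add(name)
    return names
```

Calling `variable_names(test_string)` on the test string from the question should give the same seven names, and the keyword list stays easy to extend because it lives in a plain set rather than in the pattern.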