I have a regex pattern which represents a valid variable name in a language I'm trying to parse:
R'\b([A-Z][A-Z0-9_]{0,35})\b' (e.g. VAR_NAME, TABLE_READ, SOME_OTHER_VAR, etc.)
However, I don't want to capture this pattern if it's:
- From a list of keywords: OR, AND, IF, THEN, ELSE, READ_GENERIC_TABLE
- Preceded by a semicolon (;) because that starts an in-line comment in this language
- Immediately surrounded by double quotes (e.g. VAR_NAME is valid but "VAR_NAME" is not)
So far, using a third-party regex module (https://github.com/mrabarnett/mrab-regex v2023.8.8), I've been able to come up with the following:
R'(?!\bOR\b|\bAND\b|\bIF\b|\bTHEN\b|\bELSE\b|\bREAD_GENERIC_TABLE\b)(?<!;.*)\b([A-Z][A-Z0-9_]{0,35})\b'
But I can't quite figure out how to handle the double quotes part (#3). Would anyone be able to provide me with some direction?
I've been running against the following text as a test:
test_string = """ IF t = 0 OR t > POL_TERM_M THEN 0 ELSE IF mult(t+11,12) AND POLICY_YEAR <= TBL_VAL_INT_Y THEN READ_GENERIC_TABLE(TBL_VAL_INT, "Y", ENTRY_YEAR, MVA_OPT, "VAR_NAME", POLICY_YEAR(t)) ELSE VALINT_EL_PC(t-1) """
Which should capture the set of variable names:
POL_TERM_M, POLICY_YEAR, TBL_VAL_INT_Y, TBL_VAL_INT, ENTRY_YEAR, MVA_OPT and VALINT_EL_PC
I tried the following expression but it then started capturing the keywords:
R'(?!\bOR\b|\bAND\b|\bIF\b|\bTHEN\b|\bELSE\b|\bREAD_GENERIC_TABLE\b)(?<!;.*)[^"]\b([A-Z][A-Z0-9_]{0,35})\b[^"]'
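Your [^"] attempt captures the keywords because [^"] consumes the character in front of the name, so the keyword lookahead is tested one position too early (at the space, not at the start of the name). Instead of consuming the quotes, you can exclude them with negative lookaround assertions: a lookbehind (?<!") before the name and a lookahead (?!") after it. I tested this against your sample text and it returns the expected names.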
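Here is a minimal sketch of that idea. It is not the only way to do it, and it makes a few assumptions: it uses the standard-library re module rather than the regex module, so instead of the variable-length lookbehind (?<!;.*) it simply cuts each line at the first semicolon before matching. The names KEYWORDS, NAME_RE and find_variable_names are made up for this example. It also assumes a semicolon never appears inside a quoted string, and it rejects a name that has a double quote on either side, which is slightly stricter than "surrounded".

```python
import re

# Keywords that must never be reported as variable names.
KEYWORDS = ("OR", "AND", "IF", "THEN", "ELSE", "READ_GENERIC_TABLE")

NAME_RE = re.compile(
    r'(?<!")\b'                                # not immediately preceded by a double quote
    r'(?!(?:' + "|".join(KEYWORDS) + r')\b)'   # the whole word is not a keyword
    r'([A-Z][A-Z0-9_]{0,35})\b'                # the variable name itself
    r'(?!")'                                   # not immediately followed by a double quote
)


def find_variable_names(source):
    """Return the variable names in source, in order of first appearance."""
    names = []
    for line in source.splitlines():
        # Everything after the first ';' is an in-line comment (assumption:
        # ';' never occurs inside a quoted string).
        code = line.split(";", 1)[0]
        for name in NAME_RE.findall(code):
            if name not in names:
                names.append(name)
    return names


test_string = """
IF t = 0 OR t > POL_TERM_M THEN 0
ELSE IF mult(t+11,12) AND POLICY_YEAR <= TBL_VAL_INT_Y THEN
READ_GENERIC_TABLE(TBL_VAL_INT, "Y", ENTRY_YEAR, MVA_OPT, "VAR_NAME", POLICY_YEAR(t))
ELSE VALINT_EL_PC(t-1)
"""

print(find_variable_names(test_string))
```

If you would rather stay with the regex module, the same two quote assertions should drop into your existing pattern next to the (?<!;.*) lookbehind, with the keyword lookahead left in front of the name.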