Extracting named and non-named groups with a fuzzy regex


I'm looking for a way to apply a regex to a text and extract the captured values as a dictionary.
The groups in the regex can be named, non-named, or mixed.
Ideally, I would use fuzzy matching (i.e. allow some errors in the text).

Text example: Name: foo BaR; Age: 42
Regex Example: Name: (?<name>[a-z]+) (?<lastname>[A-Z]+); Age: (\d+)
Expected Output: {name: foo, lastname: BaR, gr0: 42}

I am also posting my own answer below.
If there is a better way, I would be happy to hear it ;)

Cheers :)


There is 1 solution below

Marc Moreaux On

So this is what I use so far.

  • 1st: modify the regex such that every group is a named-group
  • 2nd: add an error possibility to the regex (with {e<=3})
  • 3rd: use regex.search(...).capturesdict() to extract a dict of the named groups
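
For the first step, a non-named group can be recognised as an unescaped "(" that is not followed by "?". Here is a minimal sketch of that check using only the standard `re` module. The helper name `count_unnamed_groups` is made up for illustration, and the check is a heuristic rather than a full regex parser.

```python
import re

# An unescaped "(" that is not followed by "?" opens a non-named capturing group.
UNNAMED_GROUP_OPEN = r"(?<!\\)\((?!\?)"

def count_unnamed_groups(pattern):
    """Count the openings of non-named capturing groups in a pattern string."""
    return len(re.findall(UNNAMED_GROUP_OPEN, pattern))

# Only (\d+) is non-named here; the two (?<...>) groups are skipped.
print(count_unnamed_groups(r'Name: (?<name>[a-z]+) (?<lastname>[A-Z]+); Age: (\d+)'))

# Non-capturing groups and escaped parentheses are skipped as well.
print(count_unnamed_groups(r'(?:a)(b)\(c\)'))

# Known limitation: a "(" inside a character class is counted as a group.
print(count_unnamed_groups(r'[(]'))
```

Because the check only looks at the neighbouring characters, it miscounts a literal "(" inside a character class and misses a group that directly follows an escaped backslash. For patterns like the one in the question this does not matter.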

import itertools
import re
import regex  # third-party module: pip install regex

text = "Name: foo BaR; Age: 42"
pattern = r'Name: (?<name>[a-z]+) (?<lastname>[A-Z]+); Age: (\d+)'

def name_groups_in_regex(pattern):
    '''Make sure that all the groups in the regex are named groups.
    Replace each non-named group with a group named "gr<idx>".
    '''
    # Matches an unescaped "(" that is not followed by "?",
    # i.e. the opening of a non-named capturing group
    get_parenthesis_pattern = r"(?<!\\)\((?!\?)"

    # Number the groups in order of appearance. A callable replacement
    # avoids "%" formatting, which fails if the pattern contains a literal "%"
    indices = itertools.count()
    return re.sub(get_parenthesis_pattern,
                  lambda m: "(?<gr%d>" % next(indices),
                  pattern)

# Name the groups in the regex
pattern = name_groups_in_regex(pattern)

# Perform fuzzy matching with an overall maximum of 3 'errors'
fuzzy_pattern = f'({pattern}){{e<=3}}'
match = regex.search(fuzzy_pattern, text, regex.BESTMATCH)
print(match.capturesdict())

Output: {'name': ['foo'], 'lastname': ['BaR'], 'gr0': ['42']}
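
Note that capturesdict() returns a list of captures per group, whereas the question asked for plain values. A small sketch of how that could look, assuming the groups have already been named as above: groupdict() returns one string per group, and fuzzy_counts reports how many errors the match needed. The input with the typo "Nane" is a made-up example.

```python
import regex  # third-party module: pip install regex

# Pattern with every group already named, as produced by the naming step
pattern = r'Name: (?<name>[a-z]+) (?<lastname>[A-Z]+); Age: (?<gr0>\d+)'
fuzzy_pattern = f'(?:{pattern}){{e<=3}}'

# Hypothetical noisy input: "Nane" instead of "Name"
noisy_text = "Nane: foo BaR; Age: 42"

match = regex.search(fuzzy_pattern, noisy_text, regex.BESTMATCH)
if match is None:
    print("no match within 3 errors")
else:
    # groupdict() keeps a single string per named group instead of a list
    result = match.groupdict()
    # fuzzy_counts is a tuple: (substitutions, insertions, deletions)
    n_errors = sum(match.fuzzy_counts)
    print(result, n_errors)
```

search() returns None when no match fits within the error budget, so the None check is worth keeping before calling groupdict() or capturesdict().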