SpaCy: Regex pattern does not work in rule-based matcher


I am trying to define a regular expression to use as a text pattern in the entity ruler component of my spaCy model. The aim is to label text as "COMP" whenever the ruler finds words structured like this:

  • XXX-Ynnn
  • XXX Ynnn

where 'XXX' is a trigram from a list, 'Y' is a letter and 'nnn' is a three-digit combination.

To do so, I use the following function:

def add_component_patterns_re(input_references, model_ruler):
    ruler = model_ruler
    ref_patterns = []
    letters = ['V', 'B', 'F', 'K', 'S']

    print("Adding component patterns")
    for ref in input_references.iloc[:, 0]:
        # print(f"Adding references for system: {ref}")
        for letter in letters:
            pattern_text = fr'{ref}(-| ){letter}[0-9]{{3}}'
            pattern = {"TEXT": {"REGEX": fr'{ref}(-| ){letter}[0-9]{{3}}'}}
            ref_patterns.append({"label":"COMP", "pattern":pattern})
    ruler.add_patterns(ref_patterns)

    return ref_patterns

When I print the added patterns, the output list looks correct to me, so my guess is that I am doing something wrong when defining the pattern to add to the ruler. For information, I've also tried wrapping the pattern in a list, like this:

pattern = [{"TEXT": {"REGEX": fr'{ref}(-| ){letter}[0-9]{{3}}'}}]

But the result is the same: I can't seem to get any match.

Does someone have any suggestion? Thanks in advance!


There are 2 best solutions below

0
FSic On BEST ANSWER

In the end I got

for ref in input_references.iloc[:, 0]:
    print(f"Adding references for system: {ref}")
    for letter in letters:
        for nnn in range(1000):
            # One literal pattern per separator: "XXX-Ynnn", then "XXX Ynnn"
            for sep in ("-", " "):
                pattern = f"{ref}{sep}{letter}{nnn:03d}"
                ref_patterns.append({"label": "COMP", "pattern": pattern})

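This adds one literal string pattern for every reference, letter and three-digit number, in both the hyphenated and the space-separated form. The code is lengthier and a tad slower, but it does the job just fine!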
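A likely reason the regex version never matched: in spaCy's token patterns, the REGEX operator is tested against one token at a time, so a regex that has to span the separator cannot match once the text has been split into tokens. The sketch below imitates that with plain `re` and a hand-written token list, then builds one multi-token pattern per reference as an alternative to enumerating every number. The reference "ABC", the helper `build_token_patterns` and the assumed tokenization (separate tokens for "ABC", an optional "-", and "V123") are illustrative assumptions, not something confirmed by the original post.

```python
import re

# Hypothetical reference code, plus the letter list from the question.
ref = "ABC"
letters = ["V", "B", "F", "K", "S"]

# The question's regex needs the reference, the separator and the suffix together.
whole = re.compile(rf"{ref}(-| )V[0-9]{{3}}")

# Assumption: the tokenizer splits "ABC V123" into these two tokens.
tokens = ["ABC", "V123"]

matches_full_text = bool(whole.search("ABC V123"))        # regex sees the whole string
matches_any_token = any(whole.search(t) for t in tokens)  # regex sees one token at a time


def build_token_patterns(refs, letters):
    """Return one multi-token pattern per reference (hypothetical helper)."""
    letter_class = "".join(letters)
    return [
        {
            "label": "COMP",
            "pattern": [
                {"TEXT": r},                # the reference, assumed to be a single token
                {"TEXT": "-", "OP": "?"},   # optional hyphen token
                {"TEXT": {"REGEX": rf"^[{letter_class}][0-9]{{3}}$"}},  # e.g. "V123"
            ],
        }
        for r in refs
    ]


patterns = build_token_patterns([ref], letters)
print(matches_full_text, matches_any_token)
print(patterns[0]["pattern"][2])
```

If the tokenization assumption holds for your data, a list built this way could be passed to `ruler.add_patterns` in place of the enumerated strings; it is worth printing `[t.text for t in nlp("ABC-V123")]` first to check how your pipeline actually splits the text.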

2
Nauel On

AFAIK, in the context of NLP an n-gram is a sequence of N words, so a trigram is a sequence of 3 words.

I think {ref} is not needed in this case. Assume the value in your cell is "love like me-V123": interpolating {ref} puts the entire string into the pattern, which is not what you want, since you're only interested in "love like me".

So I would build the following regexes, which match your case:

  • \w+\s+\w+\s+\w+-[YVBFKS]\d{3} -> Regex with "-"

  • \w+\s+\w+\s+\w+\s+[YVBFKS]\d{3} -> Regex without "-"

Applying this all together in python I would end up with:

import re

# Accept either "-" or whitespace before the letter-and-digits part
pattern = r"\w+\s+\w+\s+\w+(?:-|\s+)[YVBFKS]\d{3}"
labeled_tokens = []

for ref in input_references.iloc[:, 0]:
    # Label the cell value "COMP" if it starts with a match for the pattern
    if re.match(pattern, ref):
        labeled_tokens.append((ref, "COMP"))
    else:
        labeled_tokens.append((ref, None))
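As a quick check of what this regex accepts, here is a small sketch that merges the two regexes listed above into one and runs it on a few made-up strings instead of `input_references`. The sample values are assumptions for illustration only.

```python
import re

# Both variants in one pattern: "-" or whitespace before the letter and digits.
pattern = re.compile(r"\w+\s+\w+\s+\w+(?:-|\s+)[YVBFKS]\d{3}")

# Made-up sample values standing in for the first column of input_references.
samples = [
    "love like me-V123",  # three words, hyphen variant
    "love like me V123",  # three words, space variant
    "ABC-V123",           # a three-letter code rather than three words
    "love like V123",     # only two words before the code
]

labeled = [(s, "COMP" if pattern.match(s) else None) for s in samples]
for text, label in labeled:
    print(f"{text!r}: {label}")
```

Only the first two samples are labeled "COMP", so this regex fits three-word phrases; if 'XXX' in the question is a three-letter code such as "ABC", it will not match.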