How to compute a confusion matrix using spaCy's Scorer/Example classes?


I am trying to calculate the accuracy and specificity of an NER model using spaCy's API. The scorer.score(examples) method found here computes recall, precision and the F1 score for the spans predicted by the model, but does not expose the underlying TP, FP, TN, or FN counts.

Below is the code I have currently written, along with an example of the data structure I am using to pass my expected entities into the model.

Code Being Used to Score the Model:

import spacy
from spacy.scorer import Scorer
from spacy.training.example import Example

scorer = Scorer()
examples = []
for obs in example_list:
    print('Input for a prediction:', obs['full_text'])
    pred = custom_nlp(obs['full_text'])  # custom_nlp is the custom model I am using to generate docs
    print('Predicted based off of input:', pred, '// Entities being reviewed:', obs['entities'])
    # Pair the predicted doc with the gold-standard entity offsets
    temp = Example.from_dict(pred, {'entities': obs['entities']})
    examples.append(temp)
scores = scorer.score_spans(examples, "ents")

The data structure I am currently using to build the Example objects is a list of dictionaries:

example_list[0]
{'full_text': 'I would like to remove my kid Florence from the will. How do I do that?',
 'entities': [(30, 38, 'PERSON')]}
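
For reference, the (start, end) values are character offsets into full_text:

text = example_list[0]['full_text']
start, end, label = example_list[0]['entities'][0]
print(text[start:end], label)  # prints: Florence PERSON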

The result I get from running print(scores) is as expected: a dictionary with the entity recognition's overall precision, recall and F1 score, plus the same metrics broken down per entity type.

{'ents_p': 0.8731019522776573,
 'ents_r': 0.9179019384264538,
 'ents_f': 0.8949416342412452,
 'ents_per_type': {'PERSON': {'p': 0.9039145907473309,
   'r': 0.9694656488549618,
   'f': 0.9355432780847145},
  'GPE': {'p': 0.7973856209150327,
   'r': 0.9384615384615385,
   'f': 0.8621908127208481},
  'STREET_ADDRESS': {'p': 0.8308457711442786,
   'r': 0.893048128342246,
   'f': 0.8608247422680412},
  'ORGANIZATION': {'p': 0.9565217391304348,
   'r': 0.7415730337078652,
   'f': 0.8354430379746837},
  'CREDIT_CARD': {'p': 0.9411764705882353, 'r': 1.0, 'f': 0.9696969696969697},
  'AGE': {'p': 1.0, 'r': 1.0, 'f': 1.0},
  'US_SSN': {'p': 1.0, 'r': 1.0, 'f': 1.0},
  'DOMAIN_NAME': {'p': 0.4, 'r': 1.0, 'f': 0.5714285714285715},
  'TITLE': {'p': 0.8709677419354839, 'r': 0.84375, 'f': 0.8571428571428571},
  'PHONE_NUMBER': {'p': 0.8275862068965517,
   'r': 0.8275862068965517,
   'f': 0.8275862068965517},
  'EMAIL_ADDRESS': {'p': 1.0, 'r': 1.0, 'f': 1.0},
  'DATE_TIME': {'p': 1.0, 'r': 1.0, 'f': 1.0},
  'NRP': {'p': 1.0, 'r': 1.0, 'f': 1.0},
  'IBAN_CODE': {'p': 1.0, 'r': 1.0, 'f': 1.0},
  'IP_ADDRESS': {'p': 0.75, 'r': 0.75, 'f': 0.75},
  'ZIP_CODE': {'p': 0.8333333333333334,
   'r': 0.7142857142857143,
   'f': 0.7692307692307692},
  'US_DRIVER_LICENSE': {'p': 1.0, 'r': 1.0, 'f': 1.0}}}

How can I extract the TP, FP, TN and FN counts from this scoring, e.g. via some attribute?

1 Answer

Answered by V12:

Copied from https://github.com/explosion/spaCy/discussions/12682#discussioncomment-6036758:

This isn't currently supported by the provided scorers (you haven't overlooked any built-in options), but you can replace the default scorer with your own custom registered scoring method in the scorer setting in the config.

Here's what the basics look like when you define a custom scorer (this example just renames the returned keys):

spaCy/spacy/tests/test_language.py Lines 188-199

    import spacy
    from spacy.scorer import Scorer


    def custom_textcat_score(examples, **kwargs):
        scores = Scorer.score_cats(
            examples,
            "cats",
            multi_label=False,
            **kwargs,
        )
        # Return the same scores under renamed keys
        return {f"custom_{k}": v for k, v in scores.items()}


    @spacy.registry.scorers("test_custom_textcat_scorer")
    def make_custom_textcat_scorer():
        return custom_textcat_score
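
Once registered, the scorer can be wired into a component through its scorer setting in the config. A minimal sketch, assuming a textcat component (the registered name must match the one used above):

    [components.textcat]
    factory = "textcat"
    scorer = {"@scorers":"test_custom_textcat_scorer"}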

You'd usually provide your custom scorer with -c code.py for spacy train, spacy evaluate, spacy package, etc.
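
For example (file and directory names here are illustrative):

    python -m spacy train config.cfg --code code.py --output ./output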

The current NER scorer is here:

spaCy/spacy/scorer.py Lines 750-792

    def get_ner_prf(examples: Iterable[Example], **kwargs) -> Dict[str, Any]: 
        """Compute micro-PRF and per-entity PRF scores for a sequence of examples.""" 
        score_per_type = defaultdict(PRFScore) 
        for eg in examples: 
            if not eg.y.has_annotation("ENT_IOB"): 
                continue 
            golds = {(e.label_, e.start, e.end) for e in eg.y.ents} 
            align_x2y = eg.alignment.x2y 
            for pred_ent in eg.x.ents: 
                if pred_ent.label_ not in score_per_type: 
                    score_per_type[pred_ent.label_] = PRFScore() 
                indices = align_x2y[pred_ent.start : pred_ent.end] 
                if len(indices): 
                    g_span = eg.y[indices[0] : indices[-1] + 1] 
                    # Check we aren't missing annotation on this span. If so, 
                    # our prediction is neither right nor wrong, we just 
                    # ignore it. 
                    if all(token.ent_iob != 0 for token in g_span): 
                        key = (pred_ent.label_, indices[0], indices[-1] + 1) 
                        if key in golds: 
                            score_per_type[pred_ent.label_].tp += 1 
                            golds.remove(key) 
                        else: 
                            score_per_type[pred_ent.label_].fp += 1 
            for label, start, end in golds: 
                score_per_type[label].fn += 1 
        totals = PRFScore() 
        for prf in score_per_type.values(): 
            totals += prf 
        if len(totals) > 0: 
            return { 
                "ents_p": totals.precision, 
                "ents_r": totals.recall, 
                "ents_f": totals.fscore, 
                "ents_per_type": {k: v.to_dict() for k, v in score_per_type.items()}, 
            } 
        else: 
            return { 
                "ents_p": None, 
                "ents_r": None, 
                "ents_f": None, 
                "ents_per_type": None, 
            } 
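
If you only need the raw counts rather than a full drop-in scorer, note that each PRFScore above already tracks tp, fp and fn, so a custom function can collect them directly. A true-negative count is not well defined for span-based NER, because the set of spans that are correctly not entities is unbounded; that is why the built-in scorers never report TN. Below is a minimal sketch (not part of spaCy's API) that counts TP/FP/FN per entity type using exact character-offset matching, skipping the token-alignment handling that get_ner_prf performs:

    from collections import defaultdict
    from typing import Dict, Iterable

    from spacy.training import Example


    def ner_confusion_counts(examples: Iterable[Example]) -> Dict[str, Dict[str, int]]:
        """Count per-label TP/FP/FN via exact span matching (no TN: undefined for spans)."""
        counts: Dict[str, Dict[str, int]] = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
        for eg in examples:
            # Gold spans keyed by (label, start_char, end_char)
            golds = {(e.label_, e.start_char, e.end_char) for e in eg.reference.ents}
            for pred_ent in eg.predicted.ents:
                key = (pred_ent.label_, pred_ent.start_char, pred_ent.end_char)
                if key in golds:
                    counts[pred_ent.label_]["tp"] += 1
                    golds.remove(key)  # each gold span matches at most one prediction
                else:
                    counts[pred_ent.label_]["fp"] += 1
            # Gold spans never matched by a prediction are misses
            for label, _, _ in golds:
                counts[label]["fn"] += 1
        return dict(counts)

Called on the list of Example objects built in the question, ner_confusion_counts(examples) returns something like {'PERSON': {'tp': ..., 'fp': ..., 'fn': ...}, ...}, and you can sanity-check it by recomputing precision and recall from the counts and comparing against the score_spans output.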

I asked the spaCy maintainers about this; reference: https://github.com/explosion/spaCy/discussions/12682