Entity Linking with spaCy/Wikipedia


I am trying to follow the example here: https://github.com/explosion/spaCy/tree/master/bin/wiki_entity_linking. But I am just confused as to what is in the training data. Is it everything from Wikipedia? Say I just need training data on a few entities. For example, E1, E2, and E3. Does the example allow for me to specify only a few entities that I want to disambiguate?

Answer by Sofie VL:

[UPDATE] Note that this code base was moved to https://github.com/explosion/projects/tree/master/nel-wikipedia (spaCy v2)

If you run the scripts as provided in https://github.com/explosion/spaCy/tree/master/bin/wiki_entity_linking, they will indeed create a training dataset from Wikipedia that you can use to train a generic entity-linking model.

If you're looking to train a more limited model, of course you can feed in your own training set. A toy example can be found here: https://github.com/explosion/spaCy/blob/master/examples/training/train_entity_linker.py, from which you can deduce the format of the training data:

def sample_train_data():
    # Each example is (text, {"links": {(start_char, end_char): {QID: gold probability}}}),
    # where the character offsets mark the mention span ("Russ Cochran" spans characters 0-12).
    train_data = []

    # Q2146908 (Russ Cochran): American golfer
    # Q7381115 (Russ Cochran): publisher

    text_1 = "Russ Cochran his reprints include EC Comics."
    dict_1 = {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}
    train_data.append((text_1, {"links": dict_1}))

    text_2 = "Russ Cochran has been publishing comic art."
    dict_2 = {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}
    train_data.append((text_2, {"links": dict_2}))

    text_3 = "Russ Cochran captured his first major title with his son as caddie."
    dict_3 = {(0, 12): {"Q7381115": 0.0, "Q2146908": 1.0}}
    train_data.append((text_3, {"links": dict_3}))

    text_4 = "Russ Cochran was a member of University of Kentucky's golf team."
    dict_4 = {(0, 12): {"Q7381115": 0.0, "Q2146908": 1.0}}
    train_data.append((text_4, {"links": dict_4}))

    return train_data

This example in train_entity_linker.py shows you how the model learns to disambiguate "Russ Cochran" the golfer (Q2146908) from the publisher (Q7381115). Note that it is just a toy example: a realistic application would require a larger knowledge base with accurate prior frequencies (as you get by running the Wikipedia/Wikidata scripts), and of course you would need many more sentences and more lexical variety for the machine learning model to pick up proper clues and generalize well to unseen text.
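
To give a sense of what such a knowledge base looks like, here is a minimal sketch (not the official script) of building one for just these two entities with the spaCy v2 KnowledgeBase API. The model name, entity frequencies, prior probabilities, descriptions and output paths are all illustrative placeholders; the Wikipedia/Wikidata scripts compute the real values for you.

import spacy
from spacy.kb import KnowledgeBase

# Hypothetical choice: any model with pretrained word vectors will do.
nlp = spacy.load("en_core_web_md")

# The vector length must match the vectors added below (300 for en_core_web_md).
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=300)

# Encode a short description of each entity as its vector (one possible choice).
golfer_vec = nlp("American golfer").vector
publisher_vec = nlp("American publisher of EC Comics reprints").vector

# freq is a raw corpus count for the entity; the values here are made up.
kb.add_entity(entity="Q2146908", freq=30, entity_vector=golfer_vec)
kb.add_entity(entity="Q7381115", freq=20, entity_vector=publisher_vec)

# Prior probabilities: given the surface form "Russ Cochran", how likely is each QID?
# These priors are also made up; the Wikipedia scripts estimate them from intra-wiki links.
kb.add_alias(
    alias="Russ Cochran",
    entities=["Q2146908", "Q7381115"],
    probabilities=[0.6, 0.4],
)

kb.dump("my_kb")               # serialize the KB
nlp.vocab.to_disk("my_vocab")  # keep the matching vocab next to it

From there, train_entity_linker.py shows how a KB like this is plugged into the pipeline (entity_linker = nlp.create_pipe("entity_linker"), entity_linker.set_kb(kb), nlp.add_pipe(entity_linker)) and trained with nlp.update() on examples in the format shown above.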