How to get original token position in string from Stanza constituency parse tree?

I am using Stanza to extract noun phrases (NPs) from texts. The code below extracts the NPs and stores them according to their depth.

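from collections import defaultdict

import stanza

nlp = stanza.Pipeline('en', tokenize_pretokenized=True)
sentence_tokens = ['This', 'is', 'a', 'sentence', '.']


def extract_NPs(tree, np_dict):
    # Recursively collect every NP subtree, keyed by its depth.
    for child in tree.children:
        if child.label == 'NP':
            np_dict[child.depth()].append(child)
        extract_NPs(child, np_dict)
    return np_dict


# Pretokenized input is a list of sentences, each a list of tokens,
# so the single sentence is wrapped in an outer list.
doc = nlp([sentence_tokens])
for sent in doc.sentences:
    tree = sent.constituency
    nps = extract_NPs(tree, np_dict=defaultdict(list))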

The output dictionary maps each depth to a list of the NP trees with that depth. Each NP is a Tree object, as defined in the Stanza GitHub repository.

I have combed over the code and documentation, and I cannot find a way to map the text of the NPs back to its position in the original input sentence. Simply looking up the index of a token in sentence_tokens doesn't work for me, as many of these sentences contain repeated tokens.

Any ideas?

Answer by Profio:

You can use replace_words() to replace each word in the constituency parse with the word's id before processing the tree object (the replacement words must still be strings):

tree = tree.replace_words(map(str, range(len(sentence_tokens))))

Then you can recover the word ids for a given NP tree using leaf_labels(), converting each one with int() to index into sentence_tokens. E.g., calling leaf_labels() on the root will now return:

['0', '1', '2', '3', '4']
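
An alternative that leaves the tree's words untouched is to count leaves while walking the tree and record a (start, end) token span for every NP. The following is only a minimal sketch, not part of Stanza's API: Node is a hypothetical stand-in so the example runs without downloading a model, and np_spans is an illustrative helper. It assumes only the label and children attributes already used in the question, and that each leaf corresponds to exactly one input token.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a parse tree node (label + children only)."""
    label: str
    children: list = field(default_factory=list)


def np_spans(tree):
    """Return (start, end) token spans, end exclusive, for every NP in tree."""
    spans = []

    def walk(node, start):
        # A node without children is a word and consumes one token position.
        if not node.children:
            return start + 1
        end = start
        for child in node.children:
            end = walk(child, end)
        if node.label == 'NP':
            spans.append((start, end))
        return end

    walk(tree, 0)
    return spans


def leaf(tag, word):
    # Illustrative helper: a POS-tag node wrapping a single word node.
    return Node(tag, [Node(word)])


tokens = ['This', 'is', 'a', 'sentence', '.']
tree = Node('ROOT', [Node('S', [
    Node('NP', [leaf('DT', 'This')]),
    Node('VP', [leaf('VBZ', 'is'),
                Node('NP', [leaf('DT', 'a'), leaf('NN', 'sentence')])]),
    leaf('.', '.'),
])])

spans = np_spans(tree)
for start, end in spans:
    print((start, end), tokens[start:end])
```

With a real parse, passing sent.constituency in place of the hand-built tree should behave the same way as long as those assumptions hold. Because the spans are positions rather than strings, repeated tokens in the sentence are not a problem.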