I am using Stanza to extract noun phrases (NPs) from texts. The code below extracts the NPs and stores them according to their depth.
import stanza
from collections import defaultdict

def extract_NPs(tree, np_dict):
    # Recursively collect every NP subtree, keyed by its depth.
    for child in tree.children:
        if child.label == 'NP':
            np_dict[child.depth()].append(child)
        np_dict = extract_NPs(child, np_dict)
    return np_dict

nlp = stanza.Pipeline('en', tokenize_pretokenized=True)
sentence_tokens = ['This', 'is', 'a', 'sentence', '.']
# Pretokenized input is a list of sentences, each a list of tokens.
doc = nlp([sentence_tokens])
for sent in doc.sentences:
    tree = sent.constituency
    nps = extract_NPs(tree, np_dict=defaultdict(list))
The output dictionary maps each depth to a list of the NP trees with that depth. Each NP is a Tree object, as defined in the Stanza GitHub repository.
I have combed over the code and documentation, and I cannot find a way to map the text of the NPs back to their positions in the original input sentence. Simply looking up the index of a token in sentence_tokens doesn't work for me, as many of these sentences contain repeated tokens.
Any ideas?
You can use replace_words() to replace each word in the constituency parse with the word's id before processing the tree object (the new words still have to be strings). Then you can recover the word ids for a given NP tree using leaf_labels(). E.g., calling leaf_labels() on the root will now return the word ids as strings, which is ['1', '2', '3', '4', '5'] for the five-token example sentence.
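Putting the two steps together, here is a minimal sketch of how this could look. The helper name np_spans_by_depth and the (start, end) span convention are my own and not part of Stanza. It assumes that replace_words() returns a new tree and that word ids are 1-based, which matches my reading of the Stanza Tree class, so please check it against your installed version.

```python
from collections import defaultdict

def np_spans_by_depth(tree, word_ids):
    """Map each NP depth to a list of (start, end) token spans.

    `tree` is expected to behave like a Stanza constituency Tree, and
    `word_ids` are the 1-based ids of the sentence's words, in order.
    Spans are half-open, 0-based indices into the original token list.
    (Hypothetical helper, not a Stanza function.)
    """
    # Swap every leaf for its word id; leaves must be strings.
    # Assumption: replace_words() returns a new tree.
    id_tree = tree.replace_words([str(i) for i in word_ids])

    spans = defaultdict(list)

    def visit(node):
        for child in node.children:
            if child.label == 'NP':
                # The leaves of a constituent are contiguous, so the first
                # and last ids are enough to describe the span.
                ids = [int(label) for label in child.leaf_labels()]
                spans[child.depth()].append((ids[0] - 1, ids[-1]))
            visit(child)

    visit(id_tree)
    return spans
```

With the code from the question, this would be called as np_spans_by_depth(sent.constituency, [word.id for word in sent.words]), and sentence_tokens[start:end] then gives the tokens of each NP, even when the sentence contains repeated tokens.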