Given an NLTK tree produced using the code below, how do I retrieve the leaf values (phrases) that match any of the node labels assigned by nltk.RegexpParser (e.g. those phrases which match the Present_Indefinite or Present_Perfect tense)?
from nltk import word_tokenize, pos_tag
import nltk
text = ("#NOVAVAX has produced the #NUVAXOVID vaccine. "
        "Will that provide a new rally? We see Biotechnology "
        "Stock $NVAX Entering the Buying Area.")
tokenized = word_tokenize(text) # Tokenize text
tagged = pos_tag(tokenized) # Tag tokenized text with PoS tags
my_grammar = r"""
Future_Perfect_Continuous: {<MD><VB><VBN><VBG>}
Future_Continuous: {<MD><VB><VBG>}
Future_Perfect: {<MD><VB><VBN>}
Past_Perfect_Continuous: {<VBD><VBN><VBG>}
Present_Perfect_Continuous:{<VBP|VBZ><VBN><VBG>}
Future_Indefinite: {<MD><VB>}
Past_Continuous: {<VBD><VBG>}
Past_Perfect: {<VBD><VBN>}
Present_Continuous: {<VBZ|VBP><VBG>}
Present_Perfect: {<VBZ|VBP><VBN>}
Past_Indefinite: {<VBD>}
Present_Indefinite: {<VBZ>|<VBP>}"""
def check_grammar(grammar, tags):
    cp = nltk.RegexpParser(grammar)
    result = cp.parse(tags)
    return result
# Apply regex parser and create parse tree
result = check_grammar(my_grammar, tagged)
print(type(result))
# Output: <class 'nltk.tree.tree.Tree'>
More specifically, given that the output of print(result) is as shown below, how can I retrieve the phrases labelled as Present_Perfect and Present_Indefinite, or more generally, any other phrases which match the labels in my grammar?
(S
  #/#
  NOVAVAX/NNP
  (Present_Perfect has/VBZ produced/VBN)
  the/DT
  #/#
  NUVAXOVID/NNP
  vaccine/NN
  ./.
  Will/MD
  that/WDT
  provide/VB
  a/DT
  new/JJ
  rally/NN
  ?/.
  We/PRP
  (Present_Indefinite see/VBP)
  Biotechnology/NNP
  Stock/NNP
  $/$
  NVAX/NNP
  Entering/NNP
  the/DT
  Buying/NNP
  Area/NNP
  ./.)
I've created a get_phrases_using_tense_label() function which takes:

- the parse tree returned from your check_grammar() function (I've renamed it to get_parse_tree() as this is more meaningful in terms of what the function is doing), and
- a list of tense labels.

The tense labels are retrieved using the get_labels_from_grammar() function I created, which iterates over the lines in your grammar and splits each line at the ":" to retrieve the tense label.

The function then returns the list of phrases (along with their tags) for those nodes in the NLTK tree which match any of your tense_labels (e.g. "Present_Indefinite" and "Present_Perfect" in the solution below). I've used a smaller text as input as an example.

Solution
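As a minimal sketch of the approach described above (not necessarily the exact code of this answer), the three functions could look roughly like this. The function names come from the description; their signatures, the shortened two-rule grammar and the hand-tagged token list are assumptions made for illustration. The tokens are tagged by hand (copied from the parse output in the question) so that the sketch runs without downloading any tokenizer or tagger models.

```python
import nltk

# Shortened grammar: two of the rules from the question (assumption:
# the remaining rules are handled the same way).
grammar = r"""
Present_Perfect: {<VBZ|VBP><VBN>}
Present_Indefinite: {<VBZ>|<VBP>}"""

# Hand-tagged tokens taken from the parse output in the question, used here
# in place of word_tokenize() + pos_tag().
tagged = [
    ("NOVAVAX", "NNP"), ("has", "VBZ"), ("produced", "VBN"),
    ("the", "DT"), ("vaccine", "NN"), (".", "."),
    ("We", "PRP"), ("see", "VBP"), ("Biotechnology", "NNP"),
]


def get_parse_tree(grammar, pos_tagged_text):
    """Chunk the PoS-tagged tokens with the grammar and return the tree."""
    return nltk.RegexpParser(grammar).parse(pos_tagged_text)


def get_labels_from_grammar(grammar):
    """Return the text before the first ':' on each rule line of the grammar."""
    return [line.split(":")[0].strip()
            for line in grammar.splitlines() if ":" in line]


def get_phrases_using_tense_label(parse_tree, tense_labels):
    """Return the (word, tag) leaves of every subtree labelled with a tense."""
    wanted = set(tense_labels)
    return [subtree.leaves()
            for subtree in parse_tree.subtrees(lambda t: t.label() in wanted)]


tree = get_parse_tree(grammar, tagged)
labels = get_labels_from_grammar(grammar)
phrases = get_phrases_using_tense_label(tree, labels)
print(labels)
print(phrases)
```

Tree.subtrees() accepts a filter function, so the label check is done while walking the tree and the root S node is skipped because its label is not one of the tense labels. If only the words are needed, each phrase can be joined with " ".join(word for word, tag in phrase).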