I am trying to find the top 20 most common hypernyms for all nouns and verbs in a text file. I believe that my output is erroneous and that there is a more elegant solution, in particular one that avoids manually creating a list of the most common nouns and verbs and then iterating over their synsets to identify the hypernyms.
Please see below for the code I have attempted so far; any guidance would be appreciated:
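import nltk
from nltk.corpus import wordnet

# hamlet_spacy is assumed to be a spaCy Doc built from the text (construction not shown)
nouns_verbs = [token.text for token in hamlet_spacy if (not token.is_stop and not token.is_punct and token.pos_ == "VERB" or token.pos_ == "NOUN")]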
def check_hypernym(word_list):
    return_list=[]
    for word in word_list:
        w = wordnet.synsets(word)
        for syn in w:
            if not((len(syn.hypernyms()))==0):
                return_list.append(word)
                break
    return return_list
hypernyms = check_hypernym(nouns_verbs)
fd = nltk.FreqDist(hypernyms)
top_20 = fd.most_common(20)
word_list = ['lord', 't', 'know', 'come', 'love', 's', 'sir', 'thou', 'speak', 'let', 'man', 'father', 'think', 'time', 'Let', 'tell', 'night', 'death', 'soul', 'mother']
hypernym_list = []
for word in word_list:
    syn_list = wordnet.synsets(word)
    hypernym_list.append(syn_list)
final_list = []
for syn in syn_list:
    hypernyms_syn = syn.hypernyms()
    final_list.append(hypernyms_syn)
final_list
I tried identifying the top 20 most common nouns and verbs, then their synsets, and subsequently their hypernyms. I would prefer a more cohesive solution, especially since I am unsure whether my current result is accurate.
For the first part, getting all nouns and verbs from the text: you didn't provide the original text, so I wasn't able to reproduce this, but I believe you can shorten the condition, since a token that is tagged as a noun or verb cannot be punctuation. You can also use the in operator so that you don't need two separate Boolean conditions for NOUN and VERB. Other than that it looks fine.
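A minimal sketch of that shortened condition follows. Since the original text was not available, the Token tuple is a hypothetical stand-in for spaCy tokens and the stop-word flags are made up for illustration. It also shows one thing that may be worth checking in the original condition: because and binds more tightly than or, the stop-word test there applies only to verbs.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: only the attributes the filter reads.
Token = namedtuple("Token", "text pos_ is_stop is_punct")

# Illustrative tokens; the is_stop flags are invented for this sketch.
tokens = [
    Token("lord", "NOUN", False, False),
    Token("speak", "VERB", False, False),
    Token("thing", "NOUN", True, False),   # a noun flagged as a stop word
    Token("be", "VERB", True, False),
    Token(",", "PUNCT", False, True),
]

# Original condition: parsed as (not stop and not punct and VERB) or NOUN,
# so every noun passes, including stop-word nouns.
original = [t.text for t in tokens
            if (not t.is_stop and not t.is_punct and t.pos_ == "VERB" or t.pos_ == "NOUN")]

# Membership test: a single POS check, with the stop-word filter applied to both tags.
shortened = [t.text for t in tokens
             if not t.is_stop and t.pos_ in ("NOUN", "VERB")]

print(original)   # ['lord', 'speak', 'thing']
print(shortened)  # ['lord', 'speak']
```

With a real spaCy Doc, the same condition should work unchanged, because it reads only token.is_stop and token.pos_.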
For the second part, getting the most common hypernyms, your general approach is fine. You could make it a little more memory-efficient for long texts, where the same hypernym may appear many times, by using a Counter object from the get-go instead of constructing a long list. See the code below.
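A minimal sketch of the Counter approach, under some assumptions: the helper name count_hypernyms and the synsets_fn parameter are hypothetical, and the FakeSynset class and toy lexicon are invented stand-ins (not real WordNet output) so that the counting logic can be run without the WordNet corpus.

```python
from collections import Counter


def count_hypernyms(words, synsets_fn):
    """Tally the direct hypernyms of every synset of every word.

    synsets_fn is a lookup such as nltk's wordnet.synsets; it is passed in
    so the counting can be exercised without the corpus installed.
    """
    counts = Counter()
    for word in words:
        for syn in synsets_fn(word):
            # update() bumps one tally per hypernym name, so repeated
            # hypernyms never pile up in an intermediate list.
            counts.update(hyper.name() for hyper in syn.hypernyms())
    return counts


class FakeSynset:
    """Hypothetical stand-in exposing the two Synset methods used above."""

    def __init__(self, name, hypernyms=()):
        self._name = name
        self._hypernyms = list(hypernyms)

    def name(self):
        return self._name

    def hypernyms(self):
        return self._hypernyms


# Toy entries for illustration only; these are not real WordNet relations.
person = FakeSynset("person.n.01")
parent = FakeSynset("parent.n.01", [person])
toy_lexicon = {
    "father": [FakeSynset("father.n.01", [parent])],
    "mother": [FakeSynset("mother.n.01", [parent])],
    "man": [FakeSynset("man.n.01", [person])],
}

counts = count_hypernyms(["father", "mother", "man", "father"],
                         lambda w: toy_lexicon.get(w, []))
print(counts.most_common(2))  # [('parent.n.01', 3), ('person.n.01', 1)]
```

With NLTK and the WordNet corpus installed, the call would presumably be count_hypernyms(nouns_verbs, wordnet.synsets).most_common(20). Note that this counts the direct hypernyms of every sense of each word, once per occurrence of the word in the text; restricting to the first sense or to a part of speech is a separate design choice.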