I want to get the number of words in the WordNet corpus.
Code:

```python
from nltk.corpus import wordnet as wn

len_wn = len([word.lower() for word in wn.words()])
print(len_wn)
```
I get the output as 147306
My Questions:
- Am I getting the total number of words in WordNet?
- Do tokens such as `zoom_in` count as words?
Am I getting the total number of words in WordNet?
It depends on what the definition of "words" is. The `wn.words()` function iterates through all the lemma_names, see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1701 and https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1191. So if the definition of "words" is all the possible lemma names, then yes, this function gives you the total number of lemma names in WordNet:
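For example (a minimal sketch, not taken from the original answer; the exact count depends on the WordNet version bundled with your NLTK data):

```python
from nltk.corpus import wordnet as wn

# wn.words() is a thin wrapper around wn.all_lemma_names(), which
# iterates over the lemma index, so both give the same count.
print(len(list(wn.words())))            # 147306 on the asker's setup
print(len(list(wn.all_lemma_names())))  # same number
```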
Lowercasing is not necessary because the lemma names should already come lowercased, even named entities, e.g.
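A quick check (a hedged sketch; `new_york` is the named-entity example referenced later in this answer):

```python
from nltk.corpus import wordnet as wn

# The index that wn.words() iterates over stores lemma names in
# lowercase, so even named entities appear as e.g. "new_york".
words = set(wn.words())
print('new_york' in words)                 # expected: True
print(any(w != w.lower() for w in words))  # expected: False, nothing to lowercase
```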
But do note that the same concept can come with several very similar lemma names:
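For instance (an illustration using `new_york.n.01` as a convenient example synset; note that `Synset.lemma_names()` preserves the original casing from the WordNet data files, whereas the `wn.words()` index is lowercased):

```python
from nltk.corpus import wordnet as wn

# One synset groups several near-duplicate names for the same concept,
# e.g. the "New York" variants.
print(wn.synset('new_york.n.01').lemma_names())
```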
That's because of how WordNet is structured. The NLTK API organizes "meaning" as a synset; a synset is linked to multiple lemmas, and each lemma comes with a name:
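A sketch of that structure, again using `new_york.n.01` as the example synset (an assumption for illustration, not taken from the original answer):

```python
from nltk.corpus import wordnet as wn

synset = wn.synset('new_york.n.01')
for lemma in synset.lemmas():   # Lemma objects attached to this synset
    print(lemma, lemma.name())  # each Lemma carries exactly one name
```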
But each "word" you query can have multiple synsets (i.e. multiple meanings), e.g.
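(A sketch reusing `new_york` as the query word, assuming it maps to more than one sense, such as the city and the state.)

```python
from nltk.corpus import wordnet as wn

# A single query word can map to several synsets (senses).
for synset in wn.synsets('new_york'):
    print(synset, synset.definition())
```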
Do tokens such as `zoom_in` count as words?
Again, it depends on what the definition of a "word" is. As in the example above, if you iterate through `wn.words()`, you are iterating through the lemma_names, and the `new_york` example shows that multi-word expressions exist in the lemma name lists for each synset.
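If you want to tell them apart, one possibility (a sketch assuming multi-word lemma names are always joined with underscores, as with `zoom_in` and `new_york`) is:

```python
from nltk.corpus import wordnet as wn

# Multi-word expressions are stored with underscores, e.g. "zoom_in"
# or "new_york", so checking for '_' separates them from single words.
all_names = list(wn.words())
mwes = [w for w in all_names if '_' in w]
print(len(all_names), len(mwes), len(all_names) - len(mwes))
```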