I am trying to build a dictionary that maps each word in a text to its context. The context should be the list of words that occur within a five-word window (two words on either side) of the term's position in the string. I also want to exclude stopwords from the output vectors.
My code is below. I can get the stopwords out of my dictionary's keys but not the values.
from itertools import count
words = ["This", "is", "an", "example", "sentence"]
stopwords = ["this", "is", "an", "it", "the", "was", "of"]
context_size = 2
stripes = {word:words[max(i - context_size,0):j] for word,i,j in zip(words,count(0),count(context_size+1)) if word.lower() not in stopwords}
print(stripes)
The output is:
{'example': ['is', 'an', 'example', 'sentence'], 'sentence': ['an', 'example', 'sentence']}
I would recommend using a list of tuples, so that if a word occurs more than once in words, the result keeps every occurrence. In a dict, each later occurrence would overwrite the earlier ones. I would also put the stopwords in a set, especially if it's a larger list like NLTK's stopwords, since that speeds up the membership checks. I also excluded the word itself from the context, but depending on how you want to use it, you might want to include it.
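A minimal sketch of that approach, assuming a stopword set that includes "this", "is" and "an" (as the output in the question implies). The helper name build_stripes is hypothetical and only used for illustration:

```python
words = ["This", "is", "an", "example", "sentence"]
# Assumption: the stopword set contains "this", "is" and "an" as well,
# otherwise nothing in this sentence would be filtered out.
stopwords = {"this", "is", "an", "it", "the", "was", "of"}
context_size = 2


def build_stripes(words, stopwords, context_size):
    """Return a list of (word, context) tuples with stopwords removed from both."""
    stripes = []
    for i, word in enumerate(words):
        if word.lower() in stopwords:
            continue  # stopwords never become a key
        start = max(i - context_size, 0)
        # Words before and after position i; the word itself is left out.
        window = words[start:i] + words[i + 1:i + context_size + 1]
        # Drop stopwords from the context as well.
        context = [w for w in window if w.lower() not in stopwords]
        stripes.append((word, context))
    return stripes


stripes = build_stripes(words, stopwords, context_size)
print(stripes)
# [('example', ['sentence']), ('sentence', ['example'])]
```

Note that the stopwords are removed after the window is cut, so a context can end up shorter than four words. If you would rather always collect two non-stopwords on each side, filter words first and then build the windows over the filtered list.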
This results in: