I am looking to remove stopwords from text to optimise my frequency distribution results.
My initial frequency distribution code is:
# Determine the frequency distribution.
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

tokens = word_tokenize(review_comments)
fdist = FreqDist(tokens)
fdist
This returns:
FreqDist({"'": 521, ',': 494, "'the": 22, 'a': 16, "'of": 16, "'is": 12, "'to": 10, "'for": 9, "'it": 8, "'that": 8, ...})
I want to remove the stopwords with the following code:
# Filter out tokens that are not alphanumeric (to eliminate punctuation marks, etc.).
filtered = [word for word in review_comments if word.isalnum()]
# Remove all the stopwords
# Download the stopword list.
nltk.download('stopwords')
from nltk.corpus import stopwords
# Create a set of English stopwords.
english_stopwords = set(stopwords.words('english'))
# Create a filtered list of tokens without stopwords.
filtered2 = [x for x in filtered if x.lower() not in english_stopwords]
# Define an empty string variable.
filtered2_string = ''
for value in filtered:
    # Add each filtered token to the string.
    filtered2_string = filtered2_string + value + ''
Now I run the frequency distribution again:
from nltk.tokenize import word_tokenize

trial = word_tokenize(filtered2_string)
fdist1 = FreqDist(trial)
fdist1
This returns:
FreqDist({'whenitcomestoadmsscreenthespaceonthescreenitselfisatanabsolutepremiumthefactthat50ofthisspaceiswastedonartandnotterriblyinformativeorneededartaswellmakesitcompletelyuselesstheonlyreasonthatigaveit2starsandnot1wasthattechnicallyspeakingitcanatleaststillstanduptoblockyournotesanddicerollsotherthanthatitdropstheballcompletelyanopenlettertogaleforce9yourunpaintedminiaturesareverynotbadyourspellcardsaregreatyourboardgamesaremehyourdmscreenshoweverarefreakingterribleimstillwaitingforasinglescreenthatisntpolluted': 1})
For reference, review_comments was created by concatenating the comments from my dataframe:

review_comments = ''
for i in range(newdf.shape[1]):
    # Add each comment.
    review_comments = review_comments + newdf['tokens1'][i]
How do I remove the stopwords without losing the spaces, so that the words are counted individually?
I removed the stopwords and reran the frequency distribution, hoping to get the most frequent words.
Cleaning in NLP tasks is generally performed on tokens rather than on the characters of a string, so that you can leverage the built-in functionality and methods. However, you can always do this from scratch using your own logic on characters as well, if you need to. The stopwords in nltk are in the form of tokens, to be used for cleaning up your text corpus. You can add any further tokens that you need to eliminate to the list. For example, if you need the English stopwords and punctuation removed, do something like the example below, using some sample text from a write-up on the "meaning of life".