I need to plot word frequencies:
2333
appartamento 321
casa 314
cè 54
case 43
...
However, some words share the same stem (and therefore have a similar meaning).
In the example above, casa and case have the same meaning (the first is the singular, the second the plural noun, like house and houses).
I read that this issue can be fixed by using nltk.stem. I have, therefore, tried the following:
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
train_df = (df['Frasi']
            .str.replace(r'[^\w\s]+', '', regex=True)  # regex=True is required in recent pandas
            .str.split(' ')
            .explode()  # one row per word, so value_counts counts words, not lists
            .value_counts())
porter = PorterStemmer()
lancaster = LancasterStemmer()
Now I should run a loop over each word in the list above, using porter and lancaster, but I do not know how to apply the stemmers to that list.
Just to give you some context: the list above comes from phrases/sentences saved in a dataframe. My dataframe has many columns, including a column Frasi, which those words come from.
An example of phrases included within that column is:
Frasi
Ho comprato un appartamento in centro
Il tuo appartamento è stupendo
Quanti vani ha la tua casa?
Il mercato immobiliare è in crisi
....
What I have tried to do is clean the sentences, removing punctuation and stop words (but it seems spaces are still counted, as shown in the word list above). Now I need to use the word-frequency information to plot the top 10-20 words used, excluding words with a similar meaning or the same stem. Should I specify all the suffixes, or is there something I can use to automate the process?
Any help on this would be great.
Using NLTK
Code
Explanation
Initial data is in a Pandas DataFrame. Obtain the Frasi column as a single string.
The function freq_dist above does the following with its input string:
- Tokenize the string based upon language
- Remove punctuation (i.e. " ? , . etc.)
- Get the Italian stopwords and remove them
- Get the Italian stemmer and stem the words (lowercasing also removes case, so casa and Casa count together)
- Compute the frequency distribution using FreqDist
Example
DataFrame:
Generate string
Generate Word Count
Show in alphabetical order