Stemmed words to compute a frequency plot

I need to plot word frequencies:

                2333
appartamento    321
casa            314
cè               54 
case             43
                ... 

However, some words share the same stem (and therefore a similar meaning). In the example above, casa and case mean the same thing (the first is the singular, the second the plural, like house and houses). I read that this issue can be fixed using nltk.stem, so I have tried the following:

from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer

train_df = (df['Frasi'].str.replace(r'[^\w\s]+', '', regex=True)
                       .str.split(' ')
                       .explode()
                       .value_counts())

porter = PorterStemmer()
lancaster = LancasterStemmer()
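
For reference, a stemmer is applied one word at a time through its stem() method. A quick check of what these produce (note that Porter and Lancaster are designed for English; for Italian words like casa/case, NLTK's SnowballStemmer also supports 'italian'):

from nltk.stem.snowball import SnowballStemmer

print(porter.stem('houses'))     # -> 'hous'
print(lancaster.stem('houses'))  # Lancaster is more aggressive

# An Italian-aware stemmer collapses singular/plural forms
italian = SnowballStemmer('italian')
print(italian.stem('casa'), italian.stem('case'))  # both should yield 'cas'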

Now I should run a loop over each word in the list above, using porter and lancaster, but I do not know how to apply them to that list. To give you some context: the list above comes from phrases/sentences saved in a dataframe. My dataframe has many columns, including a column Frasi, which is where those words come from. An example of the phrases within that column:

Frasi
Ho comprato un appartamento in centro
Il tuo appartamento è stupendo
Quanti vani ha la tua casa?
Il mercato immobiliare è in crisi
.... 

What I have tried so far is to clean the sentences, removing punctuation and stop words (though empty tokens apparently remain, as the blank entry in the word list above shows). Now I need to use the word-frequency information to plot the top 10-20 words, merging words with a similar meaning or the same stem. Should I spell out all the suffixes myself, or is there something I can use to automate the process?
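
A sketch of what I think such a loop could look like, assuming train_df is the word-to-count Series built above (porter could be swapped for any other stemmer):

from collections import Counter

# Merge the counts of words that share a stem,
# so that 'casa' and 'case' collapse into one entry
stemmed_counts = Counter()
for word, count in train_df.items():
    stemmed_counts[porter.stem(word)] += count

top_words = stemmed_counts.most_common(20)  # top 20 stems for the plot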

Any help on this would be great.

BEST ANSWER (DarrylG)

Using NLTK

Code

import nltk                                 
from nltk.tokenize import word_tokenize        # https://www.tutorialspoint.com/python_data_science/python_word_tokenization.htm
from nltk.stem.snowball import SnowballStemmer # https://www.nltk.org/howto/stem.html
from nltk.probability import FreqDist          # http://www.nltk.org/api/nltk.html?highlight=freqdist
from nltk.corpus import stopwords              # https://www.geeksforgeeks.org/removing-stop-words-nltk-python/

def freq_dist(s, language):
    """Frequency count based upon language."""
    # Language-based stop words and stemmer
    fr_stopwords = set(stopwords.words(language))  # set for faster lookup
    fr_stemmer = SnowballStemmer(language)

    # Language-based tokenization
    words = word_tokenize(s, language=language)

    # Count stems of alphanumeric, non-stopword tokens
    return FreqDist(fr_stemmer.stem(w) for w in words
                    if w.isalnum() and w not in fr_stopwords)

Explanation

The initial data is in a Pandas DataFrame. Obtain the French column as a single string:

s = '\n'.join(df['French'].tolist())

The freq_dist function above performs the following steps on its input string.

Tokenize string based upon language

words = word_tokenize(s, language='french')

Remove punctuation (e.g. " ? , .)

punctuation_removed = [w for w in words if w.isalnum()]

Get French stopwords

french_stopwords = set(stopwords.words('french')) # make set for faster lookup

Remove stopwords

without_stopwords = [w for w in punctuation_removed if not w in french_stopwords]

Stem words (stemming also lowercases them)

Get the French stemmer

french_stemmer = SnowballStemmer('french')

Stem words

stemmed_words = [french_stemmer.stem(w) for w in without_stopwords]

Frequency distribution using FreqDist

fDist = FreqDist(stemmed_words)

Example

DataFrame:

                                      French
0               Ho comprato un appartamento in centro
1                      Il tuo appartamento è stupendo
2                         Quanti vani ha la tua casa?
3                   Il mercato immobiliare è in crisi
4                                     Qui vivra verra
5                        L’habit ne fait pas le moine
6                         Chacun voit midi à sa porte
7                      Mieux vaut prévenir que guérir
8                Petit a petit, l’oiseau fait son nid
9   Qui court deux lievres a la fois, n’en prend a...
10                           Qui n’avance pas, recule
11  Quand on a pas ce que l’on aime, il faut aimer...
12  Il n’y a pas plus sourd que celui qui ne veut ...

Generate string

sentences = '\n'.join(df['French'].tolist())

Generate Word Count

counts = freq_dist(sentences, 'french')

Show in alphabetical order

results = sorted(counts.most_common(), 
                 key=lambda x: x[0])
for k, v in results:
    print(k, v)

a 5
aim 2
appartamento 2
aucun 1
avanc 1
cas 1
celui 1
centro 1
chacun 1
comprato 1
court 1
cris 1
deux 1
entendr 1
fait 2
faut 1
fois 1
guer 1
ha 1
hab 1
ho 1
il 3
immobiliar 1
in 2
l 1
lievr 1
mercato 1
mid 1
mieux 1
moin 1
nid 1
oiseau 1
pet 2
plus 1
port 1
prend 1
préven 1
quand 1
quant 1
qui 3
recul 1
sourd 1
stupendo 1
tu 1
tuo 1
van 1
vaut 1
verr 1
veut 1
vivr 1
voit 1
è 2
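
Finally, since the original goal was a frequency plot: counts is an NLTK FreqDist, so its most frequent entries can be fed straight to matplotlib. A minimal sketch (the top-20 cutoff is arbitrary):

import matplotlib.pyplot as plt

# Bar chart of the 20 most frequent stems
words, freqs = zip(*counts.most_common(20))
plt.bar(words, freqs)
plt.xticks(rotation=45, ha='right')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()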