Failed lemmatization


I'm trying to lemmatize German texts stored in a DataFrame. I use the german library, which is built to handle German-specific grammatical structure: https://github.com/jfilter/german-preprocessing

My code:

import pandas as pd
from german import preprocess

df = pd.read_csv('Afd.csv', sep=',')

Lemma = open('MessageAFD_lemma.txt', 'w')
for i in df['message']:
    preprocess (i, remove_stop=True)
    Lemma.write(i)
Lemma.close()

The script runs without any error in the terminal, but when I open the file "MessageAFD_lemma.txt", the text is unchanged (nothing was lemmatized).

The expected result is like:

Input:

preprocess(['Johannes war einer von vielen guten Schülern.', 'Julia trinkt gern Tee.'], remove_stop=True)

Output: ['johannes gut schüler', 'julia trinken tee']

What is going wrong?

BoppreH On BEST ANSWER

The preprocess function returns new, processed texts instead of modifying its input. So you need to write the return value of preprocess to the file, not the original message i.

Furthermore, preprocess accepts a list of texts, so you must wrap a single message as [message] and extract the single result from the returned list with result, = ...

import pandas as pd
from german import preprocess

df = pd.read_csv('Afd.csv', sep=',')

# Process one message at a time: preprocess takes a list and returns a list.
with open('MessageAFD_lemma.txt', 'w', encoding='utf-8') as lemma_file:
    for message in df['message']:
        result, = preprocess([message], remove_stop=True)
        lemma_file.write(result + '\n')  # one lemmatized message per line

# Or, to process all messages in one go:
with open('MessageAFD_lemma.txt', 'w', encoding='utf-8') as f:
    for result in preprocess(df['message'], remove_stop=True):
        f.write(result + '\n')
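
To see both points in isolation (the return value must be kept, and the function works list in, list out), here is a minimal sketch that runs without the library. Note that toy_preprocess and its stop-word list are hypothetical stand-ins invented for illustration; they only lowercase and drop a few words and do not lemmatize, so the real german.preprocess will give different output.

```python
def toy_preprocess(texts, remove_stop=False):
    """Hypothetical stand-in for german.preprocess: takes a list, returns a new list."""
    stop_words = {"war", "einer", "von", "vielen", "gern"}  # made-up toy stop list
    results = []
    for text in texts:
        tokens = text.lower().replace(".", "").split()
        if remove_stop:
            tokens = [t for t in tokens if t not in stop_words]
        results.append(" ".join(tokens))
    return results


messages = ["Julia trinkt gern Tee.", "Johannes war einer von vielen."]

# Discarding the return value leaves the input untouched (the bug in the question).
toy_preprocess(messages, remove_stop=True)

# Single message: wrap it in a list, then unpack the one-element result.
single, = toy_preprocess([messages[0]], remove_stop=True)

# Whole batch: pass the list and keep the returned list.
processed = toy_preprocess(messages, remove_stop=True)

print(messages)
print(single)
print(processed)
```

The trailing comma in single, = ... is tuple unpacking; it raises a ValueError if the returned list does not contain exactly one item, which makes a surprising result visible early.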