Trying to loop MeCab over text descriptions in a dataframe


(Update) As I noted in the comments, I eventually got the code to run successfully, but it still takes far too long, around 30 minutes. I would really appreciate help finding a more efficient way to write it.

I am trying to run code that analyzes the description column of a dataframe, but every time I run it I get a timeout error, probably because the dataframe has more than 200,000 rows and the code below is inefficient. Could someone help me understand what is wrong with the code?

import MeCab
import ipadic

list = df["description"].tolist()
new_list = []

for li in list:
    tagger = MeCab.Tagger(ipadic.MECAB_ARGS)
    node = tagger.parseToNode(li)
    keywords = []
    while node:
        if node.feature.split(",")[2] == "組織": #組織 means organization
            keywords.append(node.surface)
        node = node.next
    new_list.append(keywords)
    
df["description"] = new_list


Best answer, by polm23

Create the Tagger outside the loop.

tagger = MeCab.Tagger(ipadic.MECAB_ARGS)
for li in list:
    ...

The Tagger has a startup cost. It's small for a single call, but in the question's loop it is paid once per row, more than 200,000 times. The object is designed to be reused and shouldn't be recreated inside a loop like that.
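
For completeness, here is a minimal sketch of the question's loop with the Tagger hoisted out. The helper names extract_organizations and build_tagger are made up for this example and are not part of MeCab. The tagger is passed in as an argument so that it is built once and reused for every row. It assumes the same MeCab and ipadic packages the question already uses, and it adds a small length guard on the feature fields that the original code did not have.

```python
def extract_organizations(texts, tagger):
    """Collect, per text, the surfaces whose third feature field is 組織.

    `tagger` is any object exposing MeCab's parseToNode(); it is passed in so
    the caller builds it once and every row reuses it.
    """
    results = []
    for text in texts:
        node = tagger.parseToNode(text)
        keywords = []
        while node:
            fields = node.feature.split(",")
            # Same check as in the question (組織 means organization),
            # plus a length guard added for this sketch.
            if len(fields) > 2 and fields[2] == "組織":
                keywords.append(node.surface)
            node = node.next
        results.append(keywords)
    return results


def build_tagger():
    """Build one Tagger with the same arguments the question uses."""
    # Imported lazily so the helper above does not need MeCab to be importable.
    import MeCab
    import ipadic
    return MeCab.Tagger(ipadic.MECAB_ARGS)


# Usage, assuming `df` is the dataframe from the question:
# tagger = build_tagger()  # startup cost paid once, not once per row
# df["description"] = extract_organizations(df["description"].tolist(), tagger)
```

Whether this alone brings the runtime down far enough depends on the data; it only removes the repeated Tagger construction.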

Please see this post about using MeCab in Python, which has a section specifically on this problem.