How to speed up language-tool-python library usage


I have a pandas DataFrame with 3 million rows of social media comments. I'm using the language-tool-python library to find the number of grammatical errors in each comment. As far as I know, the library by default sets up a local LanguageTool server on your machine and queries responses from that.

Getting the number of grammatical errors just consists of creating an instance of the LanguageTool object and calling its .check() method with the string you want to check as a parameter.

>>> tool = language_tool_python.LanguageTool('en-US')
>>> text = 'A sentence with a error in the Hitchhiker’s Guide tot he Galaxy'
>>> matches = tool.check(text)
>>> len(matches)
2

So the method I used is df['body_num_errors'] = df['body'].apply(lambda row: len(tool.check(row))). Now I am pretty sure this works; it's quite straightforward. This single line of code has been running for the past hour.

Running the above example took 10-20 seconds, so with 3 million instances it might as well take virtually forever.

Is there any way I can cut my losses and speed this process up? Would iterating over every row and putting the whole thing inside a ThreadPoolExecutor help? Intuitively it makes sense to me, as it's an I/O-bound task.

I am open to any suggestions on how to speed up this process, and if the above method works, I would appreciate it if someone could show me some sample code.

Edit - Correction:

The 10-20 seconds includes the instantiation; calling the method itself is almost instantaneous.


4 Answers

Answer by jxmorris12 (accepted):

I'm the creator of language_tool_python. First, none of the comments here make sense. The bottleneck is in tool.check(); there is nothing slow about using pd.DataFrame.map().

LanguageTool is running on a local server on your machine. There are at least two major ways to speed this up:

Method 1: Initialize multiple servers

servers = []
for i in range(100):
    servers.append(language_tool_python.LanguageTool('en-US'))

Then call each server from a different thread. Or, alternatively, initialize each server within its own thread.
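A minimal sketch of that threading pattern, assuming the servers are handed out to worker threads through a queue (the server_pool queue, the num_errors helper, and the server count of 8 are illustrative choices, not part of the library):

import queue
from concurrent.futures import ThreadPoolExecutor

import language_tool_python

N_SERVERS = 8  # each LanguageTool() call starts its own local server

# Hand servers out through a queue so each check() call grabs a free server.
server_pool = queue.Queue()
for _ in range(N_SERVERS):
    server_pool.put(language_tool_python.LanguageTool('en-US'))

def num_errors(text):
    tool = server_pool.get()
    try:
        return len(tool.check(text))
    finally:
        server_pool.put(tool)  # return the server for the next thread

with ThreadPoolExecutor(max_workers=N_SERVERS) as executor:
    df['body_num_errors'] = list(executor.map(num_errors, df['body']))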

Method 2: Increase the thread count

LanguageTool takes a maxCheckThreads option – see the LT HTTPServerConfig documentation – so you could also try playing around with that? From a glance at LanguageTool's source code, it looks like the default number of threads in a single LanguageTool server is 10.
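For example, a sketch assuming the library's config dict forwards maxCheckThreads to the local server (the value 20 is arbitrary; verify the key against the supported config options):

import language_tool_python

# Assumption: 'maxCheckThreads' is passed through to the LT server configuration.
tool = language_tool_python.LanguageTool('en-US', config={'maxCheckThreads': 20})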

Answer by Taslim:

If you are worried about scaling up with pandas, switch to Dask instead. It integrates with pandas and will use multiple cores in your CPU, which I am assuming you have, instead of the single core that pandas uses. This helps parallelize the 3 million instances and can speed up your execution time. You can read more about Dask here or see an example here.
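A minimal sketch of that approach with dask.dataframe, assuming one LanguageTool server per partition (the partition count and the count_errors helper are illustrative; the column names mirror the question):

import dask.dataframe as dd
import language_tool_python

# Split the pandas DataFrame into partitions that Dask can process in parallel.
ddf = dd.from_pandas(df, npartitions=8)

def count_errors(partition):
    # One server per partition, so workers do not contend for a single instance.
    tool = language_tool_python.LanguageTool('en-US')
    try:
        return partition['body'].apply(lambda text: len(tool.check(text)))
    finally:
        tool.close()

ddf['body_num_errors'] = ddf.map_partitions(count_errors, meta=('body_num_errors', 'int64'))
df = ddf.compute()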

Answer by liakoyras:

In the documentation, we can see that language-tool-python has the configuration option maxSpellingSuggestions.

However, despite the name of the variable and the default value being 0, I have noticed that the code runs noticeably faster (almost twice as fast) when this parameter is actually set to 1.

I don't know where this discrepancy comes from, and the documentation does not mention anything specific about the default behavior. It is a fact, however, that this setting improves performance (at least for my own dataset, which I don't think affects the running time much).

Example initialization:

import language_tool_python

language_tool = language_tool_python.LanguageTool('en-US', config={'maxSpellingSuggestions': 1})
Answer by Rawan Jarrar:

Make sure to create the instance of the language tool only once. Then, for each row, call a method (or function, depending on your code pattern) that contains the rest of the logic:

tool = language_tool_python.LanguageTool('en-US')  # instantiate once, not per row
def num_errors(text):                              # per-row logic lives here
    return len(tool.check(text))
df['body_num_errors'] = df['body'].apply(num_errors)