I am attempting to remove stopwords from the 'reviews.text' column of a .csv file. When I run the code, it takes about 10 minutes to produce the output.
How do I speed up the run time?
import pandas as pd
from os import chdir, path
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')
chdir(path.dirname(__file__))
file_path = 'amazon_product_reviews.csv'
dataframe = pd.read_csv(
    file_path,
    dtype={
        'id': str, 'name': str, 'asins': str, 'brand': str, 'categories': str,
        'keys': str, 'manufacturer': str, 'reviews.date': str,
        'reviews.dateAdded': str, 'reviews.dateSeen': str,
        'reviews.didPurchase': str, 'reviews.doRecommend': str,
        'reviews.id': str, 'reviews.numHelpful': str, 'reviews.rating': str,
        'reviews.sourceURLs': str, 'reviews.text': str, 'reviews.title': str,
        'reviews.userCity': str, 'reviews.userProvince': str,
        'reviews.username': str,
    },
)
reviews_data = dataframe['reviews.text']
clean_data = dataframe.dropna(subset=['reviews.text'])
def preprocess_text(text):
    # Tokenize with spaCy, keep alphabetic non-stopword tokens, and lowercase them
    doc = nlp(text)
    cleaned_tokens = [token.text.lower() for token in doc if token.is_alpha and not token.is_stop]
    cleaned_text = ' '.join(cleaned_tokens)
    return cleaned_text
clean_data = clean_data.copy()
clean_data['processed_reviews'] = clean_data['reviews.text'].apply(preprocess_text)
print("Cleaned Data:")
print(clean_data[['reviews.text', 'processed_reviews']].head())
EDIT: I ran cProfile to see which areas of my code were taking the most time. My cProfile results are below:
302681427 function calls (296741014 primitive calls) in 294.594 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
10/1 0.000 0.000 294.659 294.659 {built-in method builtins.exec}
1 0.003 0.003 294.639 294.639 test3.py:10(main)
1 0.000 0.000 293.915 293.915 series.py:4769(apply)
1 0.000 0.000 293.915 293.915 apply.py:1409(apply)
1 0.000 0.000 293.915 293.915 apply.py:1482(apply_standard)
1 0.000 0.000 293.915 293.915 base.py:891(_map_values)
1 0.121 0.121 293.915 293.915 algorithms.py:1667(map_array)
34659 0.047 0.000 293.793 0.008 test3.py:29(preprocess_text)
34659 0.465 0.000 293.253 0.008 language.py:1016(__call__)
138636 32.197 0.000 242.236 0.002 trainable_pipe.pyx:40(__call__)
138636 0.531 0.000 205.376 0.001 model.py:330(predict)
4678965/277272 1.998 0.000 203.319 0.001 model.py:307(__call__)
1628973/138636 2.245 0.000 187.239 0.001 chain.py:48(forward)
242613 0.263 0.000 180.916 0.001 with_array.py:32(forward)
519885 157.488 0.000 157.731 0.000 numpy_ops.pyx:91(gemm)
346590 2.671 0.000 145.591 0.000 maxout.py:45(forward)
103977 0.291 0.000 132.341 0.001 with_array.py:70(_list_forward)
277272 0.548 0.000 127.896 0.000 residual.py:28(forward)
69318 0.632 0.000 107.110 0.002 tb_framework.py:33(forward)
You should be able to shave a lot off by only including the processors from nlp that you really need! I have a small test input set, so the results might not apply as dramatically to a larger one, but for me this gives a tremendous speedup:
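A minimal sketch of that idea, assuming the en_core_web_sm pipeline from the question (the component names below are the standard ones for that model; none of the trained components are needed just to check token.is_alpha and token.is_stop):

import spacy

# Load the model but disable the trained components (tok2vec, tagger, parser,
# ner, ...) that the profile shows eating nearly all the time. Tokenization and
# the lexical attributes is_alpha / is_stop still work without them.
nlp = spacy.load(
    'en_core_web_sm',
    disable=['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner'],
)
# Only add spacytextblob back if you actually need its sentiment scores for this step.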
A list of the pipeline processors, what they do and how to tweak them can be found at https://spacy.io/usage/processing-pipelines.
Other than that, calling apply is always slow business in pandas, but I at least don't know how to solve this particular problem with vectorized operations.
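One thing that does help with the per-row overhead is streaming the whole column through spaCy's nlp.pipe instead of calling nlp once per row inside apply, so spaCy can batch the work. A rough sketch, using the names from the question (the batch_size value is only illustrative):

# Feed the reviews to spaCy in batches rather than one nlp() call per row.
texts = clean_data['reviews.text'].tolist()
clean_data['processed_reviews'] = [
    ' '.join(tok.text.lower() for tok in doc if tok.is_alpha and not tok.is_stop)
    for doc in nlp.pipe(texts, batch_size=1000)
]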