My Python code takes 10 minutes to run in Visual Studio Code


I am attempting to remove stopwords from the 'reviews.text' column of a .csv file. When I run the code, it takes 10 minutes to produce the output.

How do I speed up the run time?

import pandas as pd
from os import chdir, path
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')

chdir(path.dirname(__file__))

file_path = 'amazon_product_reviews.csv'
dataframe = pd.read_csv(
    file_path,
    dtype={
        'id': str, 'name': str, 'asins': str, 'brand': str,
        'categories': str, 'keys': str, 'manufacturer': str,
        'reviews.date': str, 'reviews.dateAdded': str, 'reviews.dateSeen': str,
        'reviews.didPurchase': str, 'reviews.doRecommend': str, 'reviews.id': str,
        'reviews.numHelpful': str, 'reviews.rating': str, 'reviews.sourceURLs': str,
        'reviews.text': str, 'reviews.title': str, 'reviews.userCity': str,
        'reviews.userProvince': str, 'reviews.username': str,
    },
)

reviews_data = dataframe['reviews.text']

clean_data = dataframe.dropna(subset=['reviews.text'])

def preprocess_text(text):
    doc = nlp(text)

    # Keep lowercased alphabetic tokens that are not stopwords
    cleaned_tokens = [token.text.lower() for token in doc if token.is_alpha and not token.is_stop]

    cleaned_text = ' '.join(cleaned_tokens)

    return cleaned_text

clean_data = clean_data.copy()
clean_data['processed_reviews'] = clean_data['reviews.text'].apply(preprocess_text)

print("Cleaned Data:")
print(clean_data[['reviews.text', 'processed_reviews']].head())

EDIT: I ran cProfile to see which parts of my code were taking the most time.
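For reference, I invoked it roughly like this at the bottom of the script (a sketch; per the output below, the code lives in a main() function in test3.py):

if __name__ == '__main__':
    import cProfile
    # Profile main() and sort the report by cumulative time
    cProfile.run('main()', sort='cumulative')

My cProfile results: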

  302681427 function calls (296741014 primitive calls) in 294.594 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     10/1    0.000    0.000  294.659  294.659 {built-in method builtins.exec}
        1    0.003    0.003  294.639  294.639 test3.py:10(main)
        1    0.000    0.000  293.915  293.915 series.py:4769(apply)
        1    0.000    0.000  293.915  293.915 apply.py:1409(apply)
        1    0.000    0.000  293.915  293.915 apply.py:1482(apply_standard)
        1    0.000    0.000  293.915  293.915 base.py:891(_map_values)
        1    0.121    0.121  293.915  293.915 algorithms.py:1667(map_array)
    34659    0.047    0.000  293.793    0.008 test3.py:29(preprocess_text)
    34659    0.465    0.000  293.253    0.008 language.py:1016(__call__)
   138636   32.197    0.000  242.236    0.002 trainable_pipe.pyx:40(__call__)
   138636    0.531    0.000  205.376    0.001 model.py:330(predict)
4678965/277272    1.998    0.000  203.319    0.001 model.py:307(__call__)
1628973/138636    2.245    0.000  187.239    0.001 chain.py:48(forward)
   242613    0.263    0.000  180.916    0.001 with_array.py:32(forward)
   519885  157.488    0.000  157.731    0.000 numpy_ops.pyx:91(gemm)
   346590    2.671    0.000  145.591    0.000 maxout.py:45(forward)
   103977    0.291    0.000  132.341    0.001 with_array.py:70(_list_forward)
   277272    0.548    0.000  127.896    0.000 residual.py:28(forward)
    69318    0.632    0.000  107.110    0.002 tb_framework.py:33(forward)

1 Answer

Answered by Teemu Risikko:

You should be able to shave a lot off by only enabling the nlp pipeline components you really need!

I have a small test input set, so the results might not be as dramatic for a larger one, but for me this gives a tremendous speedup:

def preprocess_text(text):
    # Temporarily run only the tagger; every other pipeline component is skipped
    with nlp.select_pipes(enable="tagger"):
        return ' '.join(token.text.lower()
                        for token in nlp(text)
                        if token.is_alpha and not token.is_stop)

A list of the pipeline components, what they do, and how to tweak them can be found at https://spacy.io/usage/processing-pipelines.
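To see which components a loaded model actually includes, nlp.pipe_names lists them (the exact set depends on the model and spaCy version):

print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'spacytextblob']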

Other than that, calling apply is always slow business in pandas, but I don't know how to solve this particular problem with vectorized operations.
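That said, one partial mitigation (a sketch, not true vectorization; variable and column names follow the question's code): spaCy's nlp.pipe streams texts through the pipeline in batches, which avoids calling nlp() once per row via apply.

# Batch the reviews through the pipeline instead of calling nlp()
# once per row; batch_size is a tunable knob.
clean_data['processed_reviews'] = [
    ' '.join(token.text.lower()
             for token in doc
             if token.is_alpha and not token.is_stop)
    for doc in nlp.pipe(clean_data['reviews.text'], batch_size=64)
]

Combining this with select_pipes (or disabling unused components when loading the model) should compound the speedup.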