I am attempting to remove stopwords from the 'reviews.text' column of a .csv file. When I run the code, it takes about 10 minutes to produce the output.
How do I speed up the run time?
import pandas as pd
from os import chdir, path
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')
chdir(path.dirname(__file__))
file_path = 'amazon_product_reviews.csv'
dataframe = pd.read_csv(
    file_path,
    dtype={
        'id': str, 'name': str, 'asins': str, 'brand': str, 'categories': str,
        'keys': str, 'manufacturer': str, 'reviews.date': str,
        'reviews.dateAdded': str, 'reviews.dateSeen': str,
        'reviews.didPurchase': str, 'reviews.doRecommend': str,
        'reviews.id': str, 'reviews.numHelpful': str, 'reviews.rating': str,
        'reviews.sourceURLs': str, 'reviews.text': str, 'reviews.title': str,
        'reviews.userCity': str, 'reviews.userProvince': str,
        'reviews.username': str,
    },
)
reviews_data = dataframe['reviews.text']
clean_data = dataframe.dropna(subset=['reviews.text'])
def preprocess_text(text):
    # Tokenize with spaCy, keep alphabetic non-stopword tokens, and lowercase them
    doc = nlp(text)
    cleaned_tokens = [token.text.lower() for token in doc if token.is_alpha and not token.is_stop]
    cleaned_text = ' '.join(cleaned_tokens)
    return cleaned_text
clean_data = clean_data.copy()
clean_data['processed_reviews'] = clean_data['reviews.text'].apply(preprocess_text)
print("Cleaned Data:")
print(clean_data[['reviews.text', 'processed_reviews']].head())
EDIT: I ran cProfile to see which areas of my code were taking the most time. My cProfile results are below:
302681427 function calls (296741014 primitive calls) in 294.594 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
10/1 0.000 0.000 294.659 294.659 {built-in method builtins.exec}
1 0.003 0.003 294.639 294.639 test3.py:10(main)
1 0.000 0.000 293.915 293.915 series.py:4769(apply)
1 0.000 0.000 293.915 293.915 apply.py:1409(apply)
1 0.000 0.000 293.915 293.915 apply.py:1482(apply_standard)
1 0.000 0.000 293.915 293.915 base.py:891(_map_values)
1 0.121 0.121 293.915 293.915 algorithms.py:1667(map_array)
34659 0.047 0.000 293.793 0.008 test3.py:29(preprocess_text)
34659 0.465 0.000 293.253 0.008 language.py:1016(__call__)
138636 32.197 0.000 242.236 0.002 trainable_pipe.pyx:40(__call__)
138636 0.531 0.000 205.376 0.001 model.py:330(predict)
4678965/277272 1.998 0.000 203.319 0.001 model.py:307(__call__)
1628973/138636 2.245 0.000 187.239 0.001 chain.py:48(forward)
242613 0.263 0.000 180.916 0.001 with_array.py:32(forward)
519885 157.488 0.000 157.731 0.000 numpy_ops.pyx:91(gemm)
346590 2.671 0.000 145.591 0.000 maxout.py:45(forward)
103977 0.291 0.000 132.341 0.001 with_array.py:70(_list_forward)
277272 0.548 0.000 127.896 0.000 residual.py:28(forward)
69318 0.632 0.000 107.110 0.002 tb_framework.py:33(forward)
You should be able to shave a lot off by only including the processors from nlp that you really need! I have a small test input set, so the results might not apply as dramatically to a larger one, but for me this gives a tremendous speedup:
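A minimal sketch of that idea, assuming the en_core_web_sm pipeline from the question (the component names below are the standard ones for that model; none of the trained components are needed just to check token.is_alpha and token.is_stop):

import spacy

# Load the model but disable the trained components (tok2vec, tagger, parser,
# ner, ...) that the profile shows eating nearly all the time. Tokenization and
# the lexical attributes is_alpha / is_stop still work without them.
nlp = spacy.load(
    'en_core_web_sm',
    disable=['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner'],
)
# Only add spacytextblob back if you actually need its sentiment scores for this step.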
A list of the pipeline processors, what they do and how to tweak them can be found at https://spacy.io/usage/processing-pipelines.
Other than that, calling apply is always slow business in pandas, but I at least don't know how to solve this particular problem with vectorized operations.
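One thing that does help with the per-row overhead is streaming the whole column through spaCy's nlp.pipe instead of calling nlp once per row inside apply, so spaCy can batch the work. A rough sketch, using the names from the question (the batch_size value is only illustrative):

# Feed the reviews to spaCy in batches rather than one nlp() call per row.
texts = clean_data['reviews.text'].tolist()
clean_data['processed_reviews'] = [
    ' '.join(tok.text.lower() for tok in doc if tok.is_alpha and not tok.is_stop)
    for doc in nlp.pipe(texts, batch_size=1000)
]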