Improving the performance of re.findall() when called many times with a long pattern


I have a pd.DataFrame with about 10,000 rows, each containing a text for which I have to sum up all occurrences of words contained in a lexicon (the lexicon also has about 10k entries).
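For reference, here is a minimal sketch of the data layout (the column names 'review', 'term', and 'sentiment' match the code below; the sample values are just invented for illustration):

import pandas as pd

# Invented sample data mirroring the real shapes (~10k rows each in practice)
data = pd.DataFrame({'review': ['great movie, loved it',
                                'terrible plot and bad acting']})
lexicon = pd.DataFrame({'term': ['great', 'loved', 'terrible', 'bad'],
                        'sentiment': [1, 2, -2, -1]})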

I have written code that works, but it takes quite a long time on my hardware (around 6 to 8 minutes), and I strongly suspect there is a better way to do what I want.

The main culprit is the count_sentiments() function:

import re

import pandas as pd


def prepare_data(data: pd.DataFrame, lexicon: pd.DataFrame):
    """Calculate the needed features and write them to the provided dataframe"""

    # Filter the lexicon to create two lists of words
    positiveWords = lexicon[lexicon['sentiment'] > 0]['term'].astype(str).tolist()
    negativeWords = lexicon[lexicon['sentiment'] < 0]['term'].astype(str).tolist()

    # Create columns for our features 'pos_count', 'neg_count', 'contains_no', 'pron_count', 'contains_exclam', 'token_log'
    # The values get calculated by the applied function
    # apply() maps a function to all the members of the vector (the pd.Series object)

    # This takes around 2-3 minutes on my hardware
    data['pos_count'] = data['review'].apply(count_sentiments, args=(positiveWords,))

    # This takes around 4-5 minutes on my hardware
    data['neg_count'] = data['review'].apply(count_sentiments, args=(negativeWords,))

    return data

def count_sentiments(document, words):
    """Counts all positive/negative sentiment word occurrences in the document"""

    # re.escape guards against lexicon terms that contain regex metacharacters
    pattern = r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b'
    sentimentSum = len(re.findall(pattern, document))

    return sentimentSum

Any ideas would be appreciated, thanks in advance!
