Using a Word Counter in Python is understating results


As a preface, I am a complete beginner and still learning. Here's a sample schema of my product reviews table.

Record_ID   Product_ID   Product Review Comment
1234        89847457     I love this product it was shipped fast and is comfortable

And here is my code. It gives me a total word count across all of the reviews, as well as a count of phrases to try to get more context, e.g. ('flimsy', 'tight') if the fit of the shirt was tight and the quality was flimsy. The script writes the counts for both out to new CSV files.

import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import string
from collections import Counter
from nltk.util import ngrams
import nltk
nltk.download('punkt')
nltk.download('wordnet')  # required by WordNetLemmatizer

df = pd.read_excel('productsvydata.xlsx')

def preprocess_text(text):
    translator = str.maketrans('', '', string.punctuation)
    text = text.lower() 
    text = text.translate(translator)
    return text

word_counts = {}
phrase_counts = {}

unique_product_ids = df["Product_ID"].unique()

# Set the number of top words and phrases you want to keep
top_n = 100

for selected_product_id in unique_product_ids:
    selected_comments_df = df[df["Product_ID"] == selected_product_id]
    selected_comments = ' '.join(selected_comments_df["Product Review Comment"].astype(str))
    selected_comments = preprocess_text(selected_comments)
    if not selected_comments.strip():
        continue
    tokenized_words = nltk.word_tokenize(selected_comments)
    stop_words = set(ENGLISH_STOP_WORDS)
    filtered_words = [word for word in tokenized_words if word not in stop_words]
    lemmatizer = nltk.WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
    max_phrase_length = 4
    phrases = [phrase for n in range(2, max_phrase_length + 1) for phrase in ngrams(lemmatized_words, n)]
    word_counter = Counter(lemmatized_words)
    phrase_counter = Counter(phrases)

    # Get the top N words and phrases
    top_words = dict(word_counter.most_common(top_n))
    top_phrases = dict(phrase_counter.most_common(top_n))

    # Extract record_id for each Product_ID
    record_ids = selected_comments_df["record_id"].values[0]

    word_counts[(selected_product_id, record_ids)] = top_words
    phrase_counts[(selected_product_id, record_ids)] = top_phrases

word_result_data = []
phrase_result_data = []

for (product_id, record_id), top_words in word_counts.items():
    for word, count in top_words.items():
        word_result_data.append([product_id, record_id, word, count])
for (product_id, record_id), top_phrases in phrase_counts.items():
    for phrase, count in top_phrases.items():
        phrase_result_data.append([product_id, record_id, phrase, count])

word_df = pd.DataFrame(word_result_data, columns=['Product_ID', 'record_id', 'Word', 'Count'])
phrase_df = pd.DataFrame(phrase_result_data, columns=['Product_ID', 'record_id', 'Phrase', 'Count'])

word_df.to_csv('top_words_counts.csv', index=False)
phrase_df.to_csv('top_phrases_counts.csv', index=False)

I used top_n = 100 to just get roughly the top 100 words in the export, because there are over 20,000 rows of data and if I include all of the words and phrases, the script will not run. It needs to use both the product ID and the record ID because that's what it joins onto in my work tool.

The issue is that I feel the results are very understated, and I am wondering if it has to do with tokenization. For instance, right now I have 9 instances of the word 'customer' in this export, and in the phrase count, ('customer', 'service') comes up even less. If I just Ctrl+F through the raw Product Review Comments in the original document, there are way more instances of people talking about customer service. Something is going wrong in the processing, but I don't know what.

Would anyone be able to suggest ways to optimize this code and yield a larger (and more accurate) set of results? It's pretty basic NLP, but again, I'm new and want to learn, and I've hit a blocker with my output.
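For context, here is the kind of rough sanity check I'm comparing against: a sketch that just counts raw, case-insensitive substring matches in the comment column (same file and column name as in the script above), with no tokenization, stop-word removal, or lemmatization.

import pandas as pd

df = pd.read_excel('productsvydata.xlsx')

# Count raw substring hits, roughly what Ctrl+F through the comments would find
comments = df['Product Review Comment'].astype(str).str.lower()
print(comments.str.count('customer').sum())
print(comments.str.count('customer service').sum())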

1 Answer

Answered by NLP from scratch:

Though it makes lemmatization a little more difficult, I always recommend using sklearn's CountVectorizer, which includes stop-word removal, as opposed to doing things the hard way with nltk and base Python.

Also, you can use the apply method to run the preprocessing more efficiently on the whole review column at once. And you don't need to join all of the reviews for each product and then tokenize; doing the word / n-gram counts per record and then summing the counts grouped by Product_ID will work:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Preprocess the review column
df['Product Review Comment'] = df['Product Review Comment'].apply(preprocess_text)

# Instantiate CountVectorizer with built-in English stop-word removal
cv = CountVectorizer(stop_words='english', min_df=100, ngram_range=(1, 5))

# Create a document-term dataframe of word counts per record
dtm = cv.fit_transform(df['Product Review Comment'])
dtm_df = pd.DataFrame(dtm.toarray(), columns=cv.get_feature_names_out())

# Join to the original data
joined_df = pd.concat([df, dtm_df], axis=1)

# Sum the word counts per product (drop the non-count columns first)
word_count_df = (joined_df
                 .drop(['Record_ID', 'Product Review Comment'], axis=1)
                 .groupby('Product_ID')
                 .sum()
                 .reset_index())

# Flatten / convert the DTM from wide format to long format
long_df = pd.melt(word_count_df, id_vars=['Product_ID'], var_name='var', value_name='value')

# Find the top 100 terms per product (sort by count before taking the head)
top_terms_df = (long_df
                .sort_values(['Product_ID', 'value'], ascending=[True, False])
                .groupby('Product_ID')
                .head(100))

For lemmatization, you'll need to write your own tokenizer function that includes it and pass it to CountVectorizer via the tokenizer argument. Also, if you have a large dataset, you probably want to set min_df higher so your document-term matrix doesn't get exceedingly large; however, if you're only worried about the top terms over the whole dataset, then this should be fine.
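As a rough sketch (the function name lemma_tokenizer is just illustrative, and it assumes nltk's punkt and wordnet data have been downloaded), it might look something like this:

import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('punkt')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def lemma_tokenizer(text):
    # Tokenize, then lemmatize each token (lemmatize defaults to treating tokens as nouns)
    return [lemmatizer.lemmatize(tok) for tok in nltk.word_tokenize(text)]

# Pass the custom tokenizer in via the tokenizer argument; note that sklearn may warn
# that the 'english' stop-word list can be inconsistent with custom tokenization
cv = CountVectorizer(tokenizer=lemma_tokenizer, stop_words='english',
                     min_df=100, ngram_range=(1, 5))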

Hope this helps!