I have a large table of movie reviews, about 4,000 rows, so there's a lot going on. Here's a sample of the schema:
| MovieID | FreeformComment |
|---|---|
| 1234 | This was a great movie it made me happy |
| 5678 | This movie was bad, I hated it |
| 5678 | This movie was decent |
Below is my Python script. It gives me a total word count across all of the reviews, plus a count of phrases to try to get more context, e.g. ('scary', 'loved') if a reviewer loved a movie and found it scary, or ('short', 'value') if the movie was too short and they didn't feel they got their money's worth. The script writes the counts out to a new Excel doc.
import pandas as pd
import nltk
from nltk.corpus import stopwords
from collections import Counter
import string
from nltk.util import ngrams

# NLTK data needed for tokenizing, stopword removal, and lemmatizing
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')  # WordNetLemmatizer needs this corpus

lemmatizer = nltk.WordNetLemmatizer()

df = pd.read_excel('svydata.xlsx')
comments = ' '.join(df['FreeformComment'].astype(str))

# Tokenize, lowercase, strip punctuation, and drop the empty strings
# left behind by punctuation-only tokens
tokenized_words = nltk.word_tokenize(comments)
translator = str.maketrans('', '', string.punctuation)
tokenized_words = [word.lower().translate(translator) for word in tokenized_words]
tokenized_words = [word for word in tokenized_words if word]

stop_words = set(stopwords.words('english'))
filtered_words = [word for word in tokenized_words if word not in stop_words]
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]

# Build 2- and 3-word phrases (n-grams) over the combined text
max_phrase_length = 3  # adjust as needed
phrases = [phrase for n in range(2, max_phrase_length + 1)
           for phrase in ngrams(lemmatized_words, n)]

# Counter.most_common() already returns (item, count) pairs sorted by count
word_df = pd.DataFrame(Counter(lemmatized_words).most_common(), columns=['Word', 'Count'])
phrase_df = pd.DataFrame(Counter(phrases).most_common(), columns=['Phrase', 'Count'])

with pd.ExcelWriter('word_and_phrase_counts.xlsx', engine='xlsxwriter') as writer:
    word_df.to_excel(writer, sheet_name='Top Words', index=False)
    phrase_df.to_excel(writer, sheet_name='Top Phrases (2-3 words)', index=False)
What I want now is to put this into a data visualization where I can select a MovieID filter and see the counts for only that movie's comments, e.g. the most common word for movie 1234 was 'great' and the most common phrase was ('loved', 'acting'), or something like that.
The issue is that I need to tie the counts back to MovieID somehow, and I don't see a clean way to do that without re-thinking the data architecture. But with the data this large, there aren't many options. I'm a beginner and have been trying to find ways around this - would anyone be able to point me in the right direction?
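For what it's worth, here's a rough sketch of the direction I've been poking at: rerun the same cleanup per movie using a pandas groupby, and stack the results into one long (MovieID, Word, Count) table instead of a matrix. It reuses `translator`, `stop_words`, and `lemmatizer` from my script above, and I have no idea if this is the right architecture:

```python
def count_words(text_series):
    """Clean one movie's comments and return a Counter of word frequencies."""
    words = nltk.word_tokenize(' '.join(text_series.astype(str)))
    words = [w.lower().translate(translator) for w in words]
    words = [lemmatizer.lemmatize(w) for w in words if w and w not in stop_words]
    return Counter(words)

# One row per (movie, word) pair -- a "long" table instead of a wide matrix
rows = [
    (movie_id, word, count)
    for movie_id, group in df.groupby('MovieID')['FreeformComment']
    for word, count in count_words(group).items()
]
long_df = pd.DataFrame(rows, columns=['MovieID', 'Word', 'Count'])
long_df.to_excel('word_counts_by_movie.xlsx', index=False)
```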
I've tried joining in Excel, but it's just not sustainable for dashboarding. I've also tried making it a matrix, but there are too many words and too many reviews.
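If the long table above is the right idea, then I think the dashboard filter just becomes a plain filter on MovieID, something like this (assuming `long_df` from the sketch above):

```python
# The kind of query I'd want the MovieID filter to run:
# top 5 words for movie 1234
top_words_1234 = long_df[long_df['MovieID'] == 1234].nlargest(5, 'Count')
print(top_words_1234)
```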