I have a large table of movie reviews, about 4,000 rows, so there's a lot going on. Here's a sample of the schema:
| MovieID | FreeformComment |
|---|---|
| 1234 | This was a great movie it made me happy |
| 5678 | This movie was bad, I hated it |
| 5678 | This movie was decent |
Below is my Python script. It gives me a total word count across all of the reviews, plus a count of phrases to try to get more context, e.g. ('scary', 'loved') if a reviewer loved a movie and found it scary, or ('short', 'value') if the movie was too short and they didn't feel they got their money's worth. The script writes the counts out to a new Excel doc.
import pandas as pd
import nltk
from nltk.corpus import stopwords
from collections import Counter
import string
from nltk.util import ngrams

# NLTK data needed for tokenizing, stopword removal, and lemmatizing
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')  # WordNetLemmatizer needs this corpus

lemmatizer = nltk.WordNetLemmatizer()

df = pd.read_excel('svydata.xlsx')
comments = ' '.join(df['FreeformComment'].astype(str))

# Tokenize, lowercase, strip punctuation, and drop the empty strings
# left behind by punctuation-only tokens
tokenized_words = nltk.word_tokenize(comments)
translator = str.maketrans('', '', string.punctuation)
tokenized_words = [word.lower().translate(translator) for word in tokenized_words]
tokenized_words = [word for word in tokenized_words if word]

stop_words = set(stopwords.words('english'))
filtered_words = [word for word in tokenized_words if word not in stop_words]
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]

# Build 2- and 3-word phrases (n-grams) over the combined text
max_phrase_length = 3  # adjust as needed
phrases = [phrase for n in range(2, max_phrase_length + 1)
           for phrase in ngrams(lemmatized_words, n)]

# Counter.most_common() already returns (item, count) pairs sorted by count
word_df = pd.DataFrame(Counter(lemmatized_words).most_common(), columns=['Word', 'Count'])
phrase_df = pd.DataFrame(Counter(phrases).most_common(), columns=['Phrase', 'Count'])

with pd.ExcelWriter('word_and_phrase_counts.xlsx', engine='xlsxwriter') as writer:
    word_df.to_excel(writer, sheet_name='Top Words', index=False)
    phrase_df.to_excel(writer, sheet_name='Top Phrases (2-3 words)', index=False)
What I want now is to put this into a data visualization where I can select a MovieID filter and see the counts for only that movie's comments, e.g. the most common word for movie 1234 was 'great' and the most common phrase was ('loved', 'acting'), or something like that.
The issue is that I need to tie the counts back to MovieID somehow, and I don't see a clean way to do that without re-thinking the data architecture. But with the data this large, there aren't many options. I'm a beginner and have been trying to find ways around this - would anyone be able to point me in the right direction?
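For what it's worth, here's a rough sketch of the direction I've been poking at: rerun the same cleanup per movie using a pandas groupby, and stack the results into one long (MovieID, Word, Count) table instead of a matrix. It reuses `translator`, `stop_words`, and `lemmatizer` from my script above, and I have no idea if this is the right architecture:

```python
def count_words(text_series):
    """Clean one movie's comments and return a Counter of word frequencies."""
    words = nltk.word_tokenize(' '.join(text_series.astype(str)))
    words = [w.lower().translate(translator) for w in words]
    words = [lemmatizer.lemmatize(w) for w in words if w and w not in stop_words]
    return Counter(words)

# One row per (movie, word) pair -- a "long" table instead of a wide matrix
rows = [
    (movie_id, word, count)
    for movie_id, group in df.groupby('MovieID')['FreeformComment']
    for word, count in count_words(group).items()
]
long_df = pd.DataFrame(rows, columns=['MovieID', 'Word', 'Count'])
long_df.to_excel('word_counts_by_movie.xlsx', index=False)
```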
I've tried joining in Excel, but it's just not sustainable for dashboarding. I've also tried making it a matrix, but there are too many words and too many reviews.
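If the long table above is the right idea, then I think the dashboard filter just becomes a plain filter on MovieID, something like this (assuming `long_df` from the sketch above):

```python
# The kind of query I'd want the MovieID filter to run:
# top 5 words for movie 1234
top_words_1234 = long_df[long_df['MovieID'] == 1234].nlargest(5, 'Count')
print(top_words_1234)
```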