Word2Vec to calculate similarity of movies to high-performing movies


I have a dataset with user ratings for movies and movie descriptions, like this:

import pandas as pd

df = pd.DataFrame({
    'description': [
        'Two imprisoned men bond over a number of years',
        'A family heads to an isolated hotel for the winter',
        'In a future where technology controls everything',
        'A young lion prince flees his kingdom only to learn the true meaning of responsibility',
        'A group of intergalactic criminals are forced to work together to stop a fanatical warrior'
    ],
    'ratings': [8.7, 9.3, 7.9, 8.5, 8.1]
})
df

I want to use the description (along with other features) to predict the ratings of movies.

I am trying to use Word2Vec to calculate a similarity score that measures how similar a new movie is to past movies that performed well. My plan is to define the top-performing movies, calculate a similarity score against them for every movie in the dataset, and then feed the augmented dataset into another machine learning algorithm.

But I am having trouble calculating the similarity score (I've never used this method before).

from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# Create Tokens
df['tokenized_description'] = df['description'].apply(lambda x: word_tokenize(x.lower()))

# Train Word2Vec model
word2vec_model = Word2Vec(sentences=df['tokenized_description'], vector_size=100, window=5, min_count=1, workers=4)

# define top performing movies
threshold = df['ratings'].quantile(0.75)
highest_grossing_movies = df[df['ratings'] >= threshold]

# Tokenize descriptions of highest-grossing movies
highest_grossing_movies['tokenized_description'] = highest_grossing_movies['description'].apply(lambda x: word_tokenize(x.lower()))

# Convert the tokenized descriptions to embeddings
embeddings_high_grossing = highest_grossing_movies['description'].apply(lambda desc: word2vec_model.wv[word_tokenize(desc)]).tolist()

# Assess similarity for each movie description in the entire DataFrame
df['similarity_score'] = [word2vec_model.wv.similarity(df['description'])

When I run the code, I get the error:

KeyError: "Key 'Two' not present"

I'm sure the last line of the code is wrong, but I'm not sure how to correct it.


1 Answer

حمزة نبيل:

Make sure to convert the descriptions to lowercase before looking up their tokens. The Word2Vec model was trained on lowercased tokens (word_tokenize(x.lower())), so capitalized words such as 'Two' are not in its vocabulary:

# Lowercase, tokenize, and look up word embeddings for each description
embeddings_high_grossing = highest_grossing_movies['description'].apply(
    lambda desc: word2vec_model.wv[word_tokenize(desc.lower())]
).tolist()
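
That removes the KeyError, but the last line of your script still needs to produce one score per movie. One common approach, sketched below under the assumption that averaging word vectors is a good-enough document representation (the helper doc_vector is something I'm introducing here, not part of your code), is to average each description's word vectors into a single fixed-length document vector, take the centroid of the top-performing movies' vectors, and score every movie by cosine similarity to that centroid:

import numpy as np

def doc_vector(tokens, model):
    # Average the word vectors of the in-vocabulary tokens (zeros if none)
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

# One fixed-length vector per movie description
doc_vectors = np.vstack([
    doc_vector(tokens, word2vec_model) for tokens in df['tokenized_description']
])

# Centroid of the top-performing movies' document vectors
top_mask = (df['ratings'] >= df['ratings'].quantile(0.75)).to_numpy()
top_centroid = doc_vectors[top_mask].mean(axis=0)

# Cosine similarity of every movie to that centroid (guarding against zero norms)
norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(top_centroid)
df['similarity_score'] = (doc_vectors @ top_centroid) / np.where(norms == 0, 1.0, norms)

Note that averaging word vectors is only a simple baseline, and with a five-row toy dataset (and min_count=1) the embeddings will be close to noise; once the pipeline works end to end, gensim's Doc2Vec or pretrained embeddings are common next steps.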