Memory Error while Generating User-Movie Combinations for Content-Based Recommender


I'm working with the MovieLens dataset and want to build a content-based recommender that uses a machine learning model to predict ratings. To do this, I need to generate every user-movie combination, keeping the actual rating where a user has already rated a movie and assigning 0 where the user hasn't watched it.

I attempted the following method:

import pandas as pd
from itertools import product

# Get unique user IDs and movie IDs
all_users = df['userId'].unique()
all_movies = df['movieId'].unique()

# Generate all possible combinations of user IDs and movie IDs
all_combinations = product(all_users, all_movies)

# Convert the combinations into a new DataFrame
result_df = pd.DataFrame(all_combinations, columns=['userId', 'movieId'])

# Merge the new DataFrame with the original DataFrame to get the ratings where available
result_df = pd.merge(result_df, df, on=['userId', 'movieId'], how='left')

# Fill NaN values with 0 to represent movies that users haven't watched
result_df['rating'] = result_df['rating'].fillna(0)

# Display the resulting DataFrame
print(result_df)

However, I encountered a memory error. I also attempted to convert my DataFrame to a sparse matrix and then back to a DataFrame, but the memory issue persisted:

from scipy.sparse import csr_matrix

sparse_matrix = csr_matrix((df['rating'], (df['userId'], df['movieId'])))
dense_matrix = sparse_matrix.toarray()
df_back = pd.DataFrame(data=dense_matrix, index=df['userId'].unique(), columns=df['movieId'].unique())
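
For reference, the memory blow-up in the snippet above comes from `toarray()`, which allocates every user-movie cell. Below is a minimal sketch of keeping the ratings sparse from start to finish instead. It is not a definitive fix: it assumes the raw IDs are first mapped to contiguous positions with `pd.factorize` (raw IDs may have gaps, so using them directly as indices can leave empty rows and columns), and the tiny `df` is made-up data for illustration, not MovieLens.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Made-up stand-in for the ratings frame (assumption: columns userId, movieId, rating)
df = pd.DataFrame({
    "userId": [1, 1, 7, 42],
    "movieId": [10, 500, 10, 99999],
    "rating": [4.0, 3.5, 5.0, 2.0],
})

# Map raw IDs to contiguous 0..n-1 positions; the second value keeps the original IDs
user_codes, user_ids = pd.factorize(df["userId"])
movie_codes, movie_ids = pd.factorize(df["movieId"])

# One row per user, one column per movie; only the rated cells are stored
ratings = csr_matrix(
    (df["rating"].to_numpy(), (user_codes, movie_codes)),
    shape=(len(user_ids), len(movie_ids)),
)

# Rough size comparison: dense float64 grid vs. the three arrays CSR actually holds
dense_bytes = ratings.shape[0] * ratings.shape[1] * 8
sparse_bytes = ratings.data.nbytes + ratings.indices.nbytes + ratings.indptr.nbytes
print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")

# Unrated cells are implicit zeros, so one user's row can be read without densifying
row = ratings.getrow(0)
rated_movies = movie_ids[row.indices]
print(user_ids[0], list(rated_movies), list(row.data))
```

Many scikit-learn estimators accept a SciPy sparse matrix as input, so the dense DataFrame may not be needed at all. Whether that fits depends on the model being trained.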

I'm looking for a solution to avoid this problem without scaling horizontally or vertically because I'm working on my personal computer. Any suggestions would be greatly appreciated. Thanks!
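
One direction that stays on a single machine is to build the user-movie grid one block of users at a time, so that only a slice of the combinations exists in memory at once. The following is a rough sketch under stated assumptions: `iter_user_movie_chunks` and `users_per_chunk` are hypothetical names introduced here, the toy `df` is made-up data, and each (userId, movieId) pair is assumed to appear at most once in the ratings.

```python
import numpy as np
import pandas as pd

def iter_user_movie_chunks(df, users_per_chunk=1000):
    """Yield the user x movie grid one block of users at a time (hypothetical helper)."""
    all_users = df["userId"].unique()
    all_movies = df["movieId"].unique()
    ratings = df[["userId", "movieId", "rating"]]
    for start in range(0, len(all_users), users_per_chunk):
        users = all_users[start:start + users_per_chunk]
        # Cross product for this block only: each user repeated once per movie
        grid = pd.DataFrame({
            "userId": np.repeat(users, len(all_movies)),
            "movieId": np.tile(all_movies, len(users)),
        })
        # Attach known ratings; pairs without a rating become 0
        chunk = grid.merge(ratings, on=["userId", "movieId"], how="left")
        chunk["rating"] = chunk["rating"].fillna(0)
        yield chunk

# Made-up stand-in for the ratings frame
df = pd.DataFrame({
    "userId": [1, 1, 7, 42],
    "movieId": [10, 500, 10, 99999],
    "rating": [4.0, 3.5, 5.0, 2.0],
})

chunks = list(iter_user_movie_chunks(df, users_per_chunk=2))
for chunk in chunks:
    print(chunk.shape)
```

In real use each chunk would be consumed (for example, written to disk or fed to an incremental learner) and then discarded rather than collected in a list. The list here only keeps the toy example easy to inspect.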
