I'm trying to implement collaborative filtering in Scala and Spark for a personal project. I'm using the [Game Recommendations on Steam](https://www.kaggle.com/datasets/antonkozyriev/game-recommendations-on-steam/data) dataset from Kaggle, a large set containing games, users, reviews, etc.
What I would like to do is create a simple filter that, given a user id as input, returns the N most similar users, based on three columns of the dataset (`user_id`, `app_id`, `hours`). I've tried the ALS model from Spark MLlib, but I can only get game recommendations for a user, not the users most similar to a given user.
This is the code I've tried so far. Can anyone help me out?
```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.recommendation.ALS

/* Load data (spark and csvFilePath are defined elsewhere) */
val rawData = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(csvFilePath)
val limitedData = rawData.limit(200000)

/* Pre-processing: map user and app ids to contiguous indices */
val userIndexer = new StringIndexer()
  .setInputCol("user_id")
  .setOutputCol("user_id_indexed")
val userIndexedData = userIndexer.fit(limitedData).transform(limitedData)

val appIndexer = new StringIndexer()
  .setInputCol("app_id")
  .setOutputCol("app_id_indexed")
val data = appIndexer.fit(userIndexedData).transform(userIndexedData)

/* Train ALS, treating hours played as implicit feedback */
val als = new ALS()
  .setUserCol("user_id_indexed")
  .setItemCol("app_id_indexed")
  .setRatingCol("hours")
  .setRank(10)
  .setMaxIter(10)
  .setRegParam(0.1)
  .setImplicitPrefs(true)
val model = als.fit(data)

/* Generate 5 game recommendations for one user */
import spark.implicits._
val userId = 0
val userSubset = Seq(userId).toDF("user_id_indexed") // single-row DataFrame holding the indexed user id
val recommendations = model.recommendForUserSubset(userSubset, 5)
recommendations.show()

spark.stop()
```
Finally, to give a clearer idea of what I want to achieve, here is a small Python snippet that implements exactly this idea for the same dataset.
```python
from scipy.sparse import coo_matrix
from sklearn.neighbors import NearestNeighbors

# recommendations_df is a pandas DataFrame with user_id, app_id and hours columns
# Encode user and game ids as contiguous integer codes
user_ids = recommendations_df['user_id'].astype('category').cat.codes
item_ids = recommendations_df['app_id'].astype('category').cat.codes

# Get the unique user and game ids (position = integer code)
unique_user_ids = recommendations_df['user_id'].astype('category').cat.categories
unique_item_ids = recommendations_df['app_id'].astype('category').cat.categories

# Create a sparse user x game matrix of hours played
user_game_matrix = coo_matrix((recommendations_df['hours'], (user_ids, item_ids)))

# Fit the model
model_knn = NearestNeighbors(metric='cosine', algorithm='brute')
model_knn.fit(user_game_matrix)

# Get the 5 users nearest to the first user; ask for 6 neighbors and drop
# the first hit, which is normally the query user itself
distances, indices = model_knn.kneighbors(user_game_matrix.getrow(0), n_neighbors=6)
recommended_users = [unique_user_ids[i] for i in indices.flatten()[1:]]
print(f'Recommended users for the first user are: {recommended_users}')
```
Output:
Recommended users for the first user are: [3123620, 5031804, 1543163, 2829043, 1943227]
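
To make the target behavior concrete without Spark or scikit-learn, here is a minimal pure-Python sketch on made-up data. The rows and the helper names (`build_profiles`, `cosine`, `similar_users`) are hypothetical and only illustrate the computation. It uses the same cosine-similarity idea as the snippet above and is not meant to scale to the full dataset.

```python
import math
from collections import defaultdict

# Made-up (user_id, app_id, hours) rows, for illustration only.
rows = [
    (1, 10, 5.0), (1, 20, 1.0),
    (2, 10, 10.0), (2, 20, 2.0),
    (3, 30, 7.0),
    (4, 10, 1.0), (4, 30, 1.0),
]


def build_profiles(rows):
    """Group rows into one sparse vector per user: {user_id: {app_id: hours}}."""
    profiles = defaultdict(dict)
    for user_id, app_id, hours in rows:
        profiles[user_id][app_id] = hours
    return dict(profiles)


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty profile is treated as similar to nobody
    return dot / (norm_a * norm_b)


def similar_users(profiles, user_id, n):
    """Return up to n (other_user_id, similarity) pairs, most similar first."""
    target = profiles[user_id]
    scored = [
        (other, cosine(target, vector))
        for other, vector in profiles.items()
        if other != user_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


profiles = build_profiles(rows)
for other, score in similar_users(profiles, 1, 2):
    print(other, round(score, 3))
```

With this toy data, user 2 ranks first for user 1 because their hours are proportional (same games, same ratio), and user 4 comes second because they share only one game. This is the kind of output I'd like to get from Spark: a ranked list of user ids rather than a list of games.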