Content-Based Filtering for Tagged Posts

Context

Goal

I want to predict how a user will react to new Posts based on their previous Post Reactions. In Naive Bayes terms, Posts are the Documents and Reactions are the Classes: FavLike, Fav, Like, Dislike, and None. (FavLike exists to keep the classes mutually exclusive; FavDislike is possible, but, for obvious reasons, I don't consider it.)
With those predictions, I want to sort a batch of new Posts by each Reaction's probability and, in the future, maybe also by a combined score based on those probabilities.

For that, I'm currently using Multinomial Naive Bayes and Logistic Regression.

Data & Evaluation

Firstly, there's the RawData:

const RawData = {
  // <Category>: {<Post>: [<Tag>]}
  None: {
    1234: ['tag', 'beach', 'photography', 'etc'],
    // [...]
  },
  // [...]
};

It is then processed into Counts:

const Counts = {
  // Totals = number of Documents in each category, not number of Tags
  Totals: [671, 587, 1_310, 3_353, 66_994], // Actual data sample
  Tags: {
    'tag': [0, 0, 0, 4, 41],
    'photography': [601, 507, 1_080, 2_552, 56_711],
    'etc': [0, 0, 0, 0, 1],
    // [...]
  }
};
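For illustration, here is a minimal sketch of how RawData could be turned into Counts. The function name buildCounts, the CATEGORIES order, and the toy input are assumptions made for this example, not the actual implementation.

```javascript
// Assumed order of the Reaction classes in every count array.
const CATEGORIES = ['FavLike', 'Fav', 'Like', 'Dislike', 'None'];

// Turn {<Category>: {<Post>: [<Tag>]}} into per-category document counts.
function buildCounts(rawData, categories = CATEGORIES) {
  // Object.create(null) avoids clashes with Tags named like 'constructor'.
  const counts = { Totals: categories.map(() => 0), Tags: Object.create(null) };
  categories.forEach((category, i) => {
    const posts = rawData[category] || {};
    for (const tags of Object.values(posts)) {
      counts.Totals[i] += 1; // one Document per Post
      // A Tag is either in a Post or not, so count it once per Post.
      for (const tag of new Set(tags)) {
        if (!(tag in counts.Tags)) counts.Tags[tag] = categories.map(() => 0);
        counts.Tags[tag][i] += 1;
      }
    }
  });
  return counts;
}

// Hypothetical toy input, not real data.
const toyRaw = {
  Like: { 1: ['beach', 'photography'] },
  None: { 2: ['photography'], 3: ['tag', 'photography'] },
};
const toyCounts = buildCounts(toyRaw);
console.log(toyCounts.Totals);           // [ 0, 0, 1, 0, 2 ]
console.log(toyCounts.Tags.photography); // [ 0, 0, 1, 0, 2 ]
```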

After that, it is preprocessed into Bayes (using log-probabilities to save computation during evaluation, and with Laplace smoothing):

const Bayes = {
  Priors: [-1.59905, -1.61782, -1.48773, -1.20228, -0.06781],
  Tags: {
    'tag': [-1.69897, -1.69897, -1.69897, -1, -0.075720],
    // [...]
  }
};
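The 'tag' row above can be reproduced from its Counts row [0, 0, 0, 4, 41] with add-one smoothing and a base-10 logarithm, normalised across the five classes. The sketch below shows that calculation; the normalisation is inferred from the example numbers only, and the function name smoothedLogShares is hypothetical.

```javascript
// Log10 of the Laplace-smoothed share of each class within one count row.
// Assumption: the row is normalised across classes (inferred from the example).
function smoothedLogShares(row, alpha = 1) {
  const total = row.reduce((sum, n) => sum + n, 0);
  const denominator = total + alpha * row.length;
  return row.map((n) => Math.log10((n + alpha) / denominator));
}

const tagRow = smoothedLogShares([0, 0, 0, 4, 41]);
console.log(tagRow.map((x) => x.toFixed(5)));
// [ '-1.69897', '-1.69897', '-1.69897', '-1.00000', '-0.07572' ]
```

Note that a textbook Naive Bayes likelihood would instead divide each count by its own class total, so the choice of denominator is worth double-checking.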

Then, after evaluating a Post, the log-probabilities go through a Standard Scaler and a Logistic Regressor to get actual probabilities.
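As a rough sketch of the evaluation and scaling steps (the Logistic Regressor itself is left out), a Post's score per class can be taken as the log-prior plus the log-probabilities of its Tags, and the scaling as a per-class z-score over a batch. The names scorePost and standardScale and the policy of skipping unseen Tags are assumptions for illustration.

```javascript
// Sum the log-prior and the log-probability of every known Tag, per class.
function scorePost(tags, bayes) {
  const scores = [...bayes.Priors];
  for (const tag of new Set(tags)) {
    // Unseen Tags are skipped here; that is only one possible policy.
    if (!Object.prototype.hasOwnProperty.call(bayes.Tags, tag)) continue;
    bayes.Tags[tag].forEach((logP, i) => { scores[i] += logP; });
  }
  return scores;
}

// Standard scaling: z = (x - mean) / std, per class, over a batch of score vectors.
// Uses the population standard deviation; a zero deviation is replaced by 1.
function standardScale(batch) {
  const n = batch.length;
  const dims = batch[0].length;
  const mean = Array.from({ length: dims }, (_, j) =>
    batch.reduce((sum, v) => sum + v[j], 0) / n);
  const std = mean.map((m, j) =>
    Math.sqrt(batch.reduce((sum, v) => sum + (v[j] - m) ** 2, 0) / n) || 1);
  return batch.map((v) => v.map((x, j) => (x - mean[j]) / std[j]));
}

// Hypothetical two-class toy model, not real data.
const toyBayes = { Priors: [-1, -0.5], Tags: { a: [-0.3, -1], b: [-2, -0.1] } };
const toyBatch = [['a'], ['a', 'b']].map((tags) => scorePost(tags, toyBayes));
console.log(standardScale(toyBatch));
```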

Things To Consider

  • A Post is a set of Tags (Events)
    • Tag order doesn't matter
    • Tag either is in a Post, or it isn't
    • The number of Tags in a Post is variable
  • Tags may imply one another (an insignificant problem, though)
    • 'red_shirt' implies 'shirt'
  • Space is limited: after processing the RawData, only Counts is stored; Bayes is calculated on startup, and when the user reacts to a new Post, its Tags are added to Counts and Bayes is updated
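The incremental update described in the last point could look roughly like the sketch below. The function name addReaction and the classIndex parameter are hypothetical; which Bayes rows need recomputing afterwards depends on the smoothing formula in use.

```javascript
// Add one reacted Post to Counts in place and report which Tag rows changed.
function addReaction(counts, tags, classIndex) {
  counts.Totals[classIndex] += 1; // the Priors change as well
  const touched = [...new Set(tags)]; // a Tag counts once per Post
  for (const tag of touched) {
    if (!Object.prototype.hasOwnProperty.call(counts.Tags, tag)) {
      counts.Tags[tag] = counts.Totals.map(() => 0); // first time this Tag is seen
    }
    counts.Tags[tag][classIndex] += 1;
  }
  return touched;
}

// Hypothetical two-class toy state, not real data.
const liveCounts = { Totals: [1, 2], Tags: { a: [1, 0] } };
const changed = addReaction(liveCounts, ['a', 'b', 'b'], 1);
console.log(changed);    // [ 'a', 'b' ]
console.log(liveCounts); // { Totals: [ 1, 3 ], Tags: { a: [ 1, 1 ], b: [ 0, 1 ] } }
```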

Where problems arise

  • Reaction classes are severely imbalanced
    • None outnumbers all the others combined (the None count is about 67k, while the others together come to roughly 6k in the sample above)
  • Tags are really sparse
    • While some Tags have thousands of occurrences, the majority have only a handful, with 0 counts for all but 1 or 2 Reaction classes
  • Posts with lots of Tags will get really low log-probabilities
    • The mean for each Reaction: [-307.23473004, -314.77733785, -270.50181329, -206.64016692, -8.65777954]
    • And their standard deviation: [150.89012448, 154.35610454, 134.04458405, 101.16143351, 6.68529216]
    • Sorting Posts by these raw values essentially just sorts them by their number of Tags
    • This can be mitigated by relativising the log-probs (I did this with Logistic Regression)
  • Even with Logistic Regression, the final probabilities are still too close to the priors of each category
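Two common ways to make such log-scores less dependent on the number of Tags are sketched below: normalising across classes so that only the differences between classes matter, and averaging per Tag. Both are illustrative alternatives under the assumption of base-10 log-scores, not the Logistic Regression step itself, and the function names are hypothetical.

```javascript
// Convert per-class log10 scores into probabilities that sum to 1.
// Subtracting the maximum first avoids underflow for very negative scores.
function posteriorFromLog10(scores) {
  const max = Math.max(...scores);
  const weights = scores.map((s) => 10 ** (s - max));
  const total = weights.reduce((a, b) => a + b, 0);
  return weights.map((w) => w / total);
}

// Average log-score per Tag, so that long and short Posts are comparable.
function perTagMean(scores, tagCount) {
  return scores.map((s) => s / Math.max(tagCount, 1));
}

// Only the gap between classes matters, not the absolute magnitude:
console.log(posteriorFromLog10([-300, -301]).map((p) => p.toFixed(4))); // [ '0.9091', '0.0909' ]
console.log(posteriorFromLog10([0, -1]).map((p) => p.toFixed(4)));      // [ '0.9091', '0.0909' ]
console.log(perTagMean([-10, -20], 5));                                 // [ -2, -4 ]
```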

Previous Iterations

What I described above is where I'm currently at, but I had one previous attempt that worked more or less decently:

I merged the categories into just two, one positive, one negative:

  • Positive = 3 * FavLike + 2 * Fav + 1 * Like
  • Negative = 1 * Dislike (At the time, I didn't have access to the None Posts)

Depending on the Reaction of a Post, I added that Reaction's weight (the multiplier above) to each of its Tags' counts, instead of 1.

Then, to evaluate, I just subtracted the log-probabilities of Positive and Negative.
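A minimal sketch of that two-class scheme, assuming count rows ordered [FavLike, Fav, Like, Dislike, ...], add-one smoothing, and Positive minus Negative as the direction of the subtraction. Merging per-class counts with these weights gives the same totals as adding the weight at counting time. All function names are hypothetical.

```javascript
// Merge a per-class row into weighted Positive and Negative counts.
function mergeRow([favLike, fav, like, dislike]) {
  return { positive: 3 * favLike + 2 * fav + 1 * like, negative: 1 * dislike };
}

// Smoothed log10-odds of one Tag: Positive log-probability minus Negative one.
function tagLogOdds(row, totals, alpha = 1) {
  const tag = mergeRow(row);
  const all = mergeRow(totals);
  const logPos = Math.log10((tag.positive + alpha) / (all.positive + 2 * alpha));
  const logNeg = Math.log10((tag.negative + alpha) / (all.negative + 2 * alpha));
  return logPos - logNeg;
}

// Score a Post by summing the log-odds of its known Tags (higher = more positive).
function postLogOdds(tags, counts) {
  let score = 0;
  for (const tag of new Set(tags)) {
    if (!Object.prototype.hasOwnProperty.call(counts.Tags, tag)) continue;
    score += tagLogOdds(counts.Tags[tag], counts.Totals);
  }
  return score;
}

// Using the Totals sample from above:
console.log(mergeRow([671, 587, 1310, 3353, 66994])); // { positive: 4497, negative: 3353 }
```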

Actual Question

Given all that, how can I improve the Reaction prediction of my Recommender System, so that I can sort Posts by each category?

Final Notes

If it still fits within my restrictions (see Things To Consider), I may consider switching away from Naive Bayes entirely, but I'd prefer to keep using some of my existing work.

I'm currently looking into Complement Naive Bayes, so I'm already aware of that possibility.

I'm a programmer by trade, so I'd appreciate tips on how to implement your suggestions in code, and please keep the mathematics to a minimum.
(I'm using JavaScript (Yes, really...), but I'm familiar with several languages: Python, Rust, C#, Java, Go, C, C++ etc)
