I'm implementing an algorithm that returns the posts that are popular at the moment, based on their likes and dislikes.
To do this, for each post I add up all its likes (1) and dislikes (-1) to get its score, but each like/dislike is weighted: the more recent, the heavier. For example, at the moment a user likes a post, that like weighs 1. After 1 day it weighs 0.95 (or -0.95 if it's a dislike), after 2 days 0.90, and so on, down to a minimum of 0.01 reached after 21 days. (PS: these are approximate values.)
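In other words, the intended weight is (approximately) a linear decay with a floor, sketched here assuming the 21-day horizon described above:

```latex
w(t) = \max\!\left(1 - \frac{t}{21},\ 0.01\right), \qquad t = \text{age of the vote in days}
```

After 1 day this gives 1 - 1/21 ≈ 0.952, after 2 days 1 - 2/21 ≈ 0.905, matching the approximate values above; from 21 days on the weight stays at the 0.01 floor.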
Here is how my tables are structured:
Posts table
id | Title        | user_id | ...
---------------------------------
1  | Random post  | 10      | ...
2  | Another post | 36      | ...
n  | ...          | n       | ...
Likes table
id | vote | post_id | user_id | created
---------------------------------------------
1  |  1   | 2       | 10      | 2014-08-18 15:34:20
2  | -1   | 1       | 24      | 2014-08-15 18:54:12
3  |  1   | 2       | 54      | 2014-08-17 21:12:48
Here is the SQL query I'm currently using, which does the job:
SELECT Post.*,
       SUM(`Like`.vote *
           (1 - IF(TIMESTAMPDIFF(MINUTE, `Like`.created, NOW()) / 60 / 24 / 21 > 0.99,
                   0.99,
                   TIMESTAMPDIFF(MINUTE, `Like`.created, NOW()) / 60 / 24 / 21))
       ) AS score
FROM posts Post
LEFT JOIN likes `Like` ON Post.id = `Like`.post_id
GROUP BY Post.id
ORDER BY score DESC
PS: I'm using TIMESTAMPDIFF with MINUTE rather than DAY because with DAY it returns an integer, and I want a float value so the weight decays gradually over time rather than day by day. So TIMESTAMPDIFF(MINUTE, Like.created, NOW()) / 60 / 24 gives me the number of days elapsed since the like was created, including the decimal part.
Here are my questions:
- Look at the IF(expr1, expr2, expr3) part: it is necessary in order to set a minimal value for the like's weight, so it never goes below 0.01 or becomes negative (and so a like, even an old one, still has a little weight). But I'm calculating the same thing twice: the same TIMESTAMPDIFF expression appears in both the condition and the else branch. Is there a way to avoid this duplicated expression?
- I was going to cache this query's result and refresh it every 5 minutes, as I think it will be pretty heavy on big Post and Like tables. Is the cache really necessary or not? I'm aiming to run this query on a table with 50 000 posts, each with around 200 associated likes (which makes a 10 000 000-row Like table).
- Should I create an index on the Like table for post_id? And for created?
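Regarding the first question, one way to avoid the duplicated expression (a sketch, assuming MySQL's built-in LEAST() function) is to cap the decay fraction with LEAST() instead of IF(), so the age expression is written only once:

```sql
-- LEAST(x, 0.99) caps the decay fraction at 0.99, so the weight
-- 1 - LEAST(...) never drops below 0.01; the TIMESTAMPDIFF
-- expression now appears only once.
SELECT Post.*,
       SUM(`Like`.vote *
           (1 - LEAST(TIMESTAMPDIFF(MINUTE, `Like`.created, NOW()) / 60 / 24 / 21, 0.99))
       ) AS score
FROM posts Post
LEFT JOIN likes `Like` ON Post.id = `Like`.post_id
GROUP BY Post.id
ORDER BY score DESC;
```

The alias is backtick-quoted because LIKE is a reserved word in MySQL.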
Thank you!
EDIT: Imagine a Post can have multiple tags, and each tag can belong to multiple posts. If I want to get popular Posts for a given Tag (or several Tags), I can't cache each query, since there is a large number of possible queries. Is the query still viable in that case?
EDIT FOR FINAL SOLUTION: I finally ran some tests. I created a Post table with 30 000 entries and a Like table with 250 000 entries. Without indexes, the query was incredibly slow (timed out, > 10 min), but with indexes on Post.id (primary), Like.id (primary) and Like.post_id it took ~0.5 s.
So I'm not caching the data, nor refreshing it every 5 min. If the tables keep growing, caching remains a possible fallback (anything over 1 s is not acceptable).
10000 and 50000 rows are considered small on current hardware. With those table sizes you probably won't need any cache, unless the query runs several times per second. In any case, I would run a performance test before deciding to add a cache.
I would create an index for (post_id, created, vote). That way the query can get all information from the index and doesn't need to read the table at all.
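As a sketch, the suggested covering index would look like this (the index name is my own invention):

```sql
-- Covering index: the join column (post_id), the decay input (created)
-- and the summed value (vote) can all be read from the index itself,
-- without touching the table rows.
CREATE INDEX idx_likes_post_created_vote ON likes (post_id, created, vote);
```

Listing post_id first lets the join probe the index; created and vote are included so the score calculation needs no table lookups.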
Edit (response to comments):
An extra index will slow down inserts/updates slightly. In the end, the path you choose will dictate what you need in terms of CPU/RAM/Disk I/O. If you have enough RAM that you can expect the entire Like table to be cached in memory, you might be better off with an index on just post_id.

In terms of total load, you need to consider the ratio between inserts and selects, and the relative cost of each with and without the index. My gut feeling is that the total load will be lower with the index.

Regarding your question on concurrency (selecting and inserting simultaneously): what happens depends on the isolation level. The general advice is to keep inserts/updates as short as possible. If you don't do unnecessary things between the start of the insert and the commit, you should be fine.