Rewrite SQL query to Fix Functional Dependency Issue Caused By MySQL 5.7 Strict Mode

167 Views Asked by At

I recently upgraded my MySQL server to version 5.7 and the following example query does not work:

SELECT * 
FROM (SELECT * 
        FROM exam_results 
        WHERE exam_body_id = 6674 
        AND exam_date >= DATE_SUB(CURDATE(), INTERVAL 1 WEEK) 
        AND subject_ids LIKE '%4674%' 
        ORDER BY score DESC 
    ) AS top_scores 
GROUP BY user_id 
ORDER BY percent_score DESC, time_advantage DESC 
LIMIT 10

The query is supposed to select exam results from the specified table matching the top scorers who wrote a particular exam within some time interval. The reason why I had to include a GROUP BY clause when I first wrote the query was to eliminate duplicate users, i.e. users who have more than one top score from writing the exam within the same time period. Without eliminating duplicate user IDs, a query for the top 10 high scorers could return exam results from the same person.

My question is: how do I rewrite this query to remove the error associated with MySQL 5.7 strict mode enforced on GROUP BY clauses while still retaining the functionality I want?

3

There are 3 best solutions below

6
On BEST ANSWER

That is because you never really wanted aggregation to begin with. So, you used a MySQL extension that allowed your syntax -- even though it is wrong by the definition of SQL: The GROUP BY and SELECT clauses are incompatible.

You appear to want the row with the maximum score for each user meeting the filtering conditions. A much better approach is to use window functions:

SELECT er.* 
FROM (SELECT er.*,
             ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY score DESC) as seqnum
      FROM exam_results er 
      WHERE exam_body_id = 6674 AND
            exam_date >= DATE_SUB(CURDATE(), INTERVAL 1 WEEK) AND
            subject_ids LIKE '%4674%' 
    ) er
WHERE seqnum = 1
ORDER BY percent_score DESC, time_advantage DESC 
LIMIT 10;

You can do something similar in older versions of MySQL. Probably the closest method uses variables:

SELECT er.*,
       (@rn := if(@u = user_id, @rn + 1,
                  if(@u := user_id, 1, 1)
                 )
       ) as rn
FROM (SELECT er.*
      FROM exam_results 
      WHERE exam_body_id = 6674 AND
            exam_date >= DATE_SUB(CURDATE(), INTERVAL 1 WEEK) AND
            subject_ids LIKE '%4674%' 
      ORDER BY user_id, score DESC
     ) er CROSS JOIN
     (SELECT @u := -1, @rn := 0) params
HAVING rn = 1
ORDER BY percent_score DESC, time_advantage DESC 
LIMIT 10
2
On

An alternative to Gordon's answer using user-defined variables and a CASE conditional statement for older versions of MySQL is as follows:

SELECT *
    FROM (
        SELECT *,
            @row_number := CASE WHEN @user_id <> er.user_id 
                                THEN 1 
                                ELSE @row_number + 1 END 
                           AS row_number,
            @user_id := er.user_id
        FROM exam_results er
        CROSS JOIN (SELECT @row_number := 0, @user_id := null) params
            WHERE exam_body_id = 6674 AND
            exam_date >= DATE_SUB(CURDATE(), INTERVAL 1 WEEK) AND
            subject_ids LIKE '%4674%' 
        ORDER BY er.user_id, score DESC
    ) inner_er
HAVING inner_er.row_number = 1
ORDER BY score DESC, percent_score DESC, time_advantage DESC 
LIMIT 10

This achieved the filtering behavior I wanted without having to rely on the unpredictable behavior of a GROUP BY clause and aggregate functions.

3
On

When you aggregate (GROUP BY) a result set by a subset of the columns (user_id), then all the other columns need to be aggregated.

Note: according to the SQL Standard if you are grouping by the primary key this is not necessary, since all the other columns are dependent on the PK. Nevertheless, this is not the case in your question.

Now, you can use any aggregation function like MAX(), MIN(), SUM(), etc. I chose to use MAX(), but you can change it for any of them.

The query can run as:

SELECT 
  user_id,
  max(exam_body_id),
  max(exam_date),
  max(subject_ids),
  max(percent_score),
  max(time_advantage)
FROM exam_results 
WHERE exam_body_id = 6674 
  AND exam_date >= DATE_SUB(CURDATE(), INTERVAL 1 WEEK) 
  AND subject_ids LIKE '%4674%' 
GROUP BY user_id 
ORDER BY max(percent_score) DESC, max(time_advantage) DESC 
LIMIT 10

See running example at DB Fiddle.

Now, why do you need to aggregate the other columns, you ask? Since you are gruping rows the engine needs to produce a single row per group. Therefore, you need to tell the engine which value to pick when there are many values to pick from: the biggest one, the smallest one, the average of them, etc.

In MySQL 5.7.4 or older, the engine didn't require you to aggregate the other columns. The engine silently and randomly decided for you. You may have got the result you wanted today, but tomorrow the engine could choose the MIN() instead of the MAX() without you knowing, therefore leading to unpredictable results every time you run the query.