I have a table in Redshift which I can access by SQL, and this table gets rows regularly added to it. The updates are both on a monthly interval (major updates) and on a shorter interval (minor updates), kept track of by the id column. Suppose this table has three columns, id, val1 and val2. The id column is something like NNaaMM, where:
NNis two digits counting the major updates, starting with00,01,...-
aais two characters indicating the group (which for this question may be considered fixed) MMis two digits counting the minor updates, starting with00,01,...
The columns val1 and val2 have the information I want to compare. I would like the two most recent large updates, using each of their respective latest minor updates. That is, if the latest large update has id=54aa12 and the large update before that has id=53aa02, then I would like to get a table that looks like the following:
| id | val1 | val2 | which |
|---|---|---|---|
| 54aa12 | foo | 6 | current |
| 54aa12 | bar | 5 | current |
| 53aa02 | foo | 10 | previous |
| 53aa02 | baz | 12 | previous |
So far I have an extremely long and slow query that seems to do the job:
WITH prefixes AS (
SELECT SUBSTRING(id,1,3) AS prefix
FROM table
WHERE id LIKE '%aa%'
GROUP BY prefix
ORDER BY prefix DESC
LIMIT 2),
id_max AS (
SELECT t.id AS id
FROM prefixes
JOIN table t
ON t.id LIKE (SELECT MAX(prefix) FROM prefixes)+'%'
GROUP BY id
ORDER BY id DESC
LIMIT 1),
id_min AS (
SELECT t.id AS id
FROM prefixes
JOIN table t
ON t.id LIKE (SELECT MIN(prefix) FROM prefixes)+'%'
GROUP BY id
ORDER BY id DESC
LIMIT 1)
(SELECT *, 'current' AS which
FROM table
WHERE id = (SELECT MAX(id) FROM id_max))
UNION ALL
(SELECT *, 'previous' AS which
FROM table
WHERE id= (SELECT MAX(id) FROM id_min))
My question is: Is there a better / faster / cleaner way to do this?
I'm aware everything is working (and so why am I posting here...), but right now it feels like a hack that only I can understand, and this code is part of a bigger project that I have to share with others. So code readability (and shortness if possible) is important, I'm hoping to have something like a single SELECT .. WHERE id=CURRENT(..) OR id=PREVIOUS(..), and I know these functions don't exist, just that format would be preferrable.
I suspect there are more cases than your sample data covers, but row_number() can be used here.
Rationale - Partition by all the records which have minor updates, of those row_number() ordered by the first three characters will give you 1 & 2 for your latest two and >2 for everything else.