I have a table which is populated by data scraped from the web. After adding the new data to the table, I want to delete duplicates if no changes have been made to specific columns.
I have tried the following query, along with various ways of referencing the md5 hash and casting column1 as text:
DELETE FROM
my_table a
USING my_table b
WHERE
a.pk < b.pk
AND a.md5(column1) = b.md5(column1);
and receive the following error:
ERROR:  column reference "column1" is ambiguous
LINE 6: AND a.md5(column1) = b.md5(column1);
                  ^
SQL state: 42702
Character: 99
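The error itself comes from where the alias sits: in a.md5(column1) the a. attaches to md5, so column1 is left unqualified and could refer to either copy of my_table. Presumably something like the following was intended (a sketch that only moves the alias onto the column):

DELETE FROM
my_table a
USING my_table b
WHERE
a.pk < b.pk
AND md5(a.column1) = md5(b.column1); -- cast column1 to text first if it isn't text already

That only fixes the syntax, though.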
Don't deduplicate based on md5(), just use =. Let Postgres figure out if it benefits from hashing at all, and if so, what the optimal hashing function should be. It's easier for you to declare, easier for Postgres to optimise and execute. SQL is meant to be declarative - state what you want, not how exactly it's supposed to be done.
That way you also avoid potentially incorrect results - md5 sticks around for its simplicity, but it has long been considered unreliable. It's rare but not impossible for two completely different values to share the same md5 hash, so you may get undeserved deletes.
If Postgres sees the values are too long to be compared with a plain = effectively, it'll change the value equality into value hash equality behind the scenes, using a hashing function of its own choice.
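For example, a sketch reusing my_table, pk and column1 from the question:

DELETE FROM
my_table a
USING my_table b
WHERE
a.pk < b.pk
AND a.column1 = b.column1; -- plain equality; Postgres decides internally whether hashing helps

If several columns determine whether a row counts as unchanged, they can all be compared the same way, with IS NOT DISTINCT FROM standing in for = wherever NULLs are possible.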
Notice how the version with md5 affects the exact same number of rows, just 4 times slower: demo

As a bonus, if future versions of Postgres get improved in terms of hash-based operations, you'll get that for free with =, without having to revisit and update any code.
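To run that comparison against your own data (the timing above comes from the linked demo, not from this snippet), both variants can be wrapped in EXPLAIN (ANALYZE) and rolled back, so each run starts from the same state - note that ANALYZE really executes the DELETE:

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
DELETE FROM my_table a
USING my_table b
WHERE a.pk < b.pk
AND a.column1 = b.column1;
ROLLBACK; -- undo the delete so the md5 run sees the same rows

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
DELETE FROM my_table a
USING my_table b
WHERE a.pk < b.pk
AND md5(a.column1) = md5(b.column1);
ROLLBACK;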