I have a table which is populated by data scraped from the web. After adding the new data to the table, I want to delete duplicates if no changes have been made to specific columns.
I have tried the following query, along with various ways of referencing the md5 hash and casting column1 as text:
DELETE FROM
my_table a
USING my_table b
WHERE
a.pk < b.pk
AND a.md5(column1) = b.md5(column1);
and receive the following error:
ERROR:  column reference "column1" is ambiguous
LINE 6: AND a.md5(column1) = b.md5(column1);
                  ^
SQL state: 42702
Character: 99
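The error itself comes from where the alias sits: in a.md5(column1) the a. attaches to md5, so column1 is left unqualified and could refer to either copy of my_table. Presumably something like the following was intended (a sketch that only moves the alias onto the column):

DELETE FROM
my_table a
USING my_table b
WHERE
a.pk < b.pk
AND md5(a.column1) = md5(b.column1); -- cast column1 to text first if it isn't text already

That only fixes the syntax, though.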
Don't deduplicate based on md5(), just use =. Let Postgres figure out if it benefits from hashing at all, and if so, what the optimal hashing function should be. It's easier for you to declare, easier for Postgres to optimise and execute. SQL is meant to be declarative - state what you want, not how exactly it's supposed to be done.
That way you also avoid potentially incorrect results - md5 sticks around for its simplicity, but it has long been considered unreliable. It's rare but not impossible for two completely different values to share the same md5 hash, so you may get undeserved deletes.
If Postgres sees the values are too long to be compared with a plain = effectively, it'll change the value equality into value hash equality behind the scenes, using a hashing function of its own choice.
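For example, a sketch reusing my_table, pk and column1 from the question:

DELETE FROM
my_table a
USING my_table b
WHERE
a.pk < b.pk
AND a.column1 = b.column1; -- plain equality; Postgres decides internally whether hashing helps

If several columns determine whether a row counts as unchanged, they can all be compared the same way, with IS NOT DISTINCT FROM standing in for = wherever NULLs are possible.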
Notice how the version with md5 affects the exact same number of rows, just 4 times slower: demo

As a bonus, if future versions of Postgres get improved in terms of hash-based operations, you'll get that for free with =, without having to revisit and update any code.
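To run that comparison against your own data (the timing above comes from the linked demo, not from this snippet), both variants can be wrapped in EXPLAIN (ANALYZE) and rolled back, so each run starts from the same state - note that ANALYZE really executes the DELETE:

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
DELETE FROM my_table a
USING my_table b
WHERE a.pk < b.pk
AND a.column1 = b.column1;
ROLLBACK; -- undo the delete so the md5 run sees the same rows

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
DELETE FROM my_table a
USING my_table b
WHERE a.pk < b.pk
AND md5(a.column1) = md5(b.column1);
ROLLBACK;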