delete duplicates based on a combination of 2 columns

65 Views Asked by At

I consider a row duplicate when the combination of hash_key and load_date is repeating. I can find such rows using this CTE:

with w_duplicated_rows as (
    select hash_key, load_date, count(*),
    --row_number() over (partition by hash_key, load_date order by load_date desc) as row_num
    from my_table
    group by hash_key, load_date having count(*) > 1
)

This gives me the combinations which are being repeated in the table.

Now, I want to delete duplicates from the original my_table. I have another column called STATUS. While deleting duplicates, I want to delete the row after filtering on ORDER BY STATUS DESC. Means I want to keep the top row and remove the second one (where status is lower).

How can I achieve this?

2

There are 2 best solutions below

12
Illia Nova On

I can suggest using ROW_NUMBER() window function to assign a row number to each row within the duplicate groups, ordered by STATUS DESC. Then, you can use a common table expression (CTE) to identify the rows with row numbers greater than 1 and delete them from the original table.

Here's an example query:

WITH ranked_rows AS (
    SELECT
        hash_key,
        load_date,
        STATUS,
        ROW_NUMBER() OVER (PARTITION BY hash_key, load_date ORDER BY STATUS DESC) AS row_num
    FROM my_table
    WHERE (hash_key, load_date) IN (
        SELECT hash_key, load_date
        FROM my_table
        GROUP BY hash_key, load_date
        HAVING COUNT(*) > 1
    )
)
DELETE FROM my_table
WHERE (hash_key, load_date) IN (
    SELECT hash_key, load_date
    FROM ranked_rows
    WHERE row_num > 1
);

Make sure to test this query on a backup before applying it to the entire table.

0
Error_2646 On

If you've truly got no field to break ties maybe ctid is the best it'll get. Hard to tell if you are using postgres or redshift. Redshift appears to have a similar concept with row_id. But read the documentation.

Rationale - Basic strategy to preferentially select the duplicate with the highest status, but add the system unique identifier to break ties.

Fiddle: https://dbfiddle.uk/MDBsecyC

create table some_test_data
  (
    hash_key varchar(255),
    load_date date,
    status integer
  );
insert into some_test_data values
  ('abc', '2024-01-01', 0),
  ('abc', '2024-01-01', 1),
  ('abc', '2024-01-01', 1),
  ('cde', '2024-01-02', 1),
  ('efg', '2024-01-03', 0),
  ('efg', '2024-01-04', 0);

with duplicates_to_delete as 
  (
    select *,
           ctid,
           row_number() over ( partition by
             hash_key,
             load_date
             order by status desc,
                      ctid desc) as rn
      from some_test_data
  )
delete from some_test_data del_target
  where exists
          (
            select 1
              from duplicates_to_delete dtd
             where del_target.hash_key = dtd.hash_key
               and del_target.load_date = dtd.load_date
               and del_target.ctid = dtd.ctid
               and dtd.rn > 1
          );