delete duplicates based on a combination of 2 columns

65 Views Asked by x89 At 29 January 2024 at 19:16

I consider a row duplicate when the combination of hash_key and load_date is repeating. I can find such rows using this CTE:

with w_duplicated_rows as (
    select hash_key, load_date, count(*),
    --row_number() over (partition by hash_key, load_date order by load_date desc) as row_num
    from my_table
    group by hash_key, load_date having count(*) > 1
)

This gives me the combinations which are being repeated in the table.

Now, I want to delete duplicates from the original my_table. I have another column called STATUS. While deleting duplicates, I want to delete the row after filtering on ORDER BY STATUS DESC. Means I want to keep the top row and remove the second one (where status is lower).

How can I achieve this?

Original Q&A

There are 2 best solutions below

Illia Nova On 29 January 2024 at 19:38

I can suggest using ROW_NUMBER() window function to assign a row number to each row within the duplicate groups, ordered by STATUS DESC. Then, you can use a common table expression (CTE) to identify the rows with row numbers greater than 1 and delete them from the original table.

Here's an example query:

WITH ranked_rows AS (
    SELECT
        hash_key,
        load_date,
        STATUS,
        ROW_NUMBER() OVER (PARTITION BY hash_key, load_date ORDER BY STATUS DESC) AS row_num
    FROM my_table
    WHERE (hash_key, load_date) IN (
        SELECT hash_key, load_date
        FROM my_table
        GROUP BY hash_key, load_date
        HAVING COUNT(*) > 1
    )
)
DELETE FROM my_table
WHERE (hash_key, load_date) IN (
    SELECT hash_key, load_date
    FROM ranked_rows
    WHERE row_num > 1
);

Make sure to test this query on a backup before applying it to the entire table.

Error_2646 On 30 January 2024 at 14:52

If you've truly got no field to break ties maybe ctid is the best it'll get. Hard to tell if you are using postgres or redshift. Redshift appears to have a similar concept with row_id. But read the documentation.

Rationale - Basic strategy to preferentially select the duplicate with the highest status, but add the system unique identifier to break ties.

Fiddle: https://dbfiddle.uk/MDBsecyC

create table some_test_data
  (
    hash_key varchar(255),
    load_date date,
    status integer
  );
insert into some_test_data values
  ('abc', '2024-01-01', 0),
  ('abc', '2024-01-01', 1),
  ('abc', '2024-01-01', 1),
  ('cde', '2024-01-02', 1),
  ('efg', '2024-01-03', 0),
  ('efg', '2024-01-04', 0);

with duplicates_to_delete as 
  (
    select *,
           ctid,
           row_number() over ( partition by
             hash_key,
             load_date
             order by status desc,
                      ctid desc) as rn
      from some_test_data
  )
delete from some_test_data del_target
  where exists
          (
            select 1
              from duplicates_to_delete dtd
             where del_target.hash_key = dtd.hash_key
               and del_target.load_date = dtd.load_date
               and del_target.ctid = dtd.ctid
               and dtd.rn > 1
          );

delete duplicates based on a combination of 2 columns

There are 2 best solutions below

Related Questions in SQL

Related Questions in POSTGRESQL

Related Questions in JOIN

Related Questions in AMAZON-REDSHIFT

Related Questions in AMAZON-REDSHIFT-SERVERLESS

Trending Questions

Popular # Hahtags

Popular Questions