How to Find Duplicates Over Multiple Columns Using SQL Server

168 Views Asked by At

Is it possible to find duplicates in a SQL table across multiple columns in a way that only requires a match by one of the columns?

For example, lets say I have a table with the following Schema: ID, C1, C2

My goal is to return a new table with a column called "Group ID" which is the ID of the group the record lives. A given record should exist in the group if there is a match by at least one of C1 or C2.

For example, consider this table

Create Table #Target

(ID Varchar(MAX),C1 Varchar(MAX),C2 Varchar(MAX))

Insert INTO #Target

Values

('1','p1','e1'),
('2','p2','e2'),
('3','p1','e2'),
('4','p3','e3'),
('5','p3','e4'),
('6','p4','e4')
ID C1 C2
1 p1 e1
2 p2 e2
3 p1 e2
4 p3 e3
5 p3 e4
6 p4 e4

The desired output would look something like

ID C1 C2 GID
1 p1 e1 G1
2 p2 e2 G1
3 p1 e2 G1
4 p3 e3 G2
5 p3 e4 G2
6 p4 e4 G2

When trying to come up with a solution I tried grouping by each column individually like this:

#1 Group By C1 and Assign Unique ID (C1GID) to each Group

#2 Group By C2 and Assign Unique ID (C2GID) to each Group

Now the problem I am having is how to group by C1GID OR C2GID. I don't know how to group rows that share at least one of these column values in common.

Update: I am getting closer. I am able to generate a set of IDs that have a match by at least one of the columns using this

SELECT *, CAST(NULL AS INT) AS ID_To INTO #t FROM ( VALUES ('1','p1','e1'), ('2','p2','e2'), ('3','p1','e2'), ('4','p3','e3'), ('5','p3','e4'), ('6','p4','e4') ) t (ID,C1,C2)

Select ID1, STRING_AGG(ID3, ', ') + ',' + STRING_AGG(ID2, ', ') as Groups
FROM (
    select 
    t1.ID as ID1,
    t1.C1 as t1C1,
    t1.C2 as t1C2,

    t2.ID as ID2,
    t2.C1 as t2C1,
    t2.C2 as t2C2,

    t3.ID as ID3,
    t3.C1 as t3C1,
    t3.C2 as t3C2
    from #t t1
    LEFT JOIN (
        Select * From #t
    ) t2 ON t1.C1 = t2.C1 
    LEFT JOIN (
        Select * From #t
    ) t3 ON t1.C2 = t3.C2
    WHERE t1.C1 = t2.C1 OR t1.C2 = t3.C2
) Groups 
GROUP BY ID1

Image here: Output

I am trying to find a way to re-group these by finding an intersection between the values in the Groups column. For example its clear from the result that 1,2,3 belong together, and 4,5,6 belong together because their groups column has a common ID.

1

There are 1 best solutions below

2
p3consulting On

Try this one:

with firstc1s(gid, id,c1) as (
    select row_number() over(order by id) as gid, id, c1 from (
        select min(id) as id, c1
        from #Target 
        group by c1
    ) x
)
,cte(lvl, gid, id, c1, c2, p) as (
    select 1, f.gid, f.id, d.c1, d.c2, concat('/', f.id, ' ')
    from firstc1s f
        join #Target d on d.id = f.id
        
    union all
    
    select lvl+1, c.gid, d.id, d.c1, d.c2, concat(c.p , '/' , d.id, ' ')
    from cte c
    join #Target d on  
        (
            c.c1 = d.c1 or c.c1 = d.c2
            or 
            c.c2 = d.c1 or c.c2 = d.c2
        )
    where charindex(concat('/' , d.id, ' '), c.p) = 0
)
select 
    dense_rank() over(order by gid) as gid, id,c1,c2 
from (
    select min(gid) as gid, id, c1, c2 from cte
    group by id, c1, c2
) x
order by gid, id, c1, c2
;

https://dbfiddle.uk/-nIlBCer