How can I determine the column level differences between 2 tables?

21 Views Asked by At

Typically the solution for comparing if 2 relational db tables (I am using AWS Athena) are equal is to do full outer join on all the columns but adding an extra column to each dataset that acts as a marker.

select *
from 
(
    (select *, 't1' as t1 from table1 where ...) t1
    full outer join
    (select *, 't2' as t2 from table2 where ...) t2
    on t1.col1 = t2.col1 and t1.col2 = t2.col2 and ...
)
where t1 is null or t2 is null

This gives you the differing rows. But I want to go a step further and find on which columns the row differ.

The problem I'm having is that I'm not sure how to model this, even in a script. If a join failed, that means that the rows where t2 is null (i.e rows in t1 not in t2) and rows where t1 is null (i.e rows in t2 not in t1) could differ on any of the columns or even just not be present in the other table at all. How would I know what 2 rows to compare to find the differences?

I'd appreciate some help modeling the algorithm here. The above query will give me what rows do not have a full match in the other table. But for each row where there's a mismatch, how do I find where the mismatch occurred?

1 way I can think of doing this is once I have the result set of differing rows, I will join the t1 diffs with the t2 diffs on each column (i.e "join on col a", then "join on col a and col b", then "join on col a and col b and col c"), and that will tell me the differences, but I'm sure there are edge cases + that is a lot of queries to do for wide tables.

Another way I can think of is grouping. I would group on all the columns, which gives me a list of groups, and then I order by all the columns. Similar columns would thus show up next to each other, but I don't know if this is fool-proof because what if the columns match on col a, b, d but not col c, and I order by a, b, c, d?

Reminder that this is in AWS Athena, so certain SQL statements are not available.

0

There are 0 best solutions below