I would like to automate this processes for multiple datasets. I am not sure if there exists a tool like Alteryx or any software where I can do the following.

  • Compare 2 similar datasets on a block aggregate level.
  • If mismatch exists, zoom into that block to find where the inconsistencies occur.
  • Get rows of data with ID where inconsistency happens. The goal is to explore on a high level because of the huge amount of data and open a more comprenhesive view of the inconsistency to point out the error easily.

Here is the data that exists in both SIT and PROD:

ID Block Amt A Amt B Amt C ....
1 X 1000 400 1000 ....
2 Y 2000 500 4000 ....
3 Z 3000 500 3000 ....
2 Y 2000 3000 2000 ....
3 Z 3000 4500 3000 ....
  • This is my PROD data after aggregating on 'block'
Block Counts Sum Amt A Sum Amt B
X 1000 40000 10000
Y 2000 45000 20000
Z 3000 50000 30000
  • This is my SIT data after aggregating on 'block'
Block Counts Sum Amt A Sum Amt B
X 1000 40000 15000
Y 2000 47000 20000
Z 3000 50000 30000

FYI - There are multiple columns like block that could be used on aggregation. One could choose any column, get an aggregation to find inconsistencies and then explore into that particular block.

1

There are 1 best solutions below

1
Xue On

It's a typical full join, then filter process, you can do it in SQL, Python dataframe or Alteryx, pretty much anything

`
WITH
sum_prod AS
(
 SELECT
  SUM(AMT_A) SUM_AMT_A,
  COUNT(1),..
 FROM prod
 GROUP BY ...
)
,
sum_SIT AS
(
 SELECT
  SUM(AMT_A) SUM_AMT_A,
  COUNT(1),..
 FROM sit
 GROUP BY ...
)

SELECT 
*
FROM sum_prod
FULL JOIN
sum_sit
ON
sum_prod.block = sum_sit.block
WHERE
 sum_prod.Sum_Amt_A <> sum_sit.Sum_Amt_A
 OR sum_prod.Sum_Amt_A IS NULL
 OR sum_sit.Sum_Amt_A IS NULL  -- always take care of one side missing
 --add however_your_condition_to_tell_diff

`

Same principle in Alteryx, just a join and a selection filter will do