How do I validate data mapping between 2 data frames in pandas

363 Views Asked by At

I am trying to validate a data mapping between two data frames for specific columns. I need to validate the following:

  1. if values in a specific column in df1 matches the mapping in a specific column in df2.
  2. if values in a specific column in df1 does not match the specified mapping in a specific column in df2 - a different value in df2.
  3. if values in a specific column in df1 does not have a match in in df2.

df1 looks like this:

cp_id cp_code
2A23 A
2A24 D
3A45 G
7A96 B
2A30 R
6A18 K

df2 looks like like:

cp_type_id cp_type_code
2A23 8
2A24 7
3A45 3
2A44 1
6A18 8
4A08 2

The data mapping constitutes of sets of values where the combination could match any values within the set, as following:

  • ('A','C','F','K','M') in df1 should map to (2, 8) in df2 - either 2 or 8
  • ('B') in df1 should map to 4 in df2
  • ('D','G','I') in df1 should map to 7 in df2
  • ('T','U') in df1 should map to (3,5) in df2 - either 3 or 5

Note that df1 has a cp_code as R which is not mapped and that 3A45 is a mismatch. The good news is there is a unique identifier key to use.

First, I created a list for each mapping set and created a statement using merge to check for each mapping. I ended up with 3 lists and 3 statements per set, which I am not sure if this is the right way to do it.

At the end I want to combine the matches into one df that I call match, all no_matches into another df that I call no_match, and all no_mappings into another df that I call no_mapping, like the following:

Match

cp_id cp_code cp_type_id cp_type_code
2A23 A 2A23 8
2A24 D 2A24 7
6A18 K 6A18 8

Mismatch

cp_id cp_code cp_type_id cp_type_code
3A45 G 3A45 3

No Mapping

cp_id cp_code cp_type_id cp_type_code
7A96 B NaN NaN
NaN NaN 2A44 1
2A30 R NaN NaN
NaN NaN 4A08 2

I am having a hard time to make the no_match to work. This is what I tried for no match:

  1. filtered df1 based on the set 2 codes
  2. filtered df2 based on not in map 2 codes
  3. for the no mapping, I did a df merge with on='cp_id'
    no_mapping_set2 = df1_filtered.merge(df2_filtered, on='cp_id', indicator = True)

With the code above, for cp_id = 'B', for example, instead of getting only 1 row back, I get a lot of duplicate rows with cp_id = 'B'.

Just to state my level, I am a beginner in Python. Any help would be appreciated.

Thank you so much for your time.

Rob

0

There are 0 best solutions below