I am trying to validate a data mapping between two data frames for specific columns. I need to validate the following:
- if values in a specific column in df1 matches the mapping in a specific column in df2.
- if values in a specific column in df1 does not match the specified mapping in a specific column in df2 - a different value in df2.
- if values in a specific column in df1 does not have a match in in df2.
df1 looks like this:
| cp_id | cp_code |
|---|---|
| 2A23 | A |
| 2A24 | D |
| 3A45 | G |
| 7A96 | B |
| 2A30 | R |
| 6A18 | K |
df2 looks like like:
| cp_type_id | cp_type_code |
|---|---|
| 2A23 | 8 |
| 2A24 | 7 |
| 3A45 | 3 |
| 2A44 | 1 |
| 6A18 | 8 |
| 4A08 | 2 |
The data mapping constitutes of sets of values where the combination could match any values within the set, as following:
- ('A','C','F','K','M') in df1 should map to (2, 8) in df2 - either 2 or 8
- ('B') in df1 should map to 4 in df2
- ('D','G','I') in df1 should map to 7 in df2
- ('T','U') in df1 should map to (3,5) in df2 - either 3 or 5
Note that df1 has a cp_code as R which is not mapped and that 3A45 is a mismatch. The good news is there is a unique identifier key to use.
First, I created a list for each mapping set and created a statement using merge to check for each mapping. I ended up with 3 lists and 3 statements per set, which I am not sure if this is the right way to do it.
At the end I want to combine the matches into one df that I call match, all no_matches into another df that I call no_match, and all no_mappings into another df that I call no_mapping, like the following:
Match
| cp_id | cp_code | cp_type_id | cp_type_code |
|---|---|---|---|
| 2A23 | A | 2A23 | 8 |
| 2A24 | D | 2A24 | 7 |
| 6A18 | K | 6A18 | 8 |
Mismatch
| cp_id | cp_code | cp_type_id | cp_type_code |
|---|---|---|---|
| 3A45 | G | 3A45 | 3 |
No Mapping
| cp_id | cp_code | cp_type_id | cp_type_code |
|---|---|---|---|
| 7A96 | B | NaN | NaN |
| NaN | NaN | 2A44 | 1 |
| 2A30 | R | NaN | NaN |
| NaN | NaN | 4A08 | 2 |
I am having a hard time to make the no_match to work. This is what I tried for no match:
- filtered df1 based on the set 2 codes
- filtered df2 based on not in map 2 codes
- for the no mapping, I did a df merge with on='cp_id'
no_mapping_set2 = df1_filtered.merge(df2_filtered, on='cp_id', indicator = True)
With the code above, for cp_id = 'B', for example, instead of getting only 1 row back, I get a lot of duplicate rows with cp_id = 'B'.
Just to state my level, I am a beginner in Python. Any help would be appreciated.
Thank you so much for your time.
Rob