Duplicate rows and incorrect indicators in merged DataFrame with Pandas merge

85 Views Asked by At

The date & function is:

  • 2 same DF namely left_df and right_df:
pd.DataFrame({'Column1': ['dummy_text1'],
              'Column2': [''],
              'Column3': [''],
              'Column4': ['dummy_text2'],
              'Column5': [''],
              'Column6': ['Dummy Role 1']
              })
  • the function
left_df.merge(right_df, 
              on=['Column1', 'Column2', 'Column3', 'Column4', 'Column5', 'Column6'], 
              how='outer', 
              indicator=True)

The question is:

the result contains two rows, one as 'left_only' and one as 'right_only', instead of a single row with 'both' as indicator.

Other info.:

I have already tried the following steps to address the issue:

  • Checked for leading/trailing whitespaces in the merge columns, but there are none.
  • Verified that there are no duplicates in either DataFrame based on the merge columns.
  • Used drop_duplicates() with inplace=True and keep=False on both DataFrames before the merge.
  • Ensured that the merge columns have the same data types in both DataFrames.

Edit:

Sharing code snippet: I forgot to mention this part, this is to resolve case of the values:

left_df['Column1_U'] = left_df['Column1'].copy()
left_df['Column6_U'] = left_df['Column6'].copy()
right_df['Column1_U'] = right_df['Column1'].copy()
right_df['Column6_U'] = right_df['Column6'].copy()

this is the rest of the code:

columns_to_exclude = ['Column1_U', 'Column6_U']

left_df = left_df.apply(lambda s: s.str.lower() if s.name not in columns_to_exclude else s)
right_df = right_df.apply(lambda s: s.str.lower() if s.name not in columns_to_exclude else s)

merged_df = merge(left = left_df
                ,right = right_df
                ,on=['Column1', 'Column2', 'Column3', 'Column4', 'Column5', 'Column6']
                ,how='outer'
                ,indicator=True)

but the output that i'm getting is this: Column1,Column2,Column3,Column4,Column5,Column6,Column1_U_x,Column6_U_x,Column1_U_y,Column6_U_y,_merge dummy_text1,,,dummy_text1,,dummy role 1,Dummy_text1,Dummy Role 1,,,left_only dummy_text1,,,dummy_text1,,dummy role 1,,,Dummy_text1,Dummy Role 1,right_only

1

There are 1 best solutions below

0
user9001 On BEST ANSWER

It is solved. The issue turned out to be with NULL value and '', both these dataframes were getting created from SQL tables and in the query I was using ISNULL(Column, '') as Column in one and NULL AS Column in the other. Turns out after doing dataframe.fillna(''), it was treating the '' differently between both dataframes, after I updated the queries the issue was resolved.