Fill na values in dataframe after merge

201 Views Asked by At

I have 2 dataframes I want to merge on first name and contact email returning all of the values from df2.

Sample of data:

df1 = pd.DataFrame([['Elle', 'Kelly', '[email protected]', 2],
['Amanda','Johnson', '[email protected]', 5], 
['Jay', 'Rogers', '[email protected]', 4], 
['David','Connor', '[email protected]', 3],
['Rachel','Connor','[email protected]', 2],
['Anne','Smith','[email protected]', 4],
['Anne','Smith', '[email protected]', 2],
['Dani', 'Carter', '[email protected]', 3],
['Drake', 'Walker', '[email protected]', 2]], 
columns = ['First Name', 'Last Name', 'Email', 'Rating'])

df2 = pd.DataFrame([[np.nan, np.nan, np.nan, 1040, 'City'], 
['Dani','Carter-Hampton', '[email protected]', 1040, 'New York'],
['Anne','Smith','[email protected]', 1040, 'New York'], 
['David', 'Connor', '[email protected]', 1040, 'Chicago'], 
['Jay', 'Rogers','[email protected]', 1040, 'Los Angeles'], 
['Anne','Smith', '[email protected]', 1040, 'Houston'],
['Amanda','Johnson','[email protected]', 1040, 'Los Angeles'],
['Rachel', 'Connor', '[email protected]', 1040, 'Chicago'],
['Elle', 'Moore-Kelly', '[email protected]', 1040, 'Los Angeles'],
['Drake', 'Walker', '[email protected]', 1040, 'Los Angeles']],
columns = ['First Name','Last Name','Contact Email','Movie Id','Location'])

merge

merged = df1.merge(df2, left_on = ['First Name', 'Email'], right_on =  ['First Name', 'Contact Email'], how = 'right')

Output of merge enter image description here

The problem is some of the rows are unable to match because of differences between name and email in both tables, so I want to perform another merge on first name and last name to fill in the missing nan values. I tried using the combine_first function but because the indices are different it results in mismatched rows and duplicates. How do I go about this?

Desired Output after the first merge to fill in na values

enter image description here

1

There are 1 best solutions below

2
Zach Flanders On

Building on @Lazyer's comment, I would use pandas method chaining and use combine_first to combine the Last Name_x and Last Name_y columns and also the Email and Contact Email.

merged = (
    df1
    .merge(df2, left_on=['First Name', 'Email'], right_on=['First Name', 'Contact Email'], how ='right')
    .assign(**{
        'Last Name': lambda x: x['Last Name_x'].combine_first(x['Last Name_y']),
        'Email': lambda x: x['Email'].combine_first(x['Contact Email'])
    })
    .drop(['Last Name_x', 'Last Name_y', 'Contact Email'], axis='columns')
)