Merge two Dataframe without duplicate rows pandas

63 Views Asked by At

We have problem in merge two different Pandas data frame, using the merge method it duplicate the rows with the same ID

We have two different DF in Pandas:

DF1:

   ID  Gender
0   1    Male
1   2  Female
2   3  Female

DF2

   ID  Vote
0   1    30
1   2    27
2   2    22

We want this result as output:

   ID  Gender  Vote 1  Vote 2
0   1    Male      30    <NA>
1   2  Female      27      22
2   3  Female    <NA>    <NA>
2

There are 2 best solutions below

0
Corralien On BEST ANSWER

You have to reshape your second dataframe before merging to the first one:

df2a = df2.pivot_table(index='ID', columns=df2.groupby('ID').cumcount().add(1), 
                       values='Vote').add_prefix('Vote ')

out = df1.merge(df2a, on='ID', how='left')
print(out)

# Output
   ID  Gender  Vote 1  Vote 2
0   1    Male    30.0     NaN
1   2  Female    27.0    22.0
2   3  Female     NaN     NaN
0
Akshay Rajeev On

Another way to do this is as follows:

We have first dataframe:

data1 = [[1, 'Male'], [2, 'Female'], [3, 'Female']]
df1 = pd.DataFrame(data1, columns=['ID', 'Gender'])

# Output
ID  Gender
1   Male
2   Female
3   Female

And the second dataframe:

data2 = [[1, 30], [2, 27], [2, 22]]
df2 = pd.DataFrame(data2, columns=['ID', 'Vote'])

# Output
ID  Vote
1   30
2   27
2   22

We first merge:

merged_df = df1.merge(df2, on='ID', how='outer')

#Output:

   ID   Gender  Vote
0   1   Male    30.0
1   2   Female  27.0
2   2   Female  22.0
3   3   Female  NaN

Assuming that df1 have all unique rows, you can create your required dataframe as follows:

final_df_list = []
for id in merged_df.ID.unique():
    prefix = 'Vote'
    suffix = 1
    current_df = merged_df.loc[merged_df.ID == id]
    gender = merged_df.loc[merged_df.ID == id]['Gender'].values[0]
    data = [[id, gender]]
    columns = ['ID', 'Gender']
    for vote in current_df.Vote.values:
        data[0].append(vote)
        columns.append(f'{prefix}{suffix}')
        suffix += 1
    final_df_list.append(pd.DataFrame(data=data, columns=columns))
output_df = pd.concat(final_df_list, sort = False)

# output_df:

   ID   Gender  Vote1   Vote2
0   1   Male    30.0    NaN
0   2   Female  27.0    22.0
0   3   Female  NaN     NaN