Merging two datasets with repeated IDs


I want to merge two datasets using pandas. The code runs, but the output is not what I expected: I'm merging on IDs, and one ID can appear multiple times with different values in the other columns.

My df1 is the following:

SubscriberKey SubscriberId MONTH INCOME
96346d046d42d923ed97d974f26addce04fa7324b3e1f9e69a31f297073ca06f 125557370 4 204
f75e979a030f595ba091f0a060135b733c98345b62836a278e221f503f879057 125557375 7 329
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd 125557388 3 144
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd 125557388 6 82

Each row corresponds to one user's income for one month, so, as in df2, a user can have more than one entry. I want to merge another dataset into this one based on SubscriberId. The dataset I would like to merge is df2:

SubscriberId MONTH SENT_EMAILS
125557388 4 1
125557388 8 1
125557388 1 1
125557388 6 1
125557400 4 1
125557400 6 1

As you can see, user 125557388 appears 4 times in df2, which records how many emails each user sent in a month. I used pd.merge(df1, df2, on='SubscriberId') and got the following result:

SubscriberKey SubscriberId MONTH_x INCOME MONTH_y SENT_EMAILS
f75e979a030f595ba091f0a060135b733c98345b62836a278e221f503f879057 125557375 7 329 2 1
f75e979a030f595ba091f0a060135b733c98345b62836a278e221f503f879057 125557375 7 329 5 1
f75e979a030f595ba091f0a060135b733c98345b62836a278e221f503f879057 125557375 7 329 6 2
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd 125557388 3 144 4 1
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd 125557388 3 144 8 1
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd 125557388 3 144 1 1
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd 125557388 3 144 6 1
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd 125557388 6 82 4 1
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd 125557388 6 82 8 1
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd 125557388 6 82 1 1
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd 125557388 6 82 6 1
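
For reference, the row multiplication can be reproduced with a small sketch. The frames below are shortened stand-ins for the samples above (the SubscriberKey values are abbreviated, which is an assumption made purely for readability). Merging on SubscriberId alone pairs every df1 row of a subscriber with every df2 row of that subscriber, and because MONTH exists in both frames without being a join key, pandas suffixes it as MONTH_x and MONTH_y.

```python
import pandas as pd

# Shortened stand-ins for the sample data (SubscriberKey abbreviated for readability).
df1 = pd.DataFrame({
    "SubscriberKey": ["9634", "f75e", "9f35", "9f35"],
    "SubscriberId": [125557370, 125557375, 125557388, 125557388],
    "MONTH": [4, 7, 3, 6],
    "INCOME": [204, 329, 144, 82],
})
df2 = pd.DataFrame({
    "SubscriberId": [125557388] * 4 + [125557400] * 2,
    "MONTH": [4, 8, 1, 6, 4, 6],
    "SENT_EMAILS": [1, 1, 1, 1, 1, 1],
})

# Default inner merge on the ID only: 125557388 has 2 rows in df1 and 4 rows
# in df2, so it contributes 2 * 4 = 8 rows to the result.
merged = pd.merge(df1, df2, on="SubscriberId")

print(len(merged))
print(merged.columns.tolist())
```

With these stand-in frames only 125557388 is present on both sides, so the whole result consists of those 8 combinations.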

What do I need to add to my code so that the end result looks like this:

SubscriberKey                                                     SubscriberId  MONTH  INCOME  SENT_EMAILS
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd  125557388     4      NaN     1
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd  125557388     8      NaN     1
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd  125557388     1      NaN     1
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd  125557388     6      82      1
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd  125557388     3      144     NaN

I want to avoid duplicating the MONTH column, and I want each row to be filled with only its "sent_emails" part or its "income" part, not every possible combination for the Id.


There are 2 best solutions below

Answer by Timeless:

IIUC, you can try this:

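import pandas as pd

# Map each SubscriberId to its SubscriberKey so rows coming from df2 can be labelled.
d = df1.set_index("SubscriberId")["SubscriberKey"].to_dict()
# Rows whose (MONTH, SubscriberId) pair exists in both frames; used to find shared subscribers.
tmp = df1.merge(df2, on=["MONTH", "SubscriberId"]).drop_duplicates()

out = (pd.concat([df1, df2])
            # keep only subscribers with at least one matching month in both frames
            .loc[lambda x: x["SubscriberId"].isin(tmp["SubscriberId"])]
            # collapse rows sharing an id and month; first() takes the first non-null value per column
            .groupby(["SubscriberId", "MONTH"], as_index=False, sort=False).first()
            .assign(SubscriberKey=lambda x: x["SubscriberId"].map(d))
            [list(df1.columns) + ["SENT_EMAILS"]]
      )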

Output:

print(out)

                                                      SubscriberKey  SubscriberId  MONTH  INCOME  SENT_EMAILS
0  9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd     125557388      3  144.00          NaN
1  9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd     125557388      6   82.00         1.00
2  9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd     125557388      4     NaN         1.00
3  9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd     125557388      8     NaN         1.00
4  9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd     125557388      1     NaN         1.00
Answer by tura:
result = pd.merge(df1, df2, on=['SubscriberId', 'MONTH'], how='left')

This should work!
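
One thing to keep in mind: how='left' keeps every row of df1 (including subscribers that never appear in df2) and drops months that exist only in df2, so months 4, 8 and 1 from the desired output would be missing. A possible variation, sketched below under assumptions, is an outer merge on both keys restricted to subscribers present in both frames. The frames are shortened stand-ins for the question's samples (SubscriberKey abbreviated), and the names common, left, right and key_by_id are illustrative only.

```python
import pandas as pd

# Shortened stand-ins for the question's data (SubscriberKey abbreviated for readability).
df1 = pd.DataFrame({
    "SubscriberKey": ["9634", "f75e", "9f35", "9f35"],
    "SubscriberId": [125557370, 125557375, 125557388, 125557388],
    "MONTH": [4, 7, 3, 6],
    "INCOME": [204, 329, 144, 82],
})
df2 = pd.DataFrame({
    "SubscriberId": [125557388] * 4 + [125557400] * 2,
    "MONTH": [4, 8, 1, 6, 4, 6],
    "SENT_EMAILS": [1, 1, 1, 1, 1, 1],
})

# Assumption: only subscribers present in BOTH frames are wanted,
# as the desired output in the question suggests.
common = set(df1["SubscriberId"]) & set(df2["SubscriberId"])
left = df1[df1["SubscriberId"].isin(common)]
right = df2[df2["SubscriberId"].isin(common)]

# An outer merge on both keys keeps one row per (SubscriberId, MONTH) pair
# found in either frame, leaving NaN where one side has no data for that month.
out = left.merge(right, on=["SubscriberId", "MONTH"], how="outer")

# Rows that came only from df2 have no SubscriberKey yet; fill it from df1.
key_by_id = df1.drop_duplicates("SubscriberId").set_index("SubscriberId")["SubscriberKey"]
out["SubscriberKey"] = out["SubscriberId"].map(key_by_id)

print(out)
```

The row order of an outer merge can differ between pandas versions, so sort explicitly (for example with sort_values) if the order matters.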