Merging two datasets with repeated IDs

Question

Asked by Carlos Manuel Arroyo on 03 May 2023 at 19:40 · 97 views

I want to merge two datasets using pandas. The code runs, but the output is not quite what I expected: I am merging on IDs, and an ID can appear multiple times with different values in the other columns.

My df1 is the following:

SubscriberKey	SubscriberId	MONTH	INCOME
96346d046d42d923ed97d974f26addce04fa7324b3e1f9e69a31f297073ca06f	125557370	4	204
f75e979a030f595ba091f0a060135b733c98345b62836a278e221f503f879057	125557375	7	329
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd	125557388	3	144
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd	125557388	6	82

Each row corresponds to the income of one user in one month, so, as in df2, there can be more than one entry per user. I want to merge another dataset into this one based on SubscriberId. The dataset I would like to merge is df2:

SubscriberId	MONTH	SENT_EMAILS
125557388	4	1
125557388	8	1
125557388	1	1
125557388	6	1
125557400	4	1
125557400	6	1

As you can see, user 125557388 appears 4 times in this one; each row shows how many emails a user sent in a given month. I used the code pd.merge(df1, df2, on='SubscriberId') and got the following result:

SubscriberKey	SubscriberId	MONTH_x	INCOME	MONTH_y	SENT_EMAILS
f75e979a030f595ba091f0a060135b733c98345b62836a278e221f503f879057	125557375	7	329	2	1
f75e979a030f595ba091f0a060135b733c98345b62836a278e221f503f879057	125557375	7	329	5	1
f75e979a030f595ba091f0a060135b733c98345b62836a278e221f503f879057	125557375	7	329	6	2
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd	125557388	3	144	4	1
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd	125557388	3	144	8	1
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd	125557388	3	144	1	1
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd	125557388	3	144	6	1
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd	125557388	6	82	4	1
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd	125557388	6	82	8	1
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd	125557388	6	82	1	1
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd	125557388	6	82	6	1
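
The repetition in this result is what a merge on a non-unique key produces: every df1 row for an ID is paired with every df2 row for the same ID. A minimal sketch that reproduces this using only the rows shown in the excerpts above; the short keys "k1" to "k3" are placeholders standing in for the real hashes:

```python
import pandas as pd

# Rows from the df1 excerpt; "k1".."k3" are placeholder keys, not real data
df1 = pd.DataFrame({
    "SubscriberKey": ["k1", "k2", "k3", "k3"],
    "SubscriberId": [125557370, 125557375, 125557388, 125557388],
    "MONTH": [4, 7, 3, 6],
    "INCOME": [204, 329, 144, 82],
})

# Rows from the df2 excerpt
df2 = pd.DataFrame({
    "SubscriberId": [125557388] * 4 + [125557400] * 2,
    "MONTH": [4, 8, 1, 6, 4, 6],
    "SENT_EMAILS": [1] * 6,
})

# Merging on SubscriberId alone pairs each of the 2 df1 rows for
# 125557388 with each of the 4 df2 rows for the same ID.
merged = pd.merge(df1, df2, on="SubscriberId")

# MONTH exists in both frames but is not a merge key, so pandas
# keeps both copies and adds the _x / _y suffixes.
print(merged[["SubscriberId", "MONTH_x", "INCOME", "MONTH_y", "SENT_EMAILS"]])
print(len(merged))
```

With these excerpts the only ID present in both frames is 125557388, so the merge yields 2 × 4 rows for it.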

What do I need to add to my code so that the end result looks like this:

SubscriberKey	SubscriberId	MONTH	INCOME	SENT_EMAILS
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd	125557388	4		1
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd	125557388	8		1
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd	125557388	1		1
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd	125557388	6	82	1
9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd	125557388	3	144

I want to avoid duplicating the MONTH column. Each row should carry the SENT_EMAILS value, the INCOME value, or both for that month, rather than every possible combination of rows that share an ID.

There are 2 best solutions below

**Timeless** · Answer 1 · 2023-05-03T20:46:22.323000

IIUC, you can try this:

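import pandas as pd

# Map each SubscriberId to its SubscriberKey (taken from df1)
d = df1.set_index("SubscriberId")["SubscriberKey"].to_dict()
# Rows whose (MONTH, SubscriberId) pair occurs in both frames
tmp = df1.merge(df2, on=["MONTH", "SubscriberId"]).drop_duplicates()

out = (pd.concat([df1, df2])
            # keep only subscribers that have at least one matching pair
            .loc[lambda x: x["SubscriberId"].isin(tmp["SubscriberId"])]
            # one row per (SubscriberId, MONTH); first() takes the first non-null value per column
            .groupby(["SubscriberId", "MONTH"], as_index=False, sort=False).first()
            # fill in SubscriberKey for the rows that came from df2
            .assign(SubscriberKey=lambda x: x["SubscriberId"].map(d))
            [list(df1.columns) + ["SENT_EMAILS"]]
      )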

Output:

print(out)

                                                      SubscriberKey  SubscriberId  MONTH  INCOME  SENT_EMAILS
0  9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd     125557388      3  144.00          NaN
1  9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd     125557388      6   82.00         1.00
2  9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd     125557388      4     NaN         1.00
3  9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd     125557388      8     NaN         1.00
4  9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd     125557388      1     NaN         1.00

**tura** · Answer 2 · 2023-05-04T09:04:21.300000

result = pd.merge(df1, df2, on=['SubscriberId', 'MONTH'], how='left')

This should work!
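
One caveat about the merge above: how='left' keeps only the (SubscriberId, MONTH) pairs that exist in df1, so the rows for months 4, 8 and 1 in the desired output would not appear. A minimal sketch, assuming an outer merge on both keys is acceptable and that SubscriberKey can be looked up from df1 by SubscriberId. The short keys "k1" to "k3" and the names key_by_id and wanted are placeholders introduced for illustration:

```python
import pandas as pd

# Rows from the df1 excerpt; "k1".."k3" are placeholder keys, not real data
df1 = pd.DataFrame({
    "SubscriberKey": ["k1", "k2", "k3", "k3"],
    "SubscriberId": [125557370, 125557375, 125557388, 125557388],
    "MONTH": [4, 7, 3, 6],
    "INCOME": [204, 329, 144, 82],
})

# Rows from the df2 excerpt
df2 = pd.DataFrame({
    "SubscriberId": [125557388] * 4 + [125557400] * 2,
    "MONTH": [4, 8, 1, 6, 4, 6],
    "SENT_EMAILS": [1] * 6,
})

# Left merge: one row per df1 row, df2-only months are dropped
left = pd.merge(df1, df2, on=["SubscriberId", "MONTH"], how="left")

# Outer merge: keeps (SubscriberId, MONTH) pairs from either frame
outer = pd.merge(df1, df2, on=["SubscriberId", "MONTH"], how="outer")

# Rows that came only from df2 have no SubscriberKey; look it up by ID.
# IDs that never occur in df1 (e.g. 125557400) stay NaN.
key_by_id = (df1.drop_duplicates("SubscriberId")
                .set_index("SubscriberId")["SubscriberKey"])
outer["SubscriberKey"] = outer["SubscriberKey"].fillna(
    outer["SubscriberId"].map(key_by_id)
)

# Restrict to the subscriber shown in the desired output
wanted = outer[outer["SubscriberId"] == 125557388]
print(wanted)
```

Unlike the first answer, this sketch does not filter out subscribers that lack a matching month in both frames; add such a filter if only those subscribers are wanted.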
