I want to merge two datasets using pandas. The code works, but the output is not what I expected: I'm merging on IDs, and an ID can appear multiple times with different values in the other columns.
My df1 is the following:
| SubscriberKey | SubscriberId | MONTH | INCOME |
|---|---|---|---|
| 96346d046d42d923ed97d974f26addce04fa7324b3e1f9e69a31f297073ca06f | 125557370 | 4 | 204 |
| f75e979a030f595ba091f0a060135b733c98345b62836a278e221f503f879057 | 125557375 | 7 | 329 |
| 9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd | 125557388 | 3 | 144 |
| 9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd | 125557388 | 6 | 82 |
Each row corresponds to the income of one user in one month, so, as in df2, a user can have more than one entry. I want to merge another dataset into this one based on SubscriberId. The dataset I would like to merge is df2:
| SubscriberId | MONTH | SENT_EMAILS |
|---|---|---|
| 125557388 | 4 | 1 |
| 125557388 | 8 | 1 |
| 125557388 | 1 | 1 |
| 125557388 | 6 | 1 |
| 125557400 | 4 | 1 |
| 125557400 | 6 | 1 |
As you can see, user 125557388 appears 4 times in this one, with one row per month showing how many emails the user sent. I used the code pd.merge(df1, df2, on='SubscriberId') and got the following result:
| SubscriberKey | SubscriberId | MONTH_x | INCOME | MONTH_y | SENT_EMAILS |
|---|---|---|---|---|---|
| f75e979a030f595ba091f0a060135b733c98345b62836a278e221f503f879057 | 125557375 | 7 | 329 | 2 | 1 |
| f75e979a030f595ba091f0a060135b733c98345b62836a278e221f503f879057 | 125557375 | 7 | 329 | 5 | 1 |
| f75e979a030f595ba091f0a060135b733c98345b62836a278e221f503f879057 | 125557375 | 7 | 329 | 6 | 2 |
| 9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd | 125557388 | 3 | 144 | 4 | 1 |
| 9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd | 125557388 | 3 | 144 | 8 | 1 |
| 9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd | 125557388 | 3 | 144 | 1 | 1 |
| 9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd | 125557388 | 3 | 144 | 6 | 1 |
| 9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd | 125557388 | 6 | 82 | 4 | 1 |
| 9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd | 125557388 | 6 | 82 | 8 | 1 |
| 9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd | 125557388 | 6 | 82 | 1 | 1 |
| 9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd | 125557388 | 6 | 82 | 6 | 1 |
What do I need to add to my code so that the end result looks like this:
| SubscriberKey | SubscriberId | MONTH | INCOME | SENT_EMAILS |
|---|---|---|---|---|
| 9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd | 125557388 | 4 | | 1 |
| 9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd | 125557388 | 8 | | 1 |
| 9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd | 125557388 | 1 | | 1 |
| 9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd | 125557388 | 6 | 82 | 1 |
| 9f355d3154658f70ea6104cf4d5581f1c57c28c956dcbd49370e3d004ea8ecbd | 125557388 | 3 | 144 | |
I want to avoid duplicating the MONTH column. Each row should be filled with the SENT_EMAILS value, the INCOME value, or both for that month, not with every possible combination for the Id.
IIUC, you can try this :
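One way to get that shape, sketched under the assumption that a row is identified by the (SubscriberId, MONTH) pair: do an outer merge on both columns, then fill in SubscriberKey for the rows that exist only in df2. The sample frames below are rebuilt from the tables in the question with the keys truncated for readability. `key_by_id`, `common`, and `result` are illustrative names, and the final filter (keeping only Ids present in both frames) is an assumption based on the expected table.

```python
import pandas as pd

# Sample data rebuilt from the question (SubscriberKey values truncated).
df1 = pd.DataFrame({
    "SubscriberKey": ["96346d04", "f75e979a", "9f355d31", "9f355d31"],
    "SubscriberId": [125557370, 125557375, 125557388, 125557388],
    "MONTH": [4, 7, 3, 6],
    "INCOME": [204, 329, 144, 82],
})
df2 = pd.DataFrame({
    "SubscriberId": [125557388] * 4 + [125557400] * 2,
    "MONTH": [4, 8, 1, 6, 4, 6],
    "SENT_EMAILS": [1] * 6,
})

# Merge on both columns so each (SubscriberId, MONTH) pair yields one row
# instead of every df1 x df2 combination per Id.
merged = df1.merge(df2, on=["SubscriberId", "MONTH"], how="outer")

# Rows that come only from df2 have no SubscriberKey; look it up by Id.
key_by_id = (
    df1.drop_duplicates("SubscriberId")
       .set_index("SubscriberId")["SubscriberKey"]
)
merged["SubscriberKey"] = merged["SubscriberKey"].fillna(
    merged["SubscriberId"].map(key_by_id)
)

# Assumption: keep only subscribers that appear in both frames,
# as the original merge on SubscriberId alone did.
common = set(df1["SubscriberId"]) & set(df2["SubscriberId"])
result = merged[merged["SubscriberId"].isin(common)].reset_index(drop=True)

print(result)
```

With this sample data, `result` keeps a single MONTH column and has five rows for 125557388. INCOME and SENT_EMAILS are NaN wherever a month is missing from df1 or df2, which also makes those columns float. Drop the last filter if subscribers that appear in only one frame should be kept.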
Output :