I have a Pandas DataFrame:
| msg_id | identifier |
|---|---|
| 001 | Stackoverflow |
| 001 | Stackoverflow |
| 002 | Stackoverflow |
| 002 | Cross-Validated |
I want to drop the duplicated values in `identifier` for each unique value of `msg_id`.
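For reference, a reproducible construction of the example above (a sketch; the dtypes are an assumption, e.g. `msg_id` could equally be an integer column):

```python
import pandas as pd

df = pd.DataFrame({
    "msg_id": ["001", "001", "002", "002"],
    "identifier": ["Stackoverflow", "Stackoverflow",
                   "Stackoverflow", "Cross-Validated"],
})
```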
This is my current approach, which is super slow:
```python
import pandas as pd

acc_df = pd.DataFrame(columns=df.columns)
for _, group in df.groupby("msg_id"):
    # rows whose identifier repeats within this msg_id group
    # (renamed from `df` to avoid clobbering the source frame)
    dupes = group[group.duplicated("identifier")]
    if len(dupes) > 0:
        acc_df = pd.concat([dupes, acc_df], axis=0, ignore_index=False)
acc_df
```
I have a very large dataset with 500 million rows. Even after filtering for the `msg_id` values that have more than one `identifier`, the row count remains very large.
I am looking for any vectorized or faster approach, NOT INCLUDING multiprocessing or threading.
Code
The problem is really to find rows where the combination of the two columns is duplicated; no grouping is needed, since a duplicated `identifier` within a `msg_id` is exactly a duplicated `(msg_id, identifier)` pair. This is possible as follows.
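A minimal sketch, assuming the DataFrame is named `df` as in the question:

```python
import pandas as pd

# Keep the first occurrence of each (msg_id, identifier) pair in one
# vectorized pass; all later repeats are dropped.
deduped = df.drop_duplicates(subset=["msg_id", "identifier"])

# Equivalent mask-based form: duplicated() flags rows whose pair has
# already been seen, which are exactly the rows the original loop collects.
mask = df.duplicated(subset=["msg_id", "identifier"])
deduped = df[~mask]
dupes = df[mask]
```

On the example data this keeps (001, Stackoverflow), (002, Stackoverflow), and (002, Cross-Validated), dropping only the repeated 001 row. Because `duplicated` and `drop_duplicates` operate on the whole frame at once, there is no per-group Python loop and no repeated `pd.concat`, which is what makes the original approach slow at 500 million rows.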