Finding Duplicated Values in Pandas Groupby Object


I have a Pandas DataFrame:

msg_id  identifier
001     Stackoverflow
001     Stackoverflow
002     Stackoverflow
002     Cross-Validated

I want to drop the duplicated identifier values within each unique msg_id.

This is my current approach, which is very slow:

import pandas as pd

acc_df = pd.DataFrame(columns=df.columns)
for _, group in df.groupby("msg_id"):
    # collect the rows whose identifier is repeated within this msg_id group
    dup_rows = group[group.duplicated("identifier")]
    if len(dup_rows) > 0:
        acc_df = pd.concat([dup_rows, acc_df], axis=0, ignore_index=False)
acc_df

I have a very large dataset with 500 million rows. Even after filtering for the msg_id values that have more than one identifier, the number of remaining rows is still very large.
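
For reference, that filtering step can be done with something like the following (a sketch, assuming "more than one identifier" means more than one distinct identifier per msg_id):

import pandas as pd

# keep only the msg_id groups that contain more than one distinct identifier
mask = df.groupby("msg_id")["identifier"].transform("nunique") > 1
filtered = df[mask]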

I am looking for any vectorized or faster approach, NOT INCLUDING multi-processing or threading.


There are 2 best solutions below

Panda Kim (BEST ANSWER)

Code

The problem is really to find rows where the combination of the two columns is duplicated; no grouping is needed. That can be done as follows:

df[df.duplicated(['msg_id', 'identifier'])]
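
If the goal is to drop those duplicated rows rather than select them (as the question states), the complementary one-liner does that; this variant is an addition for completeness, not part of the original answer:

# keep only the first occurrence of each (msg_id, identifier) pair
deduped = df.drop_duplicates(subset=['msg_id', 'identifier'], keep='first')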
who-cares2023

You can use vectorized operations in Pandas rather than explicit loops, which should be much faster than your current approach.

import pandas as pd

data = {
    'msg_id': ['001', '001', '002', '002'],
    'identifier': ['Stackoverflow', 'Stackoverflow', 'Stackoverflow', 'Cross-Validated']
}
df = pd.DataFrame(data)

# sort so that duplicate (msg_id, identifier) pairs sit next to each other
df.sort_values(['msg_id', 'identifier'], inplace=True)

# mark every repeated (msg_id, identifier) pair after its first occurrence
df['is_duplicated'] = df.duplicated(subset=['msg_id', 'identifier'], keep='first')

# keep only the first occurrences and drop the helper column
result = df[~df['is_duplicated']].drop(columns=['is_duplicated'])
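
On the sample data above, printing result should give roughly the following (indices reflect the original row positions; exact spacing may differ):

print(result)
#   msg_id       identifier
# 0    001    Stackoverflow
# 3    002  Cross-Validated
# 2    002    Stackoverflow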