Finding Duplicated Values in Pandas Groupby Object


I have a Pandas DataFrame:

msg_id  identifier
001     Stackoverflow
001     Stackoverflow
002     Stackoverflow
002     Cross-Validated

I want to drop the duplicated identifier values within each unique msg_id.

This is my current approach, which is very slow:

import pandas as pd

acc_df = pd.DataFrame(columns=df.columns)
for _, group in df.groupby("msg_id"):
    # collect the rows whose identifier is repeated within this msg_id group
    dup_rows = group[group.duplicated("identifier")]
    if len(dup_rows) > 0:
        acc_df = pd.concat([dup_rows, acc_df], axis=0, ignore_index=False)
acc_df

I have a very large dataset with 500 million rows. Even after filtering for the msg_id values that have more than one identifier, the number of remaining rows is still very large.
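
For reference, that filtering step can be done with something like the following (a sketch, assuming "more than one identifier" means more than one distinct identifier per msg_id):

import pandas as pd

# keep only the msg_id groups that contain more than one distinct identifier
mask = df.groupby("msg_id")["identifier"].transform("nunique") > 1
filtered = df[mask]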

I am looking for any vectorized or faster approach, NOT INCLUDING multi-processing or threading.


There are 2 best solutions below

Panda Kim (BEST ANSWER)

Code

The problem is really to find rows where the combination of the two columns is duplicated; no grouping is needed. That can be done as follows:

df[df.duplicated(['msg_id', 'identifier'])]
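
If the goal is to drop those duplicated rows rather than select them (as the question states), the complementary one-liner does that; this variant is an addition for completeness, not part of the original answer:

# keep only the first occurrence of each (msg_id, identifier) pair
deduped = df.drop_duplicates(subset=['msg_id', 'identifier'], keep='first')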
who-cares2023

You can use vectorized operations in Pandas rather than explicit loops, which should be much faster than your current approach.

import pandas as pd

data = {
    'msg_id': ['001', '001', '002', '002'],
    'identifier': ['Stackoverflow', 'Stackoverflow', 'Stackoverflow', 'Cross-Validated']
}
df = pd.DataFrame(data)

# sort so that duplicate (msg_id, identifier) pairs sit next to each other
df.sort_values(['msg_id', 'identifier'], inplace=True)

# mark every repeated (msg_id, identifier) pair after its first occurrence
df['is_duplicated'] = df.duplicated(subset=['msg_id', 'identifier'], keep='first')

# keep only the first occurrences and drop the helper column
result = df[~df['is_duplicated']].drop(columns=['is_duplicated'])
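
On the sample data above, printing result should give roughly the following (indices reflect the original row positions; exact spacing may differ):

print(result)
#   msg_id       identifier
# 0    001    Stackoverflow
# 3    002  Cross-Validated
# 2    002    Stackoverflow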