Function to filter a dataframe based on multiple conditions with groupby and dropping of duplicates


I have a dataframe and would like to create a function that keeps rows or drops duplicates based on certain conditions.

original dataframe

year  year_month   manager_movement    email_address
2022  2022_jun     transfer_in         [email protected]
2022  2022_jun     no_change           [email protected]
2022  2022_jul     no_change           [email protected]
2022  2022_jul     no_change           [email protected]
2022  2022_aug     no_change           [email protected]
2022  2022_aug     no_change           [email protected]
2022  2022_sep     transfer_out        [email protected]
2022  2022_sep     no_change           [email protected]
2022  2022_oct     transfer_in         [email protected]
2022  2022_oct     no_change           [email protected]
2023  2023_jan     no_change           [email protected]
2023  2023_feb     no_change           [email protected]

Expected dataframe

year  year_month   manager_movement    email_address
2022  2022_jun     transfer_in         [email protected]
2022  2022_oct     transfer_in         [email protected]
2022  2022_oct     no_change           [email protected]
2023  2023_feb     no_change           [email protected]

The logic to get the dataframe is:

1st: if df['manager_movement'] == 'transfer_out', remove those rows.

2nd: else if df['manager_movement'] == 'transfer_in', keep only the 'transfer_in' rows and drop that employee's 'no_change' rows.

3rd: else if df['manager_movement'] == 'no_change', group by 'year' and 'email_address', drop duplicates, and keep the last row.

Here was my attempt, but I can't seem to get my desired output. I'd appreciate any help or comments, thank you.

def get_required_rows(x):
    if x['manager_movement'] == 'transfer_out':
        return x.loc[x['manager_movement'] != 'transfer_out']
    elif x['manager_movement'] == 'transfer_in':
        return x
    elif x['manager_movement'] == 'no_change':
        return x.drop_duplicates(['year', 'email_address'], keep='last')

df_filtered = df.apply(get_required_rows, axis=1)
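I suspect part of the problem is that apply with axis=1 passes each row in as a Series, so the function only ever sees one row at a time and can't drop or deduplicate other rows. For instance, the transfer_out rule alone works as a boolean mask over the whole frame (a minimal sketch on a made-up three-row frame):

```python
import pandas as pd

# Made-up minimal frame just to illustrate the point
mini = pd.DataFrame({
    'year': [2022, 2022, 2022],
    'manager_movement': ['transfer_in', 'no_change', 'transfer_out'],
    'email_address': ['[email protected]'] * 3,
})

# apply(..., axis=1) hands the function one row (a Series) at a time,
# so it cannot drop *other* rows; a mask over the whole frame can:
kept = mini[mini['manager_movement'] != 'transfer_out']
print(len(kept))  # 2
```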

There are 2 answers below.

wjie08 (accepted answer)

This worked for me but isn't that elegant; happy for edits to make it better.

import pandas as pd

# Filter the df to only retain the transfer_in rows, and for those
# (year, email_address) groups, remove any no_change rows
group_results = []

for i, group in df.groupby(['year', 'email_address']):

    transfer_in = group['manager_movement'] == 'transfer_in'

    if transfer_in.any():
        transfer_in_group = group.loc[transfer_in]
    else:
        transfer_in_group = group

    group_results.append(transfer_in_group)

df_sub = pd.concat(group_results)


# Filter the df again: for employees with no_change only,
# take just the latest row for each year

group_results2 = []

for i, group in df_sub.groupby(['year', 'email_address']):
    no_change = group['manager_movement'] == 'no_change'

    if no_change.any():
        no_change_group = group.loc[no_change].drop_duplicates(['year', 'email_address'], keep='last')
    else:
        no_change_group = group

    group_results2.append(no_change_group)

df_sub2 = pd.concat(group_results2)


Maria K

How about doing the filtering separately and then concatenating the results:

pd.concat([
    df[df["manager_movement"] == "transfer_in"],
    df[df["manager_movement"] == "no_change"].drop_duplicates(["year", "email_address"], keep='last')
])

Output:

    year year_month manager_movement         email_address
0   2022   2022_jun      transfer_in    [email protected]
8   2022   2022_oct      transfer_in    [email protected]
4   2022   2022_aug        no_change    [email protected]
9   2022   2022_oct        no_change  [email protected]
11  2023   2023_feb        no_change  [email protected]

(By the way, your desired output doesn't seem to match the requirements: one no_change line for [email protected] is missing.)