I have a dataframe and would like to create a function to keep rows or drop duplicates based on certain conditions
original dataframe
year year_month manager_movement email_address
2022 2022_jun transfer_in [email protected]
2022 2022_jun no_change [email protected]
2022 2022_jul no_change [email protected]
2022 2022_jul no_change [email protected]
2022 2022_aug no_change [email protected]
2022 2022_aug no_change [email protected]
2022 2022_sep transfer_out [email protected]
2022 2022_sep no_change [email protected]
2022 2022_oct transfer_in [email protected]
2022 2022_oct no_change [email protected]
2023 2023_jan no_change [email protected]
2023 2023_feb no_change [email protected]
Expected dataframe
year year_month manager_movement email_address
2022 2022_jun transfer_in [email protected]
2022 2022_oct transfer_in [email protected]
2022 2022_oct no_change [email protected]
2023 2023_feb no_change [email protected]
The logic to get the dataframe is such 1st: if df['manager_movement'] == 'transfer_out', then remove rows
2nd: elseif df['manager_movement'] == 'transfer_in', then keep only the rows with 'transfer_in' and drop the other rows if there is 'no_change'.
3rd: elseif df['manager_movement'] == 'no_change', then groupby 'year' and 'email_address' and drop duplicates and keep last row
Here was my attempt but can't seem to get my desired output. Appreciate any help or comments, thank you.
def get_required_rows(x):
if x['manager_movement'] == 'transfer_out':
return x.loc[x['manager_movement']!='transfer_out']
elif x['manager_movement'] == 'transfer_in':
return x
elif x['manager_movement'] == 'No Change':
return x.drop_duplicates(['year','email_address'], keep='last')
end
df_filtered = df.apply(get_required_rows, axis=1)
This worked for me but isn't that elegant, happy for edits to make it better.