Pandas: iterating over a groupby not working as expected


I have a DataFrame with many columns. Each row is essentially a train ticket: it has a valid_from and a valid_to date plus some information about the train and the route. There is a non-unique key that encodes information such as the ticket provider and the start and end points. I grouped the whole DataFrame by this non-unique key, so each group now holds all tickets for a specific route from a specific provider. I then sort each group by valid_from and drop rows that are duplicated in all columns except valid_from and valid_to. This should give me the actual contracts for train tickets without any duplicates. This is my code; my_col contains all columns except the time columns.

grouped_df = network_dfs[0].groupby('route_key')
list_dfs_grouped = []

for name, group in grouped_df:
    group.sort_values('valid_from', ascending=True, inplace=True)
    dup_first = group.drop_duplicates(subset=my_col, keep="first")

    dup_last = group.drop_duplicates(subset=my_col, keep="last")

    dup_first.loc[:, 'valid_from'] = dup_last['valid_to']
    list_dfs_grouped.append(dup_first)


clean_df = pd.concat(list_dfs_grouped, axis=0, join='inner')
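
For reference, here is a minimal self-contained sketch of the group, sort, and deduplicate steps on toy data. The column names `provider` and `price` and all values are hypothetical stand-ins for the real columns, and the sketch sorts a copy of each group rather than sorting in place, so it is not the code above verbatim. It covers only the steps up to `drop_duplicates`.

```python
import pandas as pd

# Hypothetical toy tickets; 'provider' and 'price' stand in for the real columns.
tickets = pd.DataFrame({
    "route_key": ["A-B", "A-B", "A-B", "C-D"],
    "provider": ["x", "x", "x", "y"],
    "price": [10, 10, 12, 7],
    "valid_from": pd.to_datetime(
        ["2020-02-01", "2020-01-01", "2020-03-01", "2020-01-01"]),
    "valid_to": pd.to_datetime(
        ["2020-02-29", "2020-01-31", "2020-03-31", "2020-12-31"]),
})

# All columns except the two time columns, as described for my_col.
my_col = [c for c in tickets.columns if c not in ("valid_from", "valid_to")]

kept = []
for name, group in tickets.groupby("route_key"):
    # sort_values returns a new frame here, so the group itself is not mutated
    ordered = group.sort_values("valid_from", ascending=True)
    # keep the earliest row of each set of rows that agree on my_col
    kept.append(ordered.drop_duplicates(subset=my_col, keep="first"))

deduped = pd.concat(kept, axis=0)
print(deduped)
```

On this toy data the two "A-B" tickets with the same provider and price collapse into the one with the earlier valid_from, while the ticket with a different price is kept.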

The problem is that the code does not work as expected in some cases: drop_duplicates sometimes fails to identify duplicates, so duplicated entries are kept.

What is weird about this behavior is that if I run the code below on a single group:

first_group = grouped_df.get_group('my_key')
first_group = first_group.sort_values('valid_from', ascending=True)
dup_first = first_group.drop_duplicates(subset=my_col, keep="first")

dup_last = first_group.drop_duplicates(subset=my_col, keep="last")
dup_first['valid_from'] = dup_last['valid_to']

The code works as expected and all duplicates are removed from that same group. All columns have the correct dtypes, and I use the same grouped DataFrame in both cases. When using the iterator, drop_duplicates in some cases simply returns too many rows.
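
One way the two paths could be compared directly is sketched below on hypothetical toy data. The `dedupe` helper, the `provider` and `price` columns, and the key "A-B" are illustrative assumptions, not part of the real data. The idea is to run the same function on a group taken from the iterator and on the same group taken from `get_group`, check whether the results match, and use `duplicated(keep=False)` to see which rows pandas treats as duplicates of each other.

```python
import pandas as pd

def dedupe(group, subset):
    """Sort a copy by valid_from, then keep the first row of each duplicate set."""
    return group.sort_values("valid_from").drop_duplicates(subset=subset, keep="first")

# Hypothetical toy tickets; 'provider' and 'price' stand in for the real columns.
tickets = pd.DataFrame({
    "route_key": ["A-B", "A-B", "A-B", "C-D"],
    "provider": ["x", "x", "x", "y"],
    "price": [10, 10, 12, 7],
    "valid_from": pd.to_datetime(
        ["2020-02-01", "2020-01-01", "2020-03-01", "2020-01-01"]),
    "valid_to": pd.to_datetime(
        ["2020-02-29", "2020-01-31", "2020-03-31", "2020-12-31"]),
})
my_col = [c for c in tickets.columns if c not in ("valid_from", "valid_to")]
grouped = tickets.groupby("route_key")

key = "A-B"  # hypothetical key to inspect

# Path 1: the group as delivered by iterating over the groupby object
from_loop = {name: dedupe(group, my_col) for name, group in grouped}[key]

# Path 2: the same group fetched with get_group
from_get_group = dedupe(grouped.get_group(key), my_col)

# True when both paths keep exactly the same rows
same = from_loop.equals(from_get_group)
print(same)

# Boolean mask of rows that agree with at least one other row on my_col
flagged = grouped.get_group(key).duplicated(subset=my_col, keep=False)
print(flagged)
```

If `same` comes out False on the real data, the rows where `flagged` is unexpectedly False are the ones to inspect value by value.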

Any suggestions on what I can do to fix this behavior?

Many thanks in advance.
