How to keep only first duplicated row from Pandas DataFrame tail?

45 Views Asked by At

I have a DataFrame like this:

import pandas as pd

df = pd.DataFrame({'id':[1,2,3,4,5,6,7,8,9,10], 'name': ['mary','mary','mary','tom','tom','john','sarah','tom','tom','tom'], 'age': [30,30,30,25,25,28,36,25,25,25]})

    id    name  age
0    1    mary   30
1    2    mary   30
2    3    mary   30
3    4     tom   25
4    5     tom   25
5    6    john   28
6    7   sarah   36
7    8     tom   25
8    9     tom   25
9   10     tom   25

There are multiple duplicated rows on it (not considering 'id' column).

I want to remove only the tail duplicated rows keeping the first one (not considering 'id' column).

I expect to get this:

    id    name  age
0    1    mary   30
1    2    mary   30
2    3    mary   30
3    4     tom   25
4    5     tom   25
5    6    john   28
6    7   sarah   36
7    8     tom   25

I couldn't find a way to resolve it with drop_duplicates.

1

There are 1 best solutions below

0
Panda Kim On BEST ANSWER

Code

cols = ['name', 'age'] # column to chk duplicate
grp = df[cols].ne(df[cols].shift()).any(axis=1).cumsum() # make duplicate group
cond = grp.shift().ne(grp.max())
out = df[cond]

out

   id   name  age
0   1   mary   30
1   2   mary   30
2   3   mary   30
3   4    tom   25
4   5    tom   25
5   6   john   28
6   7  sarah   36
7   8    tom   25

Intermediate

df                    grp   grp.shift()   cond <- grp.shift() != 5
   id   name  age    
0   1   mary   30      1       NaN        True
1   2   mary   30      1       1.0        True
2   3   mary   30      1       1.0        True
3   4    tom   25      2       1.0        True
4   5    tom   25      2       2.0        True
5   6   john   28      3       2.0        True
6   7  sarah   36      4       3.0        True
7   8    tom   25      5       4.0        True
8   9    tom   25      5       5.0        False
9  10    tom   25      5       5.0        False