I have a Pandas DataFrame with the following structure:
import pandas as pd
data = {
'glob_order': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
'trans': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
'chain': [1, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2],
'date': ['1/08/2023', '2/08/2023', '3/08/2023', '4/08/2023', '5/08/2023', '6/08/2023', '7/08/2023', '8/08/2023', '9/08/2023', '10/08/2023', '11/08/2023']
}
df = pd.DataFrame(data)
# Convert 'date' column to datetime
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
print(df)
I want to perform two operations:
- Group the DataFrame by the 'trans' and 'chain' columns and select the very first row from each group.
- Create a new column named 'delta' that represents the difference in days between the current date and the previous date within each group.
I tried the following code:
(df
.groupby(['trans', 'chain'])
.first()
.assign(
delta=lambda x: (x['date'] - x['date'].shift(1)).dt.total_seconds() / (60*60*24),
).reset_index()
)
However, the output I'm getting is not as expected. It seems to insert NaNs in the first delta calculation for each individual group, which is not what I want
| trans | chain | glob_order | date | delta |
|---|---|---|---|---|
| A | 1 | 1 | 2023-08-01 | NaN |
| A | 2 | 3 | 2023-08-03 | 2.0 |
| B | 1 | 5 | 2023-08-05 | 2.0 |
| B | 2 | 7 | 2023-08-07 | 2.0 |
| C | 1 | 8 | 2023-08-08 | 1.0 |
| C | 2 | 10 | 2023-08-10 | 2.0 |
I'd like to understand why this is happening and what I need to do to get the desired output.
This is my desired output
| trans | chain | glob_order | date | delta |
|---|---|---|---|---|
| A | 1 | 1 | 2023-08-01 | NaN |
| A | 2 | 3 | 2023-08-03 | 2.0 |
| B | 1 | 5 | 2023-08-05 | Nan |
| B | 2 | 7 | 2023-08-07 | 2.0 |
| C | 1 | 8 | 2023-08-08 | Nan |
| C | 2 | 10 | 2023-08-10 | 2.0 |
I am seeking a solution using a method similar to chaining for clarity and readability since I'm very new to Python.
If I understand you correctly, you want:
"trans"/"chain", get first row"trans"and get difference between daysI don't think simple chaining is possible, but you can use
:=operator (but I recommend, for readability, to split it two separate commands):Prints: