I am trying to figure out how to calculate days between two events

95 Views Asked by At

I have a list of events. Every case (case ID) will eventually go through the events( event A, Event B). I have columns containing time stamps for the respective events. I am trying to find the days between Event A and Event B based on the below using Python. However, some cases go through event a more than one time, and I need the time lapse to be based on the latest date. In the example below, latest date for Case ID 1 would be 1/20/23, ignoring the previous two occurrences of EVENT A for this case ID.

CaseID EVENT DATE
1 A 1/1/23
1 A 1/3/23
1 A 1/20/23
1 B 2/1/23
8 A 1/2/23
100 A 3/1/23
100 B 3/2/23
35 A 2/13/23
35 B 2/27/23
6 A 2/14/23
33 A 2/26/23
2 A 3/4/23
2 B 4/30/23

I have tried sorting by duplicates, but I’m just not as comfortable using Python as SQL, and I’m unable to use SQL. Also trying to have to avoid dropping duplicates manually. The output I am expecting would look like the below: |caseID|busdaysbweventAB|

5

There are 5 best solutions below

0
Corralien On

You can use:

# Convert to datetime64 if needed
df['DATE'] = pd.to_datetime(df['DATE'])

# Pandas part: reshape your dataframe
out = df.sort_values('DATE').pivot_table(index='CaseID', columns='EVENT', values='DATE', aggfunc='last')
m = out.notna().all(axis=1)

# Numpy part: compute business day
arr = out[m].values.astype('datetime64[D]')
out.loc[m, 'bdays'] = np.busday_count(arr[:, 0], arr[:, 1])

Output:

>>> out
EVENT           A          B  bdays
CaseID                             
1      2023-01-20 2023-02-01    8.0
2      2023-03-04 2023-04-30   40.0
6      2023-02-14        NaT    NaN
8      2023-01-02        NaT    NaN
33     2023-02-26        NaT    NaN
100    2023-03-01 2023-03-02    1.0
0
SIGHUP On

Your input data is in the form of a pipe-delimited CSV file. It's very straightforward so no need for any module imports to handle it.

Build a dictionary keyed on CaseID. Each associated value should be a dictionary that could have both 'A' and 'B' keys. The dates should be in a list associated with those keys.

You'll need to parse the dates to find the maximum (most recent) then do a little bit of arithmetic.

Given the input file foo.csv with the following content:

|CaseID|EVENT|DATE|
|1|A|1/1/23|
|1|A|1/3/23|
|1|A|1/20/23|
|1|B|2/1/23|
|8|A|1/2/23|
|100|A|3/1/23|
|100|B|3/2/23|
|6|A|2/14/23|
|33|A|2/26/23|
|2|A|3/4/23|
|2|B|4/30/23|

...the code could look like this:

from datetime import datetime

def parse(d):
    return datetime.strptime(d, '%m/%d/%y')

mydict = dict()

with open('foo.csv') as data:
    next(data) # skip column headers
    for line in data:
        _, case, event, date, *_ = line.split('|')
        _date = parse(date)
        if case in mydict:
            mydict[case].setdefault(event, []).append(_date)
        else:
            mydict[case] = {event: [_date]}

for k, v in mydict.items():
    if 'A' in v and 'B' in v:
        maxa = max(v['A'])
        maxb = max(v['B'])
        print('CaseID', k, abs((maxa-maxb).days))

Output:

CaseID 1 12
CaseID 100 1
CaseID 2 57
0
user19077881 On

Assuming that the data is as demonstrated (ie caseID values and DATE sorted - else you can sort before further processing) then firstly read the CSV data into pandas DataFrame; then use:

#convert date strings to datetime format
df['DATE'] = pd.to_datetime(df['DATE'], dayfirst = False)
#drop duplicated A rows
df = df.drop_duplicates(subset = ['CaseID', 'EVENT'], keep = 'last')
#calculate difference of days
df['days'] = df.groupby('CaseID')['DATE'].diff().dt.days
#and finally lose the case without an A and a B event
df = df.dropna()

which results in:

    CaseID EVENT       DATE  days
3        1     B 2023-02-01  12.0
6      100     B 2023-03-02   1.0
10       2     B 2023-04-30  57.0
0
PARAK On

Another aproach is to group by event and id, get the max date, group again by id and get the difference between the dates using apply.

from io import StringIO

import pandas as pd

table = """
CaseID,EVENT,DATE
1,A,1/1/23
1,A,1/3/23
1,A,1/20/23
1,B,2/1/23
8,A,1/2/23
100,A,3/1/23
100,B,3/2/23
6,A,2/14/23
33,A,2/26/23
2,A,3/4/23
2,B,4/30/23
"""

df = pd.read_csv(StringIO(table), sep=',')
df['DATE'] = pd.to_datetime(df['DATE'])
d = df.groupby(['CaseID', 'EVENT']).max().reset_index().groupby('CaseID').aggregate({'DATE': lambda x: x.diff().to_numpy()[-1]})

OUTPUT:

          DATE
CaseID        
1      12 days
2      57 days
6          NaT
8          NaT
33         NaT
100     1 days
0
Yusuf Adel On
CaseID EVENT DATE
1 A 1/1/23
1 A 1/3/23
1 A 1/20/23
1 B 2/1/23
8 A 1/2/23
100 A 3/1/23
100 B 3/2/23
35 A 2/13/23
35 B 2/27/23
6 A 2/14/23
33 A 2/26/23
2 A 3/4/23
2 B 4/30/23

import pandas as pd


def get_time_lapsed(df, case_id, event, date):
    df = df.sort_values(by=[case_id, date])
    df["time_lapsed"] = df.groupby(case_id)[date].diff()
    return df
def test_get_time_lapsed():
    df = pd.DataFrame(
        {
            "CaseID": [1, 1, 1, 1, 8, 100, 100, 35, 35, 6, 33, 2, 2],
            "EVENT": ["A", "A", "A", "B", "A", "A", "B", "A", "B", "A", "A", "A", "B"],
            "DATE": [
                "1/1/23",
                "1/3/23",
                "1/20/23",
                "2/1/23",
                "1/2/23",
                "3/1/23",
                "3/2/23",
                "2/13/23",
                "2/27/23",
                "2/14/23",
                "2/26/23",
                "3/4/23",
                "4/30/23",
            ],
        }
    )
    df["DATE"] = pd.to_datetime(df["DATE"])
    df = get_time_lapsed(df, "CaseID", "EVENT", "DATE")
    print(df)

    CaseID EVENT       DATE time_lapsed
0        1     A 2023-01-01         NaT
1        1     A 2023-01-03      2 days
2        1     A 2023-01-20     17 days
3        1     B 2023-02-01     12 days
11       2     A 2023-03-04         NaT
12       2     B 2023-04-30     57 days
9        6     A 2023-02-14         NaT
4        8     A 2023-01-02         NaT
10      33     A 2023-02-26         NaT
7       35     A 2023-02-13         NaT
8       35     B 2023-02-27     14 days
5      100     A 2023-03-01         NaT
6      100     B 2023-03-02      1 days