Lengths of overlapping time ranges listed by rows

168 Views Asked by At

I am using pandas version 1.0.5

The example dataframe below lists time intervals, recorded over three days, and I seek where some time intervals overlap every day.

Interim- for illustration

For example, one of the overlapping time across all the three dates (yellow highlighted) is 1:16 - 2:13. The other (blue highlighted) would be 18:45 - 19:00

So my expected output would be like: [57,15] because

  • 57 - Minutes between 1:16 - 2:13.
  • 15 - Minutes between 18:45 - 19:00

Please use this generator of the input dataframe:

import pandas as pd
dat1 = [
    ['2023-12-27','2023-12-27 00:00:00','2023-12-27 02:14:00'],
    ['2023-12-27','2023-12-27 03:16:00','2023-12-27 04:19:00'],
    ['2023-12-27','2023-12-27 18:11:00','2023-12-27 20:13:00'],
    ['2023-12-28','2023-12-28 01:16:00','2023-12-28 02:14:00'],
    ['2023-12-28','2023-12-28 02:16:00','2023-12-28 02:28:00'],
    ['2023-12-28','2023-12-28 02:30:00','2023-12-28 02:56:00'],
    ['2023-12-28','2023-12-28 18:45:00','2023-12-28 19:00:00'],
    ['2023-12-29','2023-12-29 01:16:00','2023-12-29 02:13:00'],
    ['2023-12-29','2023-12-29 04:16:00','2023-12-29 05:09:00'],
    ['2023-12-29','2023-12-29 05:11:00','2023-12-29 05:14:00'],
    ['2023-12-29','2023-12-29 18:00:00','2023-12-29 19:00:00']
       ]
df = pd.DataFrame(dat1,columns = ['date','Start_tmp','End_tmp'])
df["Start_tmp"] = pd.to_datetime(df["Start_tmp"])
df["End_tmp"] = pd.to_datetime(df["End_tmp"])
2

There are 2 best solutions below

0
OCa On BEST ANSWER

This solution uses:

  • numpy, no uncommon Python modules, so using Python 1.0.5 you should, hopefully, be in the clear,
  • no nested loops to care for speed issues with growing dataset,

Method:

  • Draw the landscape of overlaps
  • Then select the overlaps corresponding to the number of documented days,
  • Finally describe the overlaps in terms of their lengths

Number of documented days: (as in Python: Convert timedelta to int in a dataframe)

n = 1 + ( max(df['End_tmp']) - min(df['Start_tmp']) ).days
n
3

Additive landscape:

# initial flat whole-day landcape (height: 0)
L = np.zeros(24*60, dtype='int')
# add up ranges: (reused @sammywemmy's perfect formula for time of day in minutes)
for start, end in zip(df['Start_tmp'].dt.hour.mul(60) + df['Start_tmp'].dt.minute,  # Start_tmp timestamps expressed in minutes
                      df['End_tmp'].dt.hour.mul(60)   + df['End_tmp'].dt.minute):   # End_tmp timestamps expressed in minutes
    L[start:end+1] += 1

plt.plot(L)
plt.hlines(y=[2,3],xmin=0,xmax=1400,colors=['green','red'], linestyles='dashed')
plt.xlabel('time of day (minutes)')
plt.ylabel('time range overlaps')

additive landscape (Please excuse the typo: these are obviously minutes, not seconds)

Keep only overlaps over all days: (red line, n=3)

# Reduce heights <n to 0 because not overlaping every day
L[L<n]=0
# Simplify all greater values to 1 because only their presence matters
L[L>0]=1
# Now only overlaps are highlighted
# (Actually this latest line is disposable, provided we filter all but the overlaps of rank n. Useful only if you were to include lower overlaps)

Extract overlap ranges and their lengths

# Highlight edges of overlaping intervals
D = np.diff(L)
# Describe overlaps as ranges
R = list(zip([a[0]   for a in np.argwhere(D>0)],  # indices where overlaps *begin*, with scalar indices instead of arrays
             [a[0]-1 for a in np.argwhere(D<0)])) # indices where overlaps *end*, with scalar indices instead of arrays
R
[(75, 132), (1124, 1139)]
# Finally their lengths
[b-a for a,b in R]

Final output: [57, 15]

2
sammywemmy On

I assume you are only interested in the hours and minutes, and not the entire timestamp. If my assumption is correct, you can execute a range join with conditional_join from pyjanitor to get your results:

# pip install pyjanitor
import janitor
import pandas as pd
A=(df
   .assign(
       start=df.Start_tmp.dt.hour.mul(60)+ df.Start_tmp.dt.minute,
       end=df.End_tmp.dt.hour.mul(60)+df.End_tmp.dt.minute)
   )
(A
.conditional_join(
    A,
    ('start','start','>'),
    ('end','end','<'),
    df_columns=['start','end'],
    right_columns=None)
.pipe(lambda df: df.end-df.start)
)

0    15
1    57
dtype: int32