Problem
I'm looking for suggestions on how to make this more pythonic and improve the efficiency.
I have a dataframe with events, each having at minimum a start and end timestamp. I am expanding the number of records so that the new table has one record for each hour the interval overlaps.
This is basically the same use case as the IntervalMatch function found in QlikView.
Example: An event from 18:00-20:00 expands to two distinct records, one for 18:00-19:00 and another for 19:00-20:00.
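To make that concrete, here is a minimal sketch of generating the hour boundaries and clipping with plain pandas. This is only an illustration of the expansion, not the solution below; the inclusive keyword assumes pandas >= 1.4 (older versions use closed):

import pandas as pd

start = pd.Timestamp('2016-01-01 18:00:00')
end = pd.Timestamp('2016-01-01 20:00:00')

# Hour boundaries covering the event: 18:00 and 19:00
boundaries = pd.date_range(start.floor('h'), end, freq='h', inclusive='left')

# Clip each hourly slot to the original interval
slots = [(max(t, start), min(t + pd.Timedelta(hours=1), end))
         for t in boundaries]
# -> [(18:00, 19:00), (19:00, 20:00)]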
Current solution
I have a fully working solution, but I think it is rather ugly, and it is quite slow on large datasets (>100k rows and 10-20 columns).
import pandas as pd
from datetime import timedelta

def interval_match(df):

    intervals = []

    def perdelta(start, end, delta):
        # Yield every boundary from the top of the hour containing
        # `start` up to (but excluding) `end`, in steps of `delta`.
        curr = start.replace(minute=0, second=0)
        while curr < end:
            yield curr
            curr += delta

    def interval_split(x):
        # One record per hour boundary, with the first and last
        # sub-intervals clipped to the original Start/End.
        for t in perdelta(x.Start, x.End, timedelta(hours=1)):
            intervals.append([x.id,
                              x.Start,
                              x.End,
                              max(t, x.Start),
                              min(t + timedelta(hours=1), x.End)])

    df.apply(interval_split, axis=1)

    ndf = pd.DataFrame(intervals,
                       columns=['id',
                                'Start',
                                'End',
                                'intervalStart',
                                'intervalEnd'])
    ndf['Duration'] = ndf.intervalEnd - ndf.intervalStart

    return ndf
With some example data, the function interval_match() can be used like this:
# Some example data
df = pd.DataFrame({'End': {0: pd.Timestamp('2016-01-01 09:24:20')},
                   'Start': {0: pd.Timestamp('2016-01-01 06:56:10')},
                   'id': {0: 1234562}})
# Running the function
interval_match(df).to_dict()
# Output
{'Duration': {0: Timedelta('0 days 00:03:50'),
1: Timedelta('0 days 01:00:00'),
2: Timedelta('0 days 01:00:00'),
3: Timedelta('0 days 00:24:20')},
'End': {0: Timestamp('2016-01-01 09:24:20'),
1: Timestamp('2016-01-01 09:24:20'),
2: Timestamp('2016-01-01 09:24:20'),
3: Timestamp('2016-01-01 09:24:20')},
'Start': {0: Timestamp('2016-01-01 06:56:10'),
1: Timestamp('2016-01-01 06:56:10'),
2: Timestamp('2016-01-01 06:56:10'),
3: Timestamp('2016-01-01 06:56:10')},
'intervalEnd': {0: Timestamp('2016-01-01 07:00:00'),
1: Timestamp('2016-01-01 08:00:00'),
2: Timestamp('2016-01-01 09:00:00'),
3: Timestamp('2016-01-01 09:24:20')},
'intervalStart': {0: Timestamp('2016-01-01 06:56:10'),
1: Timestamp('2016-01-01 07:00:00'),
2: Timestamp('2016-01-01 08:00:00'),
3: Timestamp('2016-01-01 09:00:00')},
'id': {0: 1234562,
1: 1234562,
2: 1234562,
3: 1234562}}
My desire is to:
- Make this more efficient, preferably using built-in Pandas functions or some NumPy magic.
- Not have to deal with the columns individually as I do in interval_split today, but instead operate on and expand the entire dataframe.
I'd appreciate any suggestions or help.
I made a variant (inspired by your code) and it ran very slowly: about 5 minutes to process 20k rows of data. After profiling, the culprit was the .append. The trick is to put all the records into a dictionary and then use the DataFrame's from_dict method; for the same 20k rows, it completed in about 5 seconds (so ~60x faster). I've attached my code, which was inspired by yours; it is also generic over the column inputs (my test use and production use are different).
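The attached code itself isn't reproduced here, so the following is only a minimal sketch of the from_dict trick described above, assuming the same column names (id, Start, End) as the example; the helper name interval_match_dict is made up for illustration:

import pandas as pd
from datetime import timedelta

def interval_match_dict(df):
    # Accumulate each expanded record in a plain dict keyed by a
    # running integer index, instead of appending row by row.
    rows = {}
    i = 0
    for x in df.itertuples():
        curr = x.Start.replace(minute=0, second=0)
        while curr < x.End:
            rows[i] = {'id': x.id,
                       'Start': x.Start,
                       'End': x.End,
                       'intervalStart': max(curr, x.Start),
                       'intervalEnd': min(curr + timedelta(hours=1), x.End)}
            i += 1
            curr += timedelta(hours=1)
    # One from_dict call replaces thousands of incremental appends
    ndf = pd.DataFrame.from_dict(rows, orient='index')
    ndf['Duration'] = ndf.intervalEnd - ndf.intervalStart
    return ndf

Building the dict first means the frame is allocated once at the end, rather than the data being copied on every append.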