High performing dataframe join in Python


I have two data frames: one has a start date and an end date, the other has just a date. Essentially, one frame holds the groups and the other holds the child data. I want to join every date that falls between a start date and an end date. Below is the code I am trying, but understandably it takes quite a long time since I have a lot of data.

import pandas as pd

def is_between_dates(date, start, end):
    return start <= date <= end

result_df = pd.DataFrame()
for idx, row in alarm_data.iterrows():
    # Flag events whose Date falls inside this alarm's window
    mask = event_data['Date'].apply(lambda x: is_between_dates(x, row['AlarmStart'], row['AlarmEnd']))
    temp_df2 = event_data[mask].copy()
    temp_df2['AlarmStart'] = row['AlarmStart']
    temp_df2['AlarmEnd'] = row['AlarmEnd']
    # Rebuilds result_df from scratch on every iteration
    result_df = pd.concat([result_df, temp_df2])

I have about 2 million rows in the event data frame and 200,000 rows in the alarm data frame, which is expected. I don't want to use Spark for this, as it would create an infrastructure burden.

Some direction would be helpful.


There are 2 answers below.

Answer from Krishnadev N

Original answer

Your question needs to be rephrased, as it is not obvious what you want to achieve without reading it multiple times and inspecting the code. Nevertheless, if I understood correctly, you have a large event_data with a Date column, and you want to attach the corresponding AlarmStart and AlarmEnd from alarm_data to each event.

The main performance issue in your code is pandas.concat, which creates a brand-new dataframe every time it is called, 200,000 times in your case. This is clearly inefficient, since a large amount of memory is copied over and over. Moreover, DataFrame.iterrows() is also really slow on any large dataframe and is heavily discouraged (read more about it here).
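Even before restructuring the algorithm, the concat cost alone can be cut by collecting the pieces in a plain list and concatenating once at the end. A minimal sketch of that pattern, keeping your loop otherwise intact (so iterrows remains the other bottleneck):

pieces = []
for idx, row in alarm_data.iterrows():
    # Series.between is inclusive on both ends by default
    mask = event_data['Date'].between(row['AlarmStart'], row['AlarmEnd'])
    temp = event_data[mask].copy()
    temp['AlarmStart'] = row['AlarmStart']
    temp['AlarmEnd'] = row['AlarmEnd']
    pieces.append(temp)
result_df = pd.concat(pieces, ignore_index=True)  # one concat instead of 200,000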

What I would suggest instead is to make a copy of event_data (preferably with placeholder columns for AlarmStart and AlarmEnd) and fill it in place. I assume the intervals between AlarmStart and AlarmEnd do not overlap, so that each event can be assigned these values uniquely. The steps:

1. Build your list of bins from alarm_data's AlarmStart and AlarmEnd columns using pandas.IntervalIndex.from_arrays.
2. Pass the bins to pandas.cut as pd.cut(event_data['Date'], bins) to categorise your events; when bins is an IntervalIndex, pd.cut returns the matching interval for each event (the labels argument is ignored in that case).
3. Take AlarmStart and AlarmEnd from each interval's left and right endpoints and write them into event_data.

Note that pandas.cut does not accept an overlapping IntervalIndex, so this is only an option for performance if there is no overlap between your intervals. The intervals might also need to be ordered. The Example section below walks through these steps.

I hope the explanation was clear enough. Read the examples from the documentation and try playing with a subset of your data first.


Example

Following comments that this does not really answer the question, I use a dataset similar to the other answer's for demonstration. Note that this WILL NOT work if your intervals overlap (you did not specify this in the question); a quick overlap check is sketched below. The other answer is more general, but I believe this one is quicker if the intervals do not overlap, although I haven't tested either solution on a large dataset.
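Before committing to this approach, it may be worth checking your intervals for overlap up front, since pd.cut will refuse an overlapping IntervalIndex. A minimal sketch, assuming alarm_data has the AlarmStart and AlarmEnd columns from the question:

import pandas as pd

# Bail out early if any two alarm intervals overlap
bins = pd.IntervalIndex.from_arrays(alarm_data['AlarmStart'], alarm_data['AlarmEnd'])
if bins.is_overlapping:
    raise ValueError('Intervals overlap; pd.cut cannot bin against this IntervalIndex')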

In [1]: import numpy as np
   ...: import pandas as pd

In [2]: edges = np.sort(np.random.randint(1, 100, 10))

In [3]: alarm_data = pd.DataFrame(
   ...:     {
   ...:         "AlarmStart": edges[:-1],
   ...:         "AlarmEnd": edges[1:],
   ...:     }
   ...: )
   ...: 
   ...: event_data = pd.DataFrame(
   ...:     {
   ...:         "Date": np.random.randint(edges.min(), edges.max(), 7),
   ...:         "Event": ["A", "B", "C", "D", "E", "F", "G"],
   ...:     }
   ...: )

In [4]: bins = pd.IntervalIndex.from_arrays(alarm_data['AlarmStart'], alarm_data['AlarmEnd'])

In [5]: alarm_data
Out[5]: 
   AlarmStart  AlarmEnd
0           1         5
1           5        13
2          13        18
3          18        21
4          21        33
5          33        46
6          46        47
7          47        76
8          76        87

In [6]: event_data
Out[6]: 
   Date Event
0    42     A
1    18     B
2    29     C
3    49     D
4    15     E
5    44     F
6    62     G

In [7]: bins
Out[7]: IntervalIndex([(1, 5], (5, 13], (13, 18], (18, 21], (21, 33], (33, 46], (46, 47], (47, 76], (76, 87]], dtype='interval[int64, right]')

In [8]: res = pd.cut(event_data['Date'], bins)

In [9]: event_data[['AlarmStart', 'AlarmEnd']] = [[x.left, x.right] for x in res]

In [10]: event_data
Out[10]: 
   Date Event  AlarmStart  AlarmEnd
0    42     A          33        46
1    18     B          13        18
2    29     C          21        33
3    49     D          47        76
4    15     E          13        18
5    44     F          33        46
6    62     G          47        76
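One caveat about In [9]: pd.cut returns NaN for any Date that falls outside every interval, and x.left would then fail with an AttributeError. A minimal sketch of a guard, assuming the res from In [8] above:

# Only fill the rows that actually matched an interval
valid = res.notna()
event_data.loc[valid, 'AlarmStart'] = [iv.left for iv in res[valid]]
event_data.loc[valid, 'AlarmEnd'] = [iv.right for iv in res[valid]]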
Answer from Teemu Risikko

A solution

You're doing quite heavy operations on each row, so it's no wonder it takes time.

If I have understood your question correctly, here's one solution that's about a third faster.

alarm_data["events"] = alarm_data.apply(lambda row: event_data[(row['AlarmStart'] <= event_data['Date']) & (event_data['Date'] <= row['AlarmEnd'])].index.to_list(), axis=1)

result_df = alarm_data.explode("events").join(event_data, on="events")

Explanation

I used this as a sample data:

alarm_data = pd.DataFrame(
    {
        "AlarmStart": [1, 3, 5, 2, 5, 2, 1],
        "AlarmEnd": [3, 4, 6, 4, 6, 3, 2]
    }
)

event_data = pd.DataFrame(
    {
        "Date": [1, 2, 3, 4, 5, 6, 7],
        "Event": ["A", "B", "C", "D", "E", "F", "G"],
    }
)

This solution applies the filter directly for each row of alarm_data without an explicit Python loop. Getting rid of the second apply over event_data also helps, since the date comparison is vectorised instead.

The alarm_data.apply(lambda row: ...) call collects, for each row of alarm_data, the indices of the matching events from event_data as a list. That step looks roughly like this:

   AlarmStart  AlarmEnd     events
0           1         3  [0, 1, 2]
1           3         4     [2, 3]
2           5         6     [4, 5]
...

Next we need to explode the events to new rows with .explode():

   AlarmStart  AlarmEnd events
0           1         3      0
0           1         3      1
0           1         3      2
1           3         4      2
...

And finally, join the rest of event_data onto the dataframe using the events column as the join key; DataFrame.join matches the on column against the other frame's index:

.join(event_data, on="events")

The result looks roughly like this:

   AlarmStart  AlarmEnd events  Date Event
0           1         3      0     1     A
0           1         3      1     2     B
0           1         3      2     3     C
1           3         4      2     3     C
...