Find if the matching date exists

98 Views Asked by Y L At 01 February 2024 at 09:21

df1:

tradeDate	securityCode
20190319	000001
20220505	000001

df2:

payoutDate	securityCode
20180708	000001
20190221	000002

df3:

endDate	securityCode
20200203	000003
20210330	000004

I have 3 data frames, and the structures look like above. Size of the first data frame is (3829931, 2), size of second df is (35689, 2) and size of third df is (10329, 2). I am trying use python to do the following:

First initialized a new column named "condition". For a security code from df1, if the corresponding trade date can be found in the past 2 year range in df2 or df3, then set the condition to 1. For example, for a security code 000001 on trade date 20190319, check if there is a pair in df2 or df3 such that security code = 000001 and payout date or end date is between 20170319 and 20190319. If the requirement is met, then set condition to 1.

for the above example data frames, the expected output should be the first row's condition is 1

The attempts I made:

def check_condition(row):
    # Calculate the date two years ago
    two_years_ago = row['TRADEDATE'] - pd.DateOffset(years=2)

    # Check if the code exists in df2 or df3 within the date range
    condition_2 = ((df2['code'] == row['code']) & 
                    (df2['PAYOUTDATE'] >= two_years_ago) & 
                     (df2['PAYOUTDATE'] <= row['TRADEDATE'])).any()
    condition_3 = ((df3['code'] == row['code']) & 
                      (df3['ENDDATE'] >= two_years_ago) & 
                      (df3['ENDDATE'] <= row['TRADEDATE'])).any()

    # Return 1 if the condition is met, otherwise return 0
    return 1 if condition_2 or condition_3 else 0

df1['condition'] = df1.apply(check_condition, axis=1)

Above are the codes I wrote. But when I run it on my local environment, it is extremely slow. I wonder if there are more efficient ways to do this? Thanks!

Original Q&A

There are 2 best solutions below

e-motta On 01 February 2024 at 10:16

Instead of using apply, you can use vectorized operations, which are more efficient:

# Define the range of 2 years
timedelta = pd.to_datetime('today') - pd.DateOffset(years=2)

# Check conditions using vectorized operations
df2_conditions = ((df2['payoutDate'] >= timedelta) & (df2['payoutDate'] <= df1['tradeDate']))
df3_conditions = ((df3['endDate'] >= timedelta) & (df3['endDate'] <= df1['tradeDate']))

# Fill 'condition'
df1.loc[df2_conditions.any() | df3_conditions.any(), 'condition'] = 1
df1['condition'] = df1.loc[:, 'condition'].fillna(0)

# Drop empty rows
df1 = df1.dropna()

The result will be:

   tradeDate securityCode  condition
0 2019-03-19       000001        1.0
1 2022-05-05       000001        0.0

Nyps On 01 February 2024 at 10:25

From the data examples that you provided, it seems like it might make sense to do the comparison in some steps, e.g.:

For df2 find all rows where the same securityCode is found
Only for those rows, check the dates
If one is found -> return 1 without checking df3, else repeat the process for df3

Find if the matching date exists

There are 2 best solutions below

Related Questions in PYTHON

Related Questions in PANDAS

Related Questions in ALGORITHM

Related Questions in TIME-SERIES

Related Questions in QUANTITATIVE-FINANCE

Trending Questions

Popular # Hahtags

Popular Questions