Find if the matching date exists

98 Views Asked by At

df1:

tradeDate securityCode
20190319 000001
20220505 000001

df2:

payoutDate securityCode
20180708 000001
20190221 000002

df3:

endDate securityCode
20200203 000003
20210330 000004

I have 3 data frames, and the structures look like above. Size of the first data frame is (3829931, 2), size of second df is (35689, 2) and size of third df is (10329, 2). I am trying use python to do the following:

First initialized a new column named "condition". For a security code from df1, if the corresponding trade date can be found in the past 2 year range in df2 or df3, then set the condition to 1. For example, for a security code 000001 on trade date 20190319, check if there is a pair in df2 or df3 such that security code = 000001 and payout date or end date is between 20170319 and 20190319. If the requirement is met, then set condition to 1.

for the above example data frames, the expected output should be the first row's condition is 1

The attempts I made:

def check_condition(row):
    # Calculate the date two years ago
    two_years_ago = row['TRADEDATE'] - pd.DateOffset(years=2)

    # Check if the code exists in df2 or df3 within the date range
    condition_2 = ((df2['code'] == row['code']) & 
                    (df2['PAYOUTDATE'] >= two_years_ago) & 
                     (df2['PAYOUTDATE'] <= row['TRADEDATE'])).any()
    condition_3 = ((df3['code'] == row['code']) & 
                      (df3['ENDDATE'] >= two_years_ago) & 
                      (df3['ENDDATE'] <= row['TRADEDATE'])).any()

    # Return 1 if the condition is met, otherwise return 0
    return 1 if condition_2 or condition_3 else 0

df1['condition'] = df1.apply(check_condition, axis=1)

Above are the codes I wrote. But when I run it on my local environment, it is extremely slow. I wonder if there are more efficient ways to do this? Thanks!

2

There are 2 best solutions below

1
e-motta On

Instead of using apply, you can use vectorized operations, which are more efficient:

# Define the range of 2 years
timedelta = pd.to_datetime('today') - pd.DateOffset(years=2)

# Check conditions using vectorized operations
df2_conditions = ((df2['payoutDate'] >= timedelta) & (df2['payoutDate'] <= df1['tradeDate']))
df3_conditions = ((df3['endDate'] >= timedelta) & (df3['endDate'] <= df1['tradeDate']))

# Fill 'condition'
df1.loc[df2_conditions.any() | df3_conditions.any(), 'condition'] = 1
df1['condition'] = df1.loc[:, 'condition'].fillna(0)

# Drop empty rows
df1 = df1.dropna()

The result will be:

   tradeDate securityCode  condition
0 2019-03-19       000001        1.0
1 2022-05-05       000001        0.0
0
Nyps On

From the data examples that you provided, it seems like it might make sense to do the comparison in some steps, e.g.:

  1. For df2 find all rows where the same securityCode is found
  2. Only for those rows, check the dates
  3. If one is found -> return 1 without checking df3, else repeat the process for df3