df1:
| tradeDate | securityCode |
|---|---|
| 20190319 | 000001 |
| 20220505 | 000001 |
df2:
| payoutDate | securityCode |
|---|---|
| 20180708 | 000001 |
| 20190221 | 000002 |
df3:
| endDate | securityCode |
|---|---|
| 20200203 | 000003 |
| 20210330 | 000004 |
I have 3 data frames, and the structures look like above. Size of the first data frame is (3829931, 2), size of second df is (35689, 2) and size of third df is (10329, 2). I am trying use python to do the following:
First initialized a new column named "condition". For a security code from df1, if the corresponding trade date can be found in the past 2 year range in df2 or df3, then set the condition to 1. For example, for a security code 000001 on trade date 20190319, check if there is a pair in df2 or df3 such that security code = 000001 and payout date or end date is between 20170319 and 20190319. If the requirement is met, then set condition to 1.
for the above example data frames, the expected output should be the first row's condition is 1
The attempts I made:
def check_condition(row):
# Calculate the date two years ago
two_years_ago = row['TRADEDATE'] - pd.DateOffset(years=2)
# Check if the code exists in df2 or df3 within the date range
condition_2 = ((df2['code'] == row['code']) &
(df2['PAYOUTDATE'] >= two_years_ago) &
(df2['PAYOUTDATE'] <= row['TRADEDATE'])).any()
condition_3 = ((df3['code'] == row['code']) &
(df3['ENDDATE'] >= two_years_ago) &
(df3['ENDDATE'] <= row['TRADEDATE'])).any()
# Return 1 if the condition is met, otherwise return 0
return 1 if condition_2 or condition_3 else 0
df1['condition'] = df1.apply(check_condition, axis=1)
Above are the codes I wrote. But when I run it on my local environment, it is extremely slow. I wonder if there are more efficient ways to do this? Thanks!
Instead of using
apply, you can use vectorized operations, which are more efficient:The result will be: