So I'm working with time series data, forex data to be more precise. They are in the form of
<DATE> <OPEN> <HIGH> <LOW> <CLOSE>
2023.01.03 1.20782 1.20782 1.20775 1.20775
2023.01.03 1.20771 1.20771 1.20771 1.20771
2023.01.03 1.20739 1.20739 1.20739 1.20731
2023.01.03 1.20742 1.20742 1.20742 1.20742
2023.01.03 1.20736 1.20736 1.20736 1.20736
2023.01.03 1.20744 1.20744 1.20744 1.20744
2023.01.03 1.20762 1.20762 1.20762 1.20762
I need to add a new column 'target' of true and false values. When we take a row, we check up all the subsequent rows on the same day. If any of them has a high of 100 points more than the current row the target is put to true, but if while checking the subsequent row we find one that has a low of 10 points less than the current row the target is put to false.
My solution to the problem is to iterate over all the rows saving them in a list as open trade and while iterating comparing the rows in the list of trade with the current row to check if the trade end with a stop loss or a take profit. Here is my code :
def add_target_data(df):
new_df = df.copy()
new_df['target'] = False
grouped = new_df.groupby('date')
print("Number of days", len(grouped.groups))
for _date, day in grouped:
#print('Day:', _date)
trades = []
update_indices = []
for index, row in day.iterrows():
# trades = [_trade for _trade in trades if _trade['stop_loss'] < row['low']]
for trade in trades:
if trade['stop_loss'] >= row['low']:
trades.remove(trade)
elif trade['take_profit'] <= row['high']:
#data.loc[trade['index'], 'target'] = True
update_indices.append(trade['index'])
trades.remove(trade)
else:
pass
trade = {
'index': index,
'stop_loss': row['close'] - 0.0001,
'take_profit': row['close'] + 0.001
}
trades.append(trade)
new_df.loc[update_indices, 'target'] = True
return new_df
After running this code on a data frame of about 100000 rows, here is what we should get :
date open high low close target
index
2023-01-02 08:41:00 2023-01-02 1.20672 1.20673 1.20667 1.20667 False
2023-01-02 08:44:00 2023-01-02 1.20665 1.20667 1.20664 1.20667 True
2023-01-02 08:46:00 2023-01-02 1.20665 1.20666 1.20665 1.20666 False
2023-01-02 09:06:00 2023-01-02 1.20627 1.20635 1.20627 1.20635 False
2023-01-02 09:08:00 2023-01-02 1.20637 1.20638 1.20629 1.20633 True
... ... ... ... ... ... ...
2023-04-11 16:42:00 2023-04-11 1.24262 1.24267 1.24240 1.24240 True
2023-04-11 16:45:00 2023-04-11 1.24276 1.24283 1.24257 1.24271 False
2023-04-11 16:46:00 2023-04-11 1.24272 1.24282 1.24264 1.24278 False
2023-04-11 16:58:00 2023-04-11 1.24295 1.24302 1.24285 1.24289 False
2023-04-11 17:46:00 2023-04-11 1.24308 1.24308 1.24253 1.24258 True
(please this sample is just an example, in reality the rows that get a target value of true constitute only about 10% of the entire data frame)
This solution, works fine but when the data reaches few millions rows it becomes a pain to run. Does anybody knows how to do it faster and more efficiently?