Data Pipeline for Quantitative Research

I want to develop more predictive variables from my data. There are two levels (types) of variables that I am having trouble with:

  1. Variables computed directly on the Consolidated Quotes and Trades (CQT) data, which contains every single tick in the stock market.
  2. Variables generated during the downsampling process (5-minute time bars).

Category 1 is very straightforward: to develop new variables off this dataset, I simply read all of the observations in a streaming fashion, calculate the variables (for example, the Lee and Ready (1991) tick test), and then write the new column along with the existing data.
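As a concrete example, a minimal sketch of that tick test in polars looks roughly like the following. The column names ("price", "datetime", "ticker", "date"), the "cqt/*.parquet" file layout, and the new "trade_sign" column are placeholders for illustration, not my exact schema:

import polars as pl

# Tick test: +1 on an uptick, -1 on a downtick; zero ticks inherit the previous sign.
signed = (
    pl.scan_parquet("cqt/*.parquet")            # lazy scan over the raw CQT files (path assumed)
      .sort(["ticker", "date", "datetime"])
      .with_columns(
          pl.col("price").diff().sign()
            .over(["ticker", "date"])
            .alias("tick_dir")                  # +1 / 0 / -1 per tick, per ticker and date
      )
      .with_columns(
          pl.when(pl.col("tick_dir") != 0)
            .then(pl.col("tick_dir"))
            .otherwise(None)
            .forward_fill()                     # zero ticks carry the last non-zero classification
            .over(["ticker", "date"])
            .alias("trade_sign")
      )
      .collect()
)

signed.write_parquet("cqt_with_trade_sign.parquet")   # write the new column along with the existing data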

For category 2 I have to perform a downsampling process similar to this:

import polars as pl

# "datetime" must be sorted within each (ticker, date) group for groupby_dynamic
ctq_sample.groupby_dynamic("datetime", every="5m", by=["ticker", "date"]).agg(
    pl.count().alias("number_of_trades"),
    pl.col("price").first().alias("open"),
    pl.col("price").max().alias("high"),
    pl.col("price").min().alias("low"),
    pl.col("price").last().alias("close"),
    pl.col("size").sum().alias("volume"),
)

The above code calculates OHLCV (category 2 variables) from the CQT dataset. The storage mechanism is similar to category 1: the results are written to local parquet files along with the downsampled data. However, I am conducting active research and keep developing more predictive variables in both categories. Each time I want to add a variable derived from the aggregation (say, the number of buy trades per 5 minutes, as in the sketch below), I have to (1) read all the CQT files, (2) perform the downsampling, (3) calculate the variable, and (4) store the result. This process is tedious and often takes a very long time.
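The new variable itself is just one more expression in the aggregation. A rough sketch, where "trade_sign" is the assumed pre-computed buy/sell indicator from the tick-test example above:

ctq_sample.groupby_dynamic("datetime", every="5m", by=["ticker", "date"]).agg(
    pl.count().alias("number_of_trades"),
    (pl.col("trade_sign") == 1).sum().alias("number_of_buy_trades"),  # the new category 2 variable
    pl.col("price").first().alias("open"),
    pl.col("price").max().alias("high"),
    pl.col("price").min().alias("low"),
    pl.col("price").last().alias("close"),
    pl.col("size").sum().alias("volume"),
)

The expression is trivial; the expensive part is steps 1 and 2, which have to be repeated for every new variable.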

I tried to work around the step of re-reading the CQT files by pickling the groupby object:

import pickle 

pickle_object = ctq_sample.groupby_dynamic("datetime", every="5m", by=["ticker", "date"])

with open("test.pickle", "wb") as file:
    pickle.dump(pickle_object, file)
with open("test.pickle", "rb") as file:
    test_obj = pickle.load(file)

test_obj.agg(
    pl.count().alias("number_of_trades"),
    pl.col("price").first().alias("open"),
    pl.col("price").max().alias("high"),
    pl.col("price").min().alias("low"),
    pl.col("price").last().alias("close"),
    pl.col("size").sum().alias("volume"),
)

The pickle round trip takes about 10 minutes, whereas reading and downsampling takes only 10 seconds for 3 days of data. I have 10 years of CQT data (about 60 million rows per day, 252 trading days per year), and this research data pipeline does not feel feasible for the work I want to do. Do you have a better solution?
