Overlapping window using tf.data pipelines


I'm trying to transform data read from CSV files using tf.data pipelines with overlapping windows, and it's not working as expected. The documentation does not clearly explain how to handle this case. The columns of the CSV files are 'timestamp', 'open', 'high', 'low', 'close', 'volume'.

dataset = tf.data.experimental.make_csv_dataset(
    file_pattern="/path/stock/*1min*.csv",
    batch_size=1,
    num_epochs=1,
    shuffle=False,
    header=False,
    column_names=['timestamp','open','high', 'low', 'close', 'volume'],
    column_defaults=[tf.string, tf.float32, tf.float32, tf.float32, tf.float32, tf.float32]
).window(
    size=5,  # Number of rows per window
    shift=1,  # Stride for overlapping windows
    stride=1
)

This produces the following structure:

-WindowDataset
--OrderedDict
---VariantDataset
----Tensor (single element)
----Tensor...

This prevents me from transforming the data in a simple way, because OrderedDict has no batch method, so I cannot flatten the windows the way the documentation shows.
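The structure described above can be checked on a small in-memory dataset. This is a minimal sketch, not the CSV pipeline itself: the toy `rows` dataset and its values are assumptions made up for illustration. It suggests that when the elements are dicts, `window()` keeps the dict on the outside and turns each column into its own nested dataset.

```python
import tensorflow as tf

# Hypothetical stand-in for the CSV rows: each element is a dict of scalars.
rows = tf.data.Dataset.from_tensor_slices({
    "open": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "close": [1.5, 2.5, 3.5, 4.5, 5.5, 6.5],
})

windows = rows.window(size=5, shift=1, stride=1)

# The element spec is a dict whose values describe nested datasets,
# one per column, rather than a single dataset of dicts.
spec = windows.element_spec
for name, column_spec in spec.items():
    print(name, type(column_spec).__name__)
```

Since each window arrives as a dict of datasets, any function passed to `flat_map` receives that dict, which is why calling `.batch` on it directly cannot work.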

dataset = tf.data.experimental.make_csv_dataset(
    file_pattern="/path/stock/*1min*.csv",
    batch_size=1,
    num_epochs=1,
    shuffle=False,
    header=False,
    column_names=['timestamp','open','high', 'low', 'close', 'volume'],
    column_defaults=[tf.string, tf.float32, tf.float32, tf.float32, tf.float32, tf.float32]
).window(
    size=5,  # Number of rows per window
    shift=1,  # Stride for overlapping windows
    stride=1
).flat_map(lambda window: window.batch(5))

This gives the following error:

AttributeError                            Traceback (most recent call last)

<ipython-input-47-46d1550f08a0> in <cell line: 1>()
     11     shift=1,  # Stride for overlapping windows
     12     stride=1
---> 13 ).flat_map(lambda window: window.batch(5))

19 frames

/tmp/__autograph_generated_filersrgq3km.py in <lambda>(lscope)
      3 
      4     def inner_factory(ag__):
----> 5         tf__lam = lambda window: ag__.with_function_scope(lambda lscope: ag__.converted_call(window.batch, (5,), None, lscope), 'lscope', ag__.STD)
      6         return tf__lam
      7     return inner_factory

AttributeError: in user code:

    File "<ipython-input-47-46d1550f08a0>", line 13, in None  *
        lambda window: window.batch(5)

    AttributeError: 'collections.OrderedDict' object has no attribute 'batch'

If I try to batch the datasets inside the OrderedDict, I get the following error:

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-59-16e14170c082> in <cell line: 25>()
     23 
     24 
---> 25 data = dataset.map(extract)
     26 
     27 

35 frames

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/tensor.py in __getattr__(self, name)
    259         tf.experimental.numpy.experimental_enable_numpy_behavior()
    260       """)
--> 261     self.__getattribute__(name)
    262 
    263   @property

AttributeError: in user code:

    File "<ipython-input-59-16e14170c082>", line 3, in extract  *
        opens = data.get('open').flat_map(lambda x: x.batch(5))

    AttributeError: 'SymbolicTensor' object has no attribute 'batch'

This is becoming extremely confusing.

What would be the right way to transform this structure so that I can later apply further transformations to build a time series dataset?
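One possible direction, offered as a hedged sketch rather than a confirmed fix for the CSV pipeline: batch each column's nested dataset separately and recombine them with `tf.data.Dataset.zip` inside `flat_map`. The toy `rows` dataset, the `WINDOW` constant, and the `window_to_batch` helper are hypothetical names introduced here for illustration. With `make_csv_dataset(batch_size=1, ...)` each column has shape `(1,)`, so an `.unbatch()` before `.window(...)` would presumably be needed to get scalar rows like the ones below.

```python
import tensorflow as tf

WINDOW = 5  # rows per window, matching size=5 in the question

# Hypothetical stand-in for the CSV rows: each element is a dict of scalars.
rows = tf.data.Dataset.from_tensor_slices({
    "open": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    "close": [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5],
})


def window_to_batch(window):
    # `window` is a dict {column name: nested dataset}. Batch every column
    # to WINDOW rows, then zip the columns back into one dataset of dicts.
    return tf.data.Dataset.zip({
        name: column.batch(WINDOW, drop_remainder=True)
        for name, column in window.items()
    })


windowed = rows.window(
    size=WINDOW, shift=1, stride=1, drop_remainder=True
).flat_map(window_to_batch)

# Each element is now a dict of tensors with shape (WINDOW,).
result = list(windowed)
for element in result:
    print(element["open"].numpy(), element["close"].numpy())
```

Setting `drop_remainder=True` on `window` discards the shorter trailing windows, so every element has the same length; whether that is desirable depends on the downstream model.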
