I'm working on a NER-like sequence-tagging task in TensorFlow and decided to try `tf.data` to see if I can get I/O performance improvements for my model.
At the moment I use `TFRecordWriter` to preprocess and save my training/validation data: each example is a `tf.train.SequenceExample` serialized to a string. I then load it with `tf.data.TFRecordDataset`, parse/shuffle/`padded_batch` it, and get on with training, which works fine.
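For concreteness, a minimal sketch of a pipeline like the one described above (the feature names `tokens`/`labels` and the toy data are my own illustrative assumptions, not from the question):

```python
import os
import tempfile

import tensorflow as tf

# Build one tf.train.SequenceExample per sentence, with parallel
# variable-length feature lists for token ids and tag ids.
def make_sequence_example(token_ids, label_ids):
    ex = tf.train.SequenceExample()
    for t, l in zip(token_ids, label_ids):
        ex.feature_lists.feature_list["tokens"].feature.add().int64_list.value.append(t)
        ex.feature_lists.feature_list["labels"].feature.add().int64_list.value.append(l)
    return ex

path = os.path.join(tempfile.gettempdir(), "train.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    writer.write(make_sequence_example([10, 11, 12], [1, 0, 1]).SerializeToString())
    writer.write(make_sequence_example([20, 21], [0, 1]).SerializeToString())

# Parse a serialized SequenceExample back into dense tensors.
def parse(serialized):
    _, sequence = tf.io.parse_single_sequence_example(
        serialized,
        sequence_features={
            "tokens": tf.io.FixedLenSequenceFeature([], tf.int64),
            "labels": tf.io.FixedLenSequenceFeature([], tf.int64),
        },
    )
    return sequence["tokens"], sequence["labels"]

# Load, parse, shuffle, and pad each batch to its longest sequence.
dataset = (
    tf.data.TFRecordDataset(path)
    .map(parse)
    .shuffle(buffer_size=100)
    .padded_batch(2, padded_shapes=([None], [None]))
)
```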
The question is:

- Is there a convenient way to build the dataset without first serializing the `SequenceExample`s and saving them to a `tfrecord` file? It seems like an unnecessary step when I'll be running predictions on new data. I've tried playing with `tf.data.Dataset.from_tensor_slices()`, but it doesn't seem suitable in this scenario because the inputs are sequences of different lengths that are not padded yet.
It may be possible to use `tf.data.Dataset.from_generator()` for this case. Suppose each of your examples has two features, of which the second represents variable-length sequential data; you can convert them to a `tf.data.Dataset` by wrapping them in a Python generator that yields one example at a time.
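A minimal sketch of this approach (the toy data and feature layout below are illustrative assumptions, not from the original answer):

```python
import tensorflow as tf

# Hypothetical toy data: each example is a scalar feature plus a
# variable-length sequence of ids.
examples = [
    (1, [10, 11, 12]),
    (2, [20, 21]),
    (3, [30, 31, 32, 33]),
]

def generator():
    for scalar, sequence in examples:
        yield scalar, sequence

# The sequence dimension is declared as None so elements may differ
# in length; no up-front padding or serialization is needed.
dataset = tf.data.Dataset.from_generator(
    generator,
    output_signature=(
        tf.TensorSpec(shape=(), dtype=tf.int32),
        tf.TensorSpec(shape=(None,), dtype=tf.int32),
    ),
)

# Pad each batch to the length of its longest sequence, just as with
# the TFRecord-based pipeline.
batched = dataset.padded_batch(2, padded_shapes=((), (None,)))

for scalars, sequences in batched:
    print(scalars.numpy(), sequences.numpy())
```

Since the padding happens per batch in `padded_batch`, shorter sequences are only padded up to the longest sequence in their own batch, which keeps the approach workable for prediction on fresh, unpadded inputs.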