I'm working on a NER-like sequence-tagging task in TensorFlow and decided to try `tf.data` to see if I can get I/O performance improvements for my model.
At the moment I use `TFRecordWriter` to preprocess and save my training/validation data: each example is a `tf.train.SequenceExample` serialized to a string. I then load it with `tf.data.TFRecordDataset`, parse/shuffle/`padded_batch` it, and get on with training, which works fine.
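For concreteness, a minimal sketch of a pipeline like the one described above (the feature names `tokens`/`labels` and the toy data are my own illustrative assumptions, not from the question):

```python
import os
import tempfile

import tensorflow as tf

# Build one tf.train.SequenceExample per sentence, with parallel
# variable-length feature lists for token ids and tag ids.
def make_sequence_example(token_ids, label_ids):
    ex = tf.train.SequenceExample()
    for t, l in zip(token_ids, label_ids):
        ex.feature_lists.feature_list["tokens"].feature.add().int64_list.value.append(t)
        ex.feature_lists.feature_list["labels"].feature.add().int64_list.value.append(l)
    return ex

path = os.path.join(tempfile.gettempdir(), "train.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    writer.write(make_sequence_example([10, 11, 12], [1, 0, 1]).SerializeToString())
    writer.write(make_sequence_example([20, 21], [0, 1]).SerializeToString())

# Parse a serialized SequenceExample back into dense tensors.
def parse(serialized):
    _, sequence = tf.io.parse_single_sequence_example(
        serialized,
        sequence_features={
            "tokens": tf.io.FixedLenSequenceFeature([], tf.int64),
            "labels": tf.io.FixedLenSequenceFeature([], tf.int64),
        },
    )
    return sequence["tokens"], sequence["labels"]

# Load, parse, shuffle, and pad each batch to its longest sequence.
dataset = (
    tf.data.TFRecordDataset(path)
    .map(parse)
    .shuffle(buffer_size=100)
    .padded_batch(2, padded_shapes=([None], [None]))
)
```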
The question is:

- Is there a convenient way to build the dataset without first serializing the `SequenceExample`s and saving them to a `tfrecord` file? It seems like an unnecessary step when I'll be running predictions on new data. I've tried playing with `tf.data.Dataset.from_tensor_slices()`, but it doesn't seem suitable in this scenario because the inputs are sequences of different lengths that are not padded yet.
It may be possible to use `tf.data.Dataset.from_generator()` for this case. Suppose each of your examples has two features, of which the second represents variable-length sequential data; you can convert them to a `tf.data.Dataset` by wrapping them in a Python generator that yields one example at a time.
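A minimal sketch of this approach (the toy data and feature layout below are illustrative assumptions, not from the original answer):

```python
import tensorflow as tf

# Hypothetical toy data: each example is a scalar feature plus a
# variable-length sequence of ids.
examples = [
    (1, [10, 11, 12]),
    (2, [20, 21]),
    (3, [30, 31, 32, 33]),
]

def generator():
    for scalar, sequence in examples:
        yield scalar, sequence

# The sequence dimension is declared as None so elements may differ
# in length; no up-front padding or serialization is needed.
dataset = tf.data.Dataset.from_generator(
    generator,
    output_signature=(
        tf.TensorSpec(shape=(), dtype=tf.int32),
        tf.TensorSpec(shape=(None,), dtype=tf.int32),
    ),
)

# Pad each batch to the length of its longest sequence, just as with
# the TFRecord-based pipeline.
batched = dataset.padded_batch(2, padded_shapes=((), (None,)))

for scalars, sequences in batched:
    print(scalars.numpy(), sequences.numpy())
```

Since the padding happens per batch in `padded_batch`, shorter sequences are only padded up to the longest sequence in their own batch, which keeps the approach workable for prediction on fresh, unpadded inputs.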