I'm currently building a project in TFX, and during the Transform step I compute the vocabulary for a categorical variable. In later steps (but still during preprocessing) I want to use the length of that vocabulary (i.e. the number of distinct categories) for related transformations. The set of categories is generated by an external process, with items often added (and rarely removed), and the model needs to be repeatedly retrained on up-to-date data. (I'm training a multilabel classifier, and the output set of categories should match the input set, so I'm building a sparse multi-hot vector representation of the labels.) Therefore I cannot hard-code the vocab length at compile time; I need it to be computed when the model is trained. (After training, any additional categories can safely be filed as "unknown.")
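To make the goal concrete, here's roughly what I want to do with that length once I have it. This is just a sketch with placeholder names; `vocab_len` is the integer this whole question is about:

```python
import tensorflow as tf

def multi_hot_labels(label_indices, vocab_len):
    # label_indices: SparseTensor of int64 vocab indices, one row per example.
    dense = tf.sparse.to_dense(label_indices, default_value=-1)
    one_hot = tf.one_hot(dense, depth=vocab_len)  # index -1 becomes an all-zero row
    return tf.reduce_max(one_hot, axis=1)         # [batch, vocab_len] multi-hot
```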
How do I get the length of the vocab?
Here's what I've tried, with `import tensorflow_transform as tft` and `category_inputs` being the input set of values coming from ExampleGen:
```python
import tensorflow_transform as tft

name = 'my_categories'
vocab_uri = tft.vocabulary(category_inputs, vocab_filename=name)  # computes and writes the vocab
vocab_len = tft.analyzers.size(category_inputs, name=name)        # my attempt at its length
```
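For context, those calls live inside my `preprocessing_fn`, roughly like this stripped-down sketch (feature names are placeholders):

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    category_inputs = inputs['my_categories']  # raw string categories
    vocab_uri = tft.vocabulary(category_inputs, vocab_filename='my_categories')
    vocab_len = tft.analyzers.size(category_inputs)  # <-- the problem spot
    category_ids = tft.apply_vocabulary(category_inputs, vocab_uri)
    # ...here I want vocab_len, as a plain int, to build the label vectors...
    return {'category_ids': category_ids}
```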
I'm fairly sure the `tft.vocabulary()` call is right, or close to it, but the `vocab_len` line is probably wrong. The `vocab_uri` variable name is a bit of a misnomer, since what comes back appears to be a relative path... and probably a logical path involving Protobuf. So far I haven't been able to find it stored on disk, but it's possible I've been looking in the wrong subdirectory of "pipeline_output/transform/". It's also possible it doesn't get written out until the Transform is complete.
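For what it's worth, I'd expect that once the Transform completes, the vocabulary is materialized as a plain-text file (one entry per line) somewhere under the transform graph's assets directory, in which case something like this would count it after the fact (the directory layout is my guess, and in any case this happens too late for my use):

```python
import os

# Guessed layout: <transform output>/transform_fn/assets/<vocab_filename>
vocab_path = os.path.join(
    'pipeline_output/transform', 'transform_fn', 'assets', 'my_categories')
with open(vocab_path) as f:
    vocab_len = sum(1 for _ in f)  # one vocabulary entry per line
```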
So I went looking for a vocabulary-length function that seemed compatible. In this iteration I've been working mainly with `tft.analyzers.size()`, but I'm not convinced it's the right function, or that I'm calling it properly. Its output is a Tensor, and I'm having trouble turning that into an actual, instantiated integer within the graph-mode execution of the Transform. Interpretations/manipulations I've tried (a minimal repro of the core issue follows this list):
- Using the raw value (as coded above) throws an error in downstream code that expects an actual numerical value and doesn't know how to handle a Tensor.
- `vocab_len.numpy()` crashes with "AttributeError: 'Tensor' object has no attribute 'numpy'".
- `tensorflow.get_static_value(vocab_len)` returns None.
- `vocab_len.eval()` crashes with "ValueError: Cannot evaluate tensor using `eval()`: No default session is registered. Use `with sess.as_default()` or pass an explicit session to `eval(session=sess)`". I'm using TF 2.3.3, and Sessions don't appear to have existed since TF 1.x; the `tensorflow.Session` class I've seen referenced in docs doesn't appear to exist in TF 2.x.
- Wrapping the `vocab_uri` and `vocab_len` assignments in a `with ctxt.device(ctxt.device_name):` block. (The `ctxt` Context was initialized, had SYNC execution mode, and came from this TensorFlow module.) That also complained that `vocab_len` was just a `PlaceholderWithDefault:0` Tensor.
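As far as I can tell, the underlying issue is that the preprocessing_fn is traced as a graph, so everything in it is symbolic. I can reproduce the same AttributeError outside TFX with nothing but a `tf.function`:

```python
import tensorflow as tf

@tf.function  # traced in graph mode, like the Transform's preprocessing_fn
def demo():
    t = tf.constant(42)
    return t.numpy()  # AttributeError: 'Tensor' object has no attribute 'numpy'

demo()
```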
I've found that I can use `TFTransformOutput` in later components, but it seems unlikely to work when I'm still in the middle of the Transform itself... and even if it did, I don't know what argument to pass it.
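For reference, here's how I'm using it in those later components, where it does work. `transform_graph_dir` is a placeholder for the Transform component's output directory, and counting lines of the vocab file is my own workaround, not an official API as far as I know:

```python
import tensorflow_transform as tft

transform_graph_dir = 'pipeline_output/transform/...'  # Transform's output artifact uri
tft_output = tft.TFTransformOutput(transform_graph_dir)
vocab_path = tft_output.vocabulary_file_by_name('my_categories')
with open(vocab_path) as f:
    vocab_len = sum(1 for _ in f)
```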
I also tried slicing `category_inputs.shape`, but got "ValueError: Cannot convert a partially known TensorShape to a Tensor".
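That one I can also reproduce outside TFX whenever the shape isn't fully known at trace time (my repro; the batch dimension is None):

```python
import tensorflow as tf

@tf.function(input_signature=[tf.TensorSpec([None], tf.string)])
def f(x):
    # x.shape is (None,) at trace time, so this raises:
    # ValueError: Cannot convert a partially known TensorShape to a Tensor
    return tf.constant(x.shape)

f(tf.constant(['a', 'b']))
```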
This feels like it should be a straightforward use case, but I haven't found helpful docs or code.
Ideas?