I'm currently building a project in TFX, and during the Transform step I compute the vocabulary for a categorical variable. In later steps (but still during preprocessing) I want to use the length of that vocabulary (i.e. the number of distinct categories) for related transformations. The set of categories is generated by an external process, with items often added (and rarely removed), and the model needs to be repeatedly retrained on up-to-date data. (I'm training a multilabel classifier, and the output set of categories should match the input set, so I'm building a sparse multi-hot vector representation of the labels.) Therefore I cannot hard-code the vocab length at compile time; I need it to be computed when the model is trained. (After training, any additional categories can safely be filed as "unknown.")
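To make the goal concrete, here's roughly what I want to do with that length once I have it. This is just a sketch with placeholder names; `vocab_len` is the integer this whole question is about:

```python
import tensorflow as tf

def multi_hot_labels(label_indices, vocab_len):
    # label_indices: SparseTensor of int64 vocab indices, one row per example.
    dense = tf.sparse.to_dense(label_indices, default_value=-1)
    one_hot = tf.one_hot(dense, depth=vocab_len)  # index -1 becomes an all-zero row
    return tf.reduce_max(one_hot, axis=1)         # [batch, vocab_len] multi-hot
```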
How do I get the length of the vocab?
Here's what I've tried, with `import tensorflow_transform as tft` and `category_inputs` being the input set of values coming from ExampleGen:
```python
import tensorflow_transform as tft

name = 'my_categories'
vocab_uri = tft.vocabulary(category_inputs, vocab_filename=name)  # computes and writes the vocab
vocab_len = tft.analyzers.size(category_inputs, name=name)        # my attempt at its length
```
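For context, those calls live inside my `preprocessing_fn`, roughly like this stripped-down sketch (feature names are placeholders):

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    category_inputs = inputs['my_categories']  # raw string categories
    vocab_uri = tft.vocabulary(category_inputs, vocab_filename='my_categories')
    vocab_len = tft.analyzers.size(category_inputs)  # <-- the problem spot
    category_ids = tft.apply_vocabulary(category_inputs, vocab_uri)
    # ...here I want vocab_len, as a plain int, to build the label vectors...
    return {'category_ids': category_ids}
```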
I'm fairly sure the `tft.vocabulary()` call is right, or close to it, but the `vocab_len` line is probably wrong. The `vocab_uri` variable name is a bit of a misnomer, since what comes back appears to be a relative path... and probably a logical path involving Protobuf. So far I haven't been able to find it stored on disk, but it's possible I've been looking in the wrong subdirectory of "pipeline_output/transform/". It's also possible it doesn't get written out until the Transform is complete.
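For what it's worth, I'd expect that once the Transform completes, the vocabulary is materialized as a plain-text file (one entry per line) somewhere under the transform graph's assets directory, in which case something like this would count it after the fact (the directory layout is my guess, and in any case this happens too late for my use):

```python
import os

# Guessed layout: <transform output>/transform_fn/assets/<vocab_filename>
vocab_path = os.path.join(
    'pipeline_output/transform', 'transform_fn', 'assets', 'my_categories')
with open(vocab_path) as f:
    vocab_len = sum(1 for _ in f)  # one vocabulary entry per line
```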
So I went looking for a vocabulary-length function that seemed compatible. In this iteration I've been working mainly with `tft.analyzers.size()`, but I'm not convinced it's the right function, or that I'm calling it properly. Its output is a Tensor, and I'm having trouble turning that into an actual, instantiated integer within the graph-mode execution of the Transform. Interpretations/manipulations I've tried (a minimal repro of the core issue follows this list):
- Using the raw value (as coded above) throws an error in downstream code that expects an actual numerical value and doesn't know how to handle a Tensor.
- `vocab_len.numpy()` crashes with "AttributeError: 'Tensor' object has no attribute 'numpy'".
- `tensorflow.get_static_value(vocab_len)` returns None.
- `vocab_len.eval()` crashes with "ValueError: Cannot evaluate tensor using `eval()`: No default session is registered. Use `with sess.as_default()` or pass an explicit session to `eval(session=sess)`". I'm using TF 2.3.3, and Sessions don't appear to have existed since TF 1.x; the `tensorflow.Session` class I've seen referenced in docs doesn't appear to exist in TF 2.x.
- Wrapping the `vocab_uri` and `vocab_len` assignments in a `with ctxt.device(ctxt.device_name):` block. (The `ctxt` Context was initialized, had SYNC execution mode, and came from this TensorFlow module.) That also complained that `vocab_len` was just a `PlaceholderWithDefault:0` Tensor.
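As far as I can tell, the underlying issue is that the preprocessing_fn is traced as a graph, so everything in it is symbolic. I can reproduce the same AttributeError outside TFX with nothing but a `tf.function`:

```python
import tensorflow as tf

@tf.function  # traced in graph mode, like the Transform's preprocessing_fn
def demo():
    t = tf.constant(42)
    return t.numpy()  # AttributeError: 'Tensor' object has no attribute 'numpy'

demo()
```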
I've found that I can use `TFTransformOutput` in later components, but it seems unlikely to work when I'm still in the middle of the Transform itself... and even if it did, I don't know what argument to pass it.
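For reference, here's how I'm using it in those later components, where it does work. `transform_graph_dir` is a placeholder for the Transform component's output directory, and counting lines of the vocab file is my own workaround, not an official API as far as I know:

```python
import tensorflow_transform as tft

transform_graph_dir = 'pipeline_output/transform/...'  # Transform's output artifact uri
tft_output = tft.TFTransformOutput(transform_graph_dir)
vocab_path = tft_output.vocabulary_file_by_name('my_categories')
with open(vocab_path) as f:
    vocab_len = sum(1 for _ in f)
```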
I also tried slicing `category_inputs.shape`, but got "ValueError: Cannot convert a partially known TensorShape to a Tensor".
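That one I can also reproduce outside TFX whenever the shape isn't fully known at trace time (my repro; the batch dimension is None):

```python
import tensorflow as tf

@tf.function(input_signature=[tf.TensorSpec([None], tf.string)])
def f(x):
    # x.shape is (None,) at trace time, so this raises:
    # ValueError: Cannot convert a partially known TensorShape to a Tensor
    return tf.constant(x.shape)

f(tf.constant(['a', 'b']))
```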
This feels like it should be a straightforward use case, but I haven't found helpful docs or code.
Ideas?