Multi-hot encoding in TensorFlow (Google Cloud Machine Learning, tf estimator API)

I have a feature like a post tag. For each observation, the post_tag feature might be a selection of tags like "oscars,brad-pitt,awards". I'd like to pass this as a feature to a TensorFlow model built using the estimator API and running on Google Cloud Machine Learning (as per this example but adapted for my own problem).

I'm just not sure how to transform this into a multi-hot encoded feature in tensorflow. I'm trying to get something similar to MultiLabelBinarizer in sklearn ideally.

I think this is sort of related but not quite what I need.

So say I have data like:

id,post_tag
1,[oscars,brad-pitt,awards]
2,[oscars,film,reviews]
3,[matt-damon,bourne]

I want to featurize it, as part of preprocessing within tensorflow, as:

id,post_tag_oscars,post_tag_brad_pitt,post_tag_awards,post_tag_film,post_tag_reviews,post_tag_matt_damon,post_tag_bourne
1,1,1,1,0,0,0,0
2,1,0,0,1,1,0,0
3,0,0,0,0,0,1,1

Update

Suppose post_tag_list is a string like "oscars,brad-pitt,awards" in the input CSV. If I then try:

INPUT_COLUMNS = [
...
tf.contrib.lookup.HashTable(tf.contrib.lookup.KeyValueTensorInitializer('post_tag_list',
                                            tf.range(0, 10, dtype=tf.int64),
                                            tf.string, tf.int64),
                           default_value=10, name='post_tag_list'),
...]

I get this error:

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/andrew_maguire/localDev/codeBase/pmc-analytical-data-mart/clickmodel/trainer/task.py", line 4, in <module>
    import model
  File "trainer/model.py", line 49, in <module>
    default_value=10, name='post_tag_list'),
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/lookup_ops.py", line 276, in __init__
    super(HashTable, self).__init__(table_ref, default_value, initializer)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/lookup_ops.py", line 162, in __init__
    self._init = initializer.initialize(self)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/lookup_ops.py", line 348, in initialize
    table.table_ref, self._keys, self._values, name=scope)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_lookup_ops.py", line 205, in _initialize_table_v2
    values=values, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2632, in create_op
    set_shapes_for_outputs(ret)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1911, in set_shapes_for_outputs
    shapes = shape_func(op)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1861, in call_with_requiring
    return call_cpp_shape_fn(op, require_shape_fn=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/common_shapes.py", line 595, in call_cpp_shape_fn
    require_shape_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/common_shapes.py", line 659, in _call_cpp_shape_fn_impl
    raise ValueError(err.message)
ValueError: Shape must be rank 1 but is rank 0 for 'key_value_init' (op: 'InitializeTableV2') with input shapes: [], [], [10].

If I were to pad each post_tag_list to be like "oscars,brad-pitt,awards,OTHER,OTHER,OTHER,OTHER,OTHER,OTHER,OTHER" so it's always 10 long, would that be a potential solution here?

Or do I need to somehow know the full set of post tags I might ever pass in (which is ill-defined, as new ones are created all the time)?

Nikhil Kothari

Have you tried tf.contrib.lookup.HashTable?

Here is an example usage from my own use: https://github.com/TensorLab/tensorfx/blob/master/src/data/_transforms.py#L160 and a made up example snippet based on that:

import tensorflow as tf
session = tf.InteractiveSession()

entries = ['red', 'blue', 'green']
table = tf.contrib.lookup.HashTable(
    tf.contrib.lookup.KeyValueTensorInitializer(entries,
                                                tf.range(0, len(entries), dtype=tf.int64),
                                                tf.string, tf.int64),
    default_value=len(entries), name='entries')
tf.tables_initializer().run()

value = tf.constant([['blue', 'red'], ['green', 'red']])
print(table.lookup(value).eval())

I believe lookup works for both regular tensors and SparseTensors (you might end up with the latter given your variable length list of values).
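
To make the lookup semantics concrete without TensorFlow, here is a minimal plain-Python sketch of what the table above computes: each known string maps to its position in entries, and anything else maps to the default value len(entries). The helper name lookup_ids is hypothetical, not a TensorFlow API. Note that the keys are a list of the vocabulary entries themselves. Passing a single string such as a column name gives a rank-0 keys tensor, which appears to be what the "Shape must be rank 1 but is rank 0" error in the question is complaining about.

```python
# Plain-Python sketch of the HashTable lookup semantics (not TensorFlow code).
entries = ['red', 'blue', 'green']

# Key -> index, mirroring KeyValueTensorInitializer(entries, range(len(entries))).
table = {key: index for index, key in enumerate(entries)}
default_value = len(entries)  # id returned for any string not in `entries`


def lookup_ids(rows):
    """Map every string in a nested list to its id (hypothetical helper)."""
    return [[table.get(value, default_value) for value in row] for row in rows]


print(lookup_ids([['blue', 'red'], ['green', 'red']]))  # [[1, 0], [2, 0]]
print(lookup_ids([['purple', 'red']]))                  # [[3, 0]]
```

The default value acts as a catch-all id, so a tag that was not in the vocabulary at training time still maps to a valid index.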

rhaertel80

There are a couple of issues to tackle here. First, there is the question of a tag set that keeps growing. Second, you want to know how to parse variable-length data from CSV.

To handle a growing tag set, you'll need to use out-of-vocabulary (OOV) buckets or feature hashing. Nikhil showed the latter, so I'll show the former.
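
For comparison, here is a minimal sketch of the feature-hashing idea in plain Python: every tag is hashed into one of a fixed number of buckets, so new tags never require a vocabulary update, at the cost of possible collisions. The bucket count and the use of zlib.crc32 are assumptions for illustration only; TensorFlow uses its own hash function.

```python
import zlib

NUM_BUCKETS = 16  # assumed size; larger values mean fewer collisions


def hash_bucket(tag, num_buckets=NUM_BUCKETS):
    """Deterministically map a tag to a bucket id in [0, num_buckets)."""
    return zlib.crc32(tag.encode("utf-8")) % num_buckets


def multi_hot_hashed(tags, num_buckets=NUM_BUCKETS):
    """Return a fixed-width 0/1 list with a 1 in each tag's bucket."""
    row = [0] * num_buckets
    for tag in tags:
        row[hash_bucket(tag, num_buckets)] = 1
    return row


row = multi_hot_hashed(["oscars", "brad-pitt", "awards"])
# The width is always NUM_BUCKETS; the number of ones can be below 3 if tags collide.
print(len(row), sum(row))
```

The trade-off against an OOV bucket is that hashing keeps some signal for unseen tags (they are spread over buckets), while an OOV bucket lumps all unseen tags into one id but never confuses two known tags.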

How to parse variable-length data from CSV

Let's suppose the column with variable length data uses | as a separator, e.g.

csv = [
  "1,oscars|brad-pitt|awards",
  "2,oscars|film|reviews",
  "3,matt-damon|bourne",
]

You can use code like this to convert those to a SparseTensor.

import tensorflow as tf

# Purposefully omitting "bourne" to demonstrate OOV mappings.
TAG_SET = ["oscars", "brad-pitt", "awards", "film", "reviews", "matt-damon"]
NUM_OOV = 1

def sparse_from_csv(csv):
  ids, post_tags_str = tf.decode_csv(csv, [[-1], [""]])
  table = tf.contrib.lookup.index_table_from_tensor(
      mapping=TAG_SET, num_oov_buckets=NUM_OOV, default_value=-1)
  split_tags = tf.string_split(post_tags_str, "|")
  return ids, tf.SparseTensor(
      indices=split_tags.indices,
      values=table.lookup(split_tags.values),
      dense_shape=split_tags.dense_shape)

# Optionally create an embedding for this.
TAG_EMBEDDING_DIM = 3

ids, tags = sparse_from_csv(csv)

embedding_params = tf.Variable(tf.truncated_normal([len(TAG_SET) + NUM_OOV, TAG_EMBEDDING_DIM]))
embedded_tags = tf.nn.embedding_lookup_sparse(embedding_params, sp_ids=tags, sp_weights=None)

# Test it out
with tf.Session() as s:
  s.run([tf.global_variables_initializer(), tf.tables_initializer()])
  print(s.run([ids, embedded_tags]))

You'll see output like so (since the embedding is random, exact numbers will change):

[array([1, 2, 3], dtype=int32), array([[ 0.16852427,  0.26074541, -0.4237918 ],
       [-0.38550434,  0.32314634,  0.858069  ],
       [ 0.19339906, -0.24429649, -0.08393878]], dtype=float32)]

You can see that each column in the CSV is represented as an ndarray, where the tags are now 3-dimensional embeddings.
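
If you want the multi-hot rows from the question rather than an embedding, the same integer ids can be scattered into a dense 0/1 matrix. Below is a framework-independent sketch in plain Python. The helper name multi_hot is hypothetical, and the last column plays the role of the single OOV bucket. In TensorFlow 1.x I believe tf.sparse_to_indicator(tags, len(TAG_SET) + NUM_OOV) does the equivalent on the SparseTensor, but check that against your version.

```python
TAG_SET = ["oscars", "brad-pitt", "awards", "film", "reviews", "matt-damon"]
NUM_OOV = 1  # one extra column shared by all unknown tags

TAG_TO_ID = {tag: i for i, tag in enumerate(TAG_SET)}
OOV_ID = len(TAG_SET)  # with a single OOV bucket, every unknown tag lands here


def multi_hot(tag_string, sep="|"):
    """Turn 'a|b|c' into a 0/1 list of width len(TAG_SET) + NUM_OOV."""
    row = [0] * (len(TAG_SET) + NUM_OOV)
    for tag in tag_string.split(sep):
        if tag:  # skip empty strings produced by empty fields
            row[TAG_TO_ID.get(tag, OOV_ID)] = 1
    return row


for line in ["1,oscars|brad-pitt|awards", "2,oscars|film|reviews", "3,matt-damon|bourne"]:
    post_id, tags = line.split(",", 1)
    print(post_id, multi_hot(tags))
# 1 [1, 1, 1, 0, 0, 0, 0]
# 2 [1, 0, 0, 1, 1, 0, 0]
# 3 [0, 0, 0, 0, 0, 1, 1]
```

Here "bourne" is not in TAG_SET, so it sets the OOV column instead of getting a column of its own.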