I want to infer a schema with TensorFlow Data Validation (tfdv) based on a pandas DataFrame of the training data. The DataFrame contains a column with a multivalent feature, where multiple values (or None) of the feature can be present at the same time.
Given the following dataframe:
import pandas as pd

df = pd.DataFrame([{'feat_1': 13, 'feat_2': 'AA, BB', 'feat_3': 'X'},
                   {'feat_1': 7, 'feat_2': 'AA', 'feat_3': 'Y'},
                   {'feat_1': 7, 'feat_2': None, 'feat_3': None}])
inferring and displaying the schema results in:
Thus, tfdv treats the 'feat_2' values as a single string instead of splitting them at the ',' to produce a domain of 'AA', 'BB':
If I save the values of the feature as a list, e.g. ['AA', 'BB'], the schema inference throws an error:
ArrowTypeError: ("Expected bytes, got a 'list' object", 'Conversion failed for column feat_2 with type object')
Is there any way to achieve this with tfdv?


A String will be interpreted as a String. Regarding your issue with the List, it might be related to an existing issue; I could not find anything more recent. Here is a workaround:
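One possible workaround, sketched here under assumptions: split the strings yourself, write each row as a tf.train.Example in which feat_2 holds several values, and compute the statistics from the resulting TFRecord file rather than from the DataFrame. The helper names (split_multivalent, row_to_features, to_tf_example, infer_schema_from_dataframe) and the output path train.tfrecord are hypothetical and chosen only for illustration. The tfdv and TensorFlow calls are the public generate_statistics_from_tfrecord, infer_schema, tf.train.Example and tf.io.TFRecordWriter APIs, and the sketch has not been checked against every tfdv version.

```python
import math
import numbers


def is_missing(value):
    """True for None and float NaN, which is how pandas marks missing cells."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def split_multivalent(value, sep=","):
    """Split 'AA, BB' into ['AA', 'BB']; return [] for a missing value."""
    if is_missing(value):
        return []
    return [part.strip() for part in str(value).split(sep) if part.strip()]


def row_to_features(row, multivalent=("feat_2",)):
    """Map one row (a dict) to {name: list of values}, omitting missing features."""
    features = {}
    for name, value in row.items():
        if name in multivalent:
            values = split_multivalent(value)
        else:
            values = [] if is_missing(value) else [value]
        if values:
            features[name] = values
    return features


def to_tf_example(features):
    """Build a tf.train.Example; a multivalent feature becomes a multi-element list."""
    import tensorflow as tf  # imported lazily: only this step needs TensorFlow

    feature_map = {}
    for name, values in features.items():
        if isinstance(values[0], numbers.Integral):
            feature_map[name] = tf.train.Feature(
                int64_list=tf.train.Int64List(value=[int(v) for v in values]))
        elif isinstance(values[0], numbers.Real):
            feature_map[name] = tf.train.Feature(
                float_list=tf.train.FloatList(value=[float(v) for v in values]))
        else:
            feature_map[name] = tf.train.Feature(
                bytes_list=tf.train.BytesList(
                    value=[str(v).encode("utf-8") for v in values]))
    return tf.train.Example(features=tf.train.Features(feature=feature_map))


def infer_schema_from_dataframe(df, path="train.tfrecord", multivalent=("feat_2",)):
    """Write df as TFRecords (path is an assumed location) and infer a schema."""
    import tensorflow as tf
    import tensorflow_data_validation as tfdv

    with tf.io.TFRecordWriter(path) as writer:
        for row in df.to_dict("records"):
            example = to_tf_example(row_to_features(row, multivalent))
            writer.write(example.SerializeToString())

    stats = tfdv.generate_statistics_from_tfrecord(data_location=path)
    return tfdv.infer_schema(stats)
```

With the DataFrame from the question, schema = infer_schema_from_dataframe(df) followed by tfdv.display_schema(schema) should then treat 'AA' and 'BB' as separate values of feat_2 instead of the single string 'AA, BB'. Missing values are represented by leaving the feature out of the Example, which is how tfdv counts a feature as absent.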