relaxing the anomaly detection constraints in MLOps


I am stuck understanding this block of code. In my serving sets there are some anomalies.

I do not understand what the code below does to remove the anomalies.

payer_code = tfdv.get_feature(schema, 'payer_code')
payer_code.distribution_constraints.min_domain_mass = 0.9 

There are 2 answers below.

Answer from jerem_y:

These lines of code relax the minimum fraction of values that must come from the domain for the feature payer_code. In effect, they let you tolerate a fraction of values in your serving dataset that fall outside the domain recorded in your training schema, so the previously detected anomalies are treated as valid values.
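To make that concrete, here is a minimal sketch of where this relaxation sits in a typical TFDV workflow. The tiny pandas DataFrames are made-up stand-ins, not the actual course dataset, and it is assumed that before relaxation the inferred schema requires every serving value to be in the domain.

import pandas as pd
import tensorflow_data_validation as tfdv

# Made-up stand-ins for the real training and serving data.
train_df = pd.DataFrame({'payer_code': ['MC', 'MD', 'HM', 'BC', 'SP']})
serving_df = pd.DataFrame({'payer_code': ['MC', 'MD', 'HM', 'BC', 'SP',
                                          'MC', 'MD', 'HM', 'BC', 'FR']})  # 'FR' is unseen

# Schema inferred from the training statistics.
train_stats = tfdv.generate_statistics_from_dataframe(train_df)
schema = tfdv.infer_schema(train_stats)

# Validating the serving statistics against that schema flags 'FR' as
# ENUM_TYPE_UNEXPECTED_STRING_VALUES, because it is not in the domain.
serving_stats = tfdv.generate_statistics_from_dataframe(serving_df)
tfdv.display_anomalies(tfdv.validate_statistics(serving_stats, schema))

# Relax the constraint: at least 90% of serving values must come from the
# domain, so up to 10% may be previously unseen strings.
payer_code = tfdv.get_feature(schema, 'payer_code')
payer_code.distribution_constraints.min_domain_mass = 0.9

# Re-validate: 'FR' is 1 of 10 values (10%), within the allowance,
# so it is no longer reported as an anomaly.
tfdv.display_anomalies(tfdv.validate_statistics(serving_stats, schema))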

Answer from Karl Gardner:

The concept of detecting anomalies gets very complicated, and I was confused about the same thing: what min_domain_mass actually is. After a little research, this is my interpretation. According to the following reference: https://www.tensorflow.org/tfx/data_validation/anomalies, there are a number of different anomaly types that TFDV can report when validating statistics against a schema (e.g. with tfdv.validate_statistics()). In this case the detected anomaly is of type ENUM_TYPE_UNEXPECTED_STRING_VALUES, which has the detection condition:

"Either (number of values in rank_histogram* that are not in domain / total number of values) > (1 - feature.distribution_constraints.min_domain_mass)"

Now what does this mean? I searched the web and found the following resource: https://notebook.community/GoogleCloudPlatform/tf-estimator-tutorials/00_Miscellaneous/tfx/03_eda_with_tfdv, which was slightly helpful and used the phrase "Only allow 10% of the values to be new". So min_domain_mass is not about the particular string in the payer_code feature that triggered the anomaly (the string "FR" if you are using the Coursera MLOps specialization). The 90% is the required fraction of payer_code values in the serving set that must be strings from the schema's domain (the payer_code domain consists of the following strings: 'BC', 'CH', 'CM', 'CP', 'DM', 'HM', 'MC', 'MD', 'MP', 'OG', 'OT', 'PO', 'SI', 'SP', 'UN', 'WC'). Thus, across all the examples in the serving dataset, up to 10% of the payer_code values may be strings that are not in the schema domain.
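A small worked example of that detection condition, with made-up numbers rather than the actual course statistics, might look like this:

# Hypothetical counts, not taken from the course dataset.
min_domain_mass = 0.9
total_values = 1000        # serving values of payer_code
out_of_domain = 80         # e.g. occurrences of the unseen string 'FR'

fraction_out = out_of_domain / total_values          # 0.08
is_anomaly = fraction_out > (1 - min_domain_mass)    # 0.08 > 0.10 -> False
print(is_anomaly)  # False: 8% unseen values are tolerated with min_domain_mass = 0.9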

However, I do not know the difference between the min_fraction attribute (tfdv.get_feature(schema, 'payer_code').presence.min_fraction) and the min_domain_mass attribute. The detection condition for min_fraction is (features.common_stats.num_non_missing* / num_examples*) < feature.presence.min_fraction, while the detection condition for min_domain_mass is (number of values in rank_histogram* that are not in domain / total number of values) > (1 - feature.distribution_constraints.min_domain_mass). These look similar to me, both about the fraction of values that can be missing, but I would have to look into the rank_histogram attribute more to understand the difference.
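For what it is worth, one reading of the two quoted conditions is that they count different things: min_fraction counts examples where the feature is missing altogether, while min_domain_mass counts present values that fall outside the domain. A minimal sketch of setting both, using a tiny made-up schema rather than the course data:

import pandas as pd
import tensorflow_data_validation as tfdv

# Tiny made-up schema, just to show where each attribute is set.
schema = tfdv.infer_schema(
    tfdv.generate_statistics_from_dataframe(
        pd.DataFrame({'payer_code': ['MC', 'MD', 'HM']})))

payer_code = tfdv.get_feature(schema, 'payer_code')

# presence.min_fraction: how often the feature must appear at all.
# Anomaly if (num_non_missing / num_examples) < min_fraction,
# i.e. it counts examples where payer_code is missing entirely.
payer_code.presence.min_fraction = 0.95

# distribution_constraints.min_domain_mass: of the values that are present,
# what fraction must belong to the schema's domain.
# Anomaly if (values not in domain / total values) > 1 - min_domain_mass.
payer_code.distribution_constraints.min_domain_mass = 0.9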