How can I access the CTR values in CatBoost?


When using CatBoost on data with a categorical variable, CTR values are calculated during training for each value of this categorical feature. These values are then used to determine paths through the trees at prediction time. Given a trained model, how can I access these CTR values?

(Please note that our model uses non-symmetric trees, for which model export to Python or C++ is not supported.)

What I've tried / partial progress: I can see the CTR values in the JSON export, but these are stored next to the hash of each feature value, not the feature value itself. If I knew how the hash was calculated (and what exactly was hashed, i.e. is it just the feature name?) then I would have the CTR values.

popstack On

Solving this took some effort, so I'll answer here for others.

The CTR values are available in the JSON export of a Catboost model. Specifically, you can find the CTR values in jsonexport['ctr_data'][feature_identifier]['hash_map']. This is a list that looks like:

hash_value, ctr_0, ctr_1, ..., ctr_k, hash_value, ctr_0, ctr_1, ..., ctr_k, hash_value, ... 
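The flat layout above can be regrouped into a dictionary keyed by hash value. This is a minimal sketch, not CatBoost code: parse_hash_map and its stride parameter are names I'm introducing for illustration, where stride is the number of entries per record (one hash plus its counters). I believe the export stores this number next to hash_map as hash_stride, but treat that as an assumption and check your own file.

```python
def parse_hash_map(flat, stride):
    """Group a flat [hash, ctr_0, ..., ctr_k, hash, ...] list into {hash: [ctr_0, ..., ctr_k]}."""
    if stride < 2 or len(flat) % stride != 0:
        raise ValueError("list length must be a multiple of stride, and stride must be >= 2")
    table = {}
    for i in range(0, len(flat), stride):
        # The first entry of each record is the hash; store it as a string key.
        key = str(flat[i])
        # The remaining entries of the record are the raw counters.
        table[key] = list(flat[i + 1 : i + stride])
    return table


# Made-up data for illustration only: two records with two counters each.
example = ["111", 3, 7, "222", 0, 5]
counts = parse_hash_map(example, stride=3)
```

With a real model, flat would be jsonexport['ctr_data'][feature_identifier]['hash_map'], and the counters for one feature value are then a dictionary lookup on the key returned by the hash function.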

The hash values are hashes of the categorical feature values, while the integers ctr_i are raw counts, which are combined to form the true CTR values in the manner described here. The hash values are computed in the following manner:

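from cityhash import CityHash64
import numpy as np

MAX_INT = 0xffffffffffffffff  # mask for 64-bit unsigned arithmetic
MAGIC_MULT = 0x4906ba494954cb65

def feature_value_hash(fv):
    """
    For the provided feature value (a string), return its corresponding key in the hash table of CTR values (a string).
    """
    ch = CityHash64(fv.encode('ascii'))
    # Keep the 32 least significant bits and reinterpret them as a signed 32-bit integer.
    ch_32_lsb = (np.uint64(ch) & np.uint64(0xffffffff)).astype(np.int32)
    # Multiply twice by MAGIC_MULT, wrapping to 64 bits after each multiplication.
    return str((MAGIC_MULT * ((MAGIC_MULT * int(ch_32_lsb)) & MAX_INT)) & MAX_INT)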

Note that it is crucial to use this specific older version of Cityhash. The hash function above was deduced from the Python export and was correct for our model (which used counters of type "Buckets").
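As a cross-check on the arithmetic, the post-processing of the CityHash value can also be written without NumPy. This is only a sketch under the same assumptions as the function above: hash_key_from_cityhash is a name made up for this example, and its argument ch is expected to be the 64-bit integer returned by CityHash64 for the encoded feature value.

```python
MAX_INT = 0xffffffffffffffff      # mask for 64-bit unsigned arithmetic
MAGIC_MULT = 0x4906ba494954cb65   # constant taken from the answer above


def hash_key_from_cityhash(ch):
    """Turn a 64-bit CityHash value (an int) into the string key used in hash_map."""
    # Keep the 32 least significant bits.
    low = ch & 0xffffffff
    # Reinterpret them as a signed 32-bit integer (two's complement).
    if low >= 2**31:
        low -= 2**32
    # Multiply twice by MAGIC_MULT, wrapping to 64 bits after each multiplication.
    return str((MAGIC_MULT * ((MAGIC_MULT * low) & MAX_INT)) & MAX_INT)
```

Writing it this way makes the sign handling explicit: only the low 32 bits of the CityHash value matter, and values with bit 31 set become negative before the multiplications.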