SHAP KernelExplainer not accepting a DMatrix nor a numpy array


I am trying to make plots of a SHAP analysis of an XGBoost model I trained, similar to this.

However, I used the DART booster, so shap.TreeExplainer does not work. I am therefore trying shap.KernelExplainer, which should work for my case. However, it does not accept any common type of input.

My code is like this:

First Attempt

# Data to predict
full_data = xgb.DMatrix(full_X, label=full_y, feature_names=feature_names)

# Pre-trained XGB model using DART booster
loaded_model.set_param({"device": "cuda"})


xgb_predict = lambda x: loaded_model.predict(x)
explainer = shap.KernelExplainer(xgb_predict, full_data)

And I get:

TypeError: Unknown type passed as data object: <class 'xgboost.core.DMatrix'>

Second Attempt

I have also tried to provide a numpy array:

X_np = np.array(full_X)

explainer = shap.KernelExplainer(xgb_predict, X_np)

But it also returns an error:

TypeError: ('Expecting data to be a DMatrix object, got: ', <class 'numpy.ndarray'>)

I am using shap 0.44.0 and xgboost 2.0.2.

How can I resolve the problem?

1 Answer

Answered by Caio Atila

What is really happening

In case someone else faces this problem, here is what I found:

shap.KernelExplainer tries to convert the background data it receives (see the source code in here and here):

def convert_to_data(val, keep_index=False):
    if isinstance(val, Data):
        return val
    elif type(val) == np.ndarray:
        return DenseData(val, [str(i) for i in range(val.shape[1])])
    elif str(type(val)).endswith("'pandas.core.series.Series'>"):
        return DenseData(val.values.reshape((1,len(val))), list(val.index))
    elif str(type(val)).endswith("'pandas.core.frame.DataFrame'>"):
        if keep_index:
            return DenseDataWithIndex(val.values, list(val.columns), val.index.values, val.index.name)
        else:
            return DenseData(val.values, list(val.columns))
    elif sp.sparse.issparse(val):
        if not sp.sparse.isspmatrix_csr(val):
            val = val.tocsr()
        return SparseData(val)
    else:
        assert False, "Unknown type passed as data object: "+str(type(val))

So it simply does not recognize the xgboost.core.DMatrix type. If one passes a DataFrame or a numpy array instead, this conversion succeeds, but the call then fails when the data is handed to the model, because the model was trained on a DMatrix and its predict method expects one.

The workaround

To solve this, I passed a pandas DataFrame as the data argument to shap.KernelExplainer, and added a conversion to DMatrix inside the supplied prediction function:

def xgb_predict(X, model=loaded_model, features=feature_names):
    # KernelExplainer passes plain arrays/DataFrames; convert them
    # to the DMatrix that the Booster's predict method expects
    dmatrix = xgb.DMatrix(X, feature_names=features)
    return model.predict(dmatrix)

# full_X is a pandas DataFrame
explainer = shap.KernelExplainer(model=xgb_predict, data=full_X)