I'm trying to use XGBoost's categorical variable support, following XGBoost's own documentation for categorical data (https://xgboost.readthedocs.io/en/stable/tutorials/categorical.html). Although I cast the string features to the category data type and set the enable_categorical parameter to True when defining the regressor, fitting keeps throwing the error shown further down. I would appreciate your help.
Info of the training data:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 column1 96529 non-null category
1 column2 96529 non-null category
2 column3 96529 non-null category
3 column4 96529 non-null category
4 column5 96529 non-null category
5 column6 96529 non-null float64
6 column7 96529 non-null float64
7 column8 96529 non-null float64
8 column9 96529 non-null float64
9 column10 96529 non-null int64
10 column11 96529 non-null int64
11 column12 96529 non-null int64
12 column13 96529 non-null float64
13 column14 96529 non-null float64
14 column15 96529 non-null float64
15 column16 96529 non-null float64
16 column17 96529 non-null float64
17 column18 96529 non-null int64
18 column19 96529 non-null float64
19 column20 96529 non-null int64
20 column21 96529 non-null float64
21 column22 96529 non-null float64
22 column23 96529 non-null float64
23 column24 96529 non-null float64
dtypes: category(5), float64(14), int64(5)
Code I'm trying:
import xgboost as xgb

xgb_model = xgb.XGBRegressor(tree_method="gpu_hist", enable_categorical=True, max_depth=128, n_estimators=1000, min_child_weight=25, learning_rate=0.025)
xgb_model.fit(X_train, y_train, early_stopping_rounds=10, eval_set=[(X_val, y_val)])
Error:
ValueError: DataFrame.dtypes for data must be int, float, bool or categorical. When
categorical type is supplied, DMatrix parameter
`enable_categorical` must be set to `True`.column1, column2, column3, column4, column5
Could it be that you cast the first five columns of the dataset to the category data type, and then performed a train-test split on it? If so, it could be that the sklearn.model_selection.train_test_split utility method simply performed a reverse cast from Pandas' category data type back to good old NumPy's object data type.
The idea is that Scikit-Learn utility methods default to NumPy data containers. The category data type can't survive there (it requires a Pandas data container), so it gets undone.
TL;DR: print out the data types of your X_train object to double-check that they really are still category at that point.
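The check suggested above can be sketched roughly as follows. This is only an illustration on a toy frame, not your actual data: the helper names find_non_categorical and recast_to_category are made up for this example, and only the column names column1 and column6 are borrowed from the question. It lists the columns that were expected to be category but are not, and casts them back before fitting.

```python
import pandas as pd


def find_non_categorical(df, expected_categorical):
    """Return the expected-categorical columns whose dtype is not 'category'.

    Hypothetical helper for illustration; not part of pandas or XGBoost.
    """
    return [
        col
        for col in expected_categorical
        if not isinstance(df[col].dtype, pd.CategoricalDtype)
    ]


def recast_to_category(df, columns):
    """Return a copy of df with the given columns cast to the category dtype."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("category")
    return out


# Toy stand-in for X_train: column1 holds strings that were never cast
# (or whose cast was lost along the way); column6 is a plain float column.
X_train = pd.DataFrame(
    {"column1": ["a", "b", "a"], "column6": [0.1, 0.2, 0.3]}
)

# The actual check: inspect the dtypes right before calling fit().
print(X_train.dtypes)

bad_columns = find_non_categorical(X_train, ["column1"])
X_train_fixed = recast_to_category(X_train, bad_columns)
print(X_train_fixed.dtypes)
```

Since the fit call in the question also passes eval_set=[(X_val, y_val)], it is probably worth running the same check on X_val, on the assumption that the validation frame went through the same preprocessing.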