I'm trying to use XGBoost's categorical variable support, following XGBoost's own documentation for categorical data (https://xgboost.readthedocs.io/en/stable/tutorials/categorical.html). Although I set the data type of the string features to category and pass enable_categorical=True when defining the regressor, fitting the model keeps throwing the error shown below. I would appreciate your help.

Info of the training data:

 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   column1                   96529 non-null  category
 1   column2                   96529 non-null  category
 2   column3                   96529 non-null  category
 3   column4                   96529 non-null  category
 4   column5                   96529 non-null  category
 5   column6                   96529 non-null  float64 
 6   column7                   96529 non-null  float64 
 7   column8                   96529 non-null  float64 
 8   column9                   96529 non-null  float64 
 9   column10                  96529 non-null  int64   
 10  column11                  96529 non-null  int64   
 11  column12                  96529 non-null  int64   
 12  column13                  96529 non-null  float64 
 13  column14                  96529 non-null  float64 
 14  column15                  96529 non-null  float64 
 15  column16                  96529 non-null  float64 
 16  column17                  96529 non-null  float64 
 17  column18                  96529 non-null  int64   
 18  column19                  96529 non-null  float64 
 19  column20                  96529 non-null  int64   
 20  column21                  96529 non-null  float64 
 21  column22                  96529 non-null  float64 
 22  column23                  96529 non-null  float64 
 23  column24                  96529 non-null  float64 
dtypes: category(5), float64(14), int64(5)

Code I'm trying:

import xgboost as xgb

xgb_model = xgb.XGBRegressor(tree_method="gpu_hist", enable_categorical=True, max_depth=128, n_estimators=1000, min_child_weight=25, learning_rate=0.025)

xgb_model.fit(X_train, y_train, early_stopping_rounds=10, eval_set=[(X_val, y_val)])

Error:

ValueError: DataFrame.dtypes for data must be int, float, bool or categorical.  When
                categorical type is supplied, DMatrix parameter
                `enable_categorical` must be set to `True`.column1, column2, column3, column4, column5

user1808924

Could it be that you cast the first five columns of the dataset to the category data type and then performed a train-test split on it? If so, the sklearn.model_selection.train_test_split utility method may simply have cast the pandas category data type back to good old NumPy's object data type.

The idea is that scikit-learn utility methods default to NumPy data containers. The category data type can't survive there (it requires a pandas data container), so the cast gets undone.

TL;DR: print the data types of your X_train object to double-check that the columns really are still category at that point.
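
A minimal sketch of that check, assuming the frame is a pandas DataFrame. The helpers unsupported_columns and recast_as_category are hypothetical names introduced for illustration, and the tiny X_train below is a toy stand-in for the real training data, not the asker's dataset. It lists the columns whose dtype is neither numeric, bool, nor category, and re-casts them if any turn up:

```python
import pandas as pd


def unsupported_columns(df):
    """Return names of columns that are not numeric, bool, or category."""
    bad = []
    for name, dtype in df.dtypes.items():
        if isinstance(dtype, pd.CategoricalDtype):
            continue  # already a pandas category column
        if pd.api.types.is_numeric_dtype(dtype):
            continue  # int, float, and bool all count as numeric here
        bad.append(name)
    return bad


def recast_as_category(df, columns):
    """Return a copy of df with the given columns cast to 'category'."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("category")
    return out


# Toy stand-in for X_train: column1 has lost its category dtype.
X_train = pd.DataFrame({
    "column1": pd.Series(["a", "b", "a"], dtype="object"),
    "column6": [0.1, 0.2, 0.3],
    "column10": [1, 2, 3],
})

print(X_train.dtypes)                  # inspect dtypes right before fit()
bad = unsupported_columns(X_train)     # columns that need re-casting
X_fixed = recast_as_category(X_train, bad)
print(X_fixed.dtypes)
```

If the check comes back empty on the real X_train and X_val, the dtypes are not the problem and the cause lies elsewhere. Note that the validation frame passed through eval_set needs the same check.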