I'm trying to use the grid_search method from the CatBoost library to pick the best parameters and do cross-validation, but for some reason it doesn't work well: the roc_auc_score is awful and the confusion matrix shows that the second class is almost never predicted.
The dataset is imbalanced, so I use SMOTE for oversampling. I also use get_dummies to encode the categorical features so that I can make some visualisations of feature correlations.
I built a simple CatBoost model without grid_search and it did really well, but when I tried grid_search the roc_auc_score dropped to 0.5. Maybe there is a conflict between grid_search, SMOTE and get_dummies.
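To show what I mean, here is a minimal sketch of the fold-wise alternative I was considering: SMOTE applied only inside each CV fold via an imblearn Pipeline with scikit-learn's GridSearchCV instead of CatBoost's grid_search (the parameter values are just examples, and I haven't run this yet):
# Sketch only: resampling happens inside each training fold, so the validation folds stay untouched
from catboost import CatBoostClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ('smote', SMOTE(random_state=42)),   # applied to the training folds only
    ('model', CatBoostClassifier(eval_metric='AUC', verbose=False))
])
param_grid = {
    'model__learning_rate': [0.01, 0.1, 0.3],
    'model__iterations': [100, 200],
    'model__depth': [6, 8]
}
search = GridSearchCV(pipe, param_grid, scoring='roc_auc', cv=3)
# search.fit(X, y)  # X here would be the one-hot encoded features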
from catboost import CatBoostClassifier, Pool
from imblearn.over_sampling import SMOTE
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
import pandas as pd

# Split the training and test datasets into features and the target variable
X = df_train.drop(['ID', 'MARKER'], axis=1)
y = df_train['MARKER']
answer = df_test['MARKER']
df_test.drop(['ID', 'MARKER'], axis=1, inplace=True)
# Encoding
X = pd.get_dummies(X, drop_first=True)
df_test = pd.get_dummies(df_test, drop_first=True)
# Splitting the dataset into train and test parts and using SMOTE
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
catboost_model_1 = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=6,
    eval_metric='AUC:hints=skip_train~false'
)
catboost_model_1.fit(X_train_resampled, y_train_resampled,
                     eval_set=(X_test, y_test),
                     verbose=False,
                     plot=True)
roc_auc_score(answer, catboost_model_1.predict(df_test))
0.7308330531593068
confusion_matrix(answer, catboost_model_1.predict(df_test))
array([[34696, 3563],
[ 65, 81]], dtype=int64)
catboost_model_2 = CatBoostClassifier(eval_metric='AUC:hints=skip_train~false', logging_level='Silent')
grid_params = {'learning_rate': [0.01, 0.05, 0.1, 0.3, 0.5, 0.7],
               'iterations': [50, 100, 200, 400],
               'depth': [6, 8, 10]}
catboost_model_2.grid_search(grid_params, X_train_resampled, y_train_resampled, plot=True, verbose=False, refit=True)
roc_auc_score(answer, catboost_model_2.predict(df_test))
0.4994968694096601
confusion_matrix(answer, catboost_model_2.predict(df_test))
array([[38222, 37],
[ 145, 1]], dtype=int64)
I also tried to use grid_search without get_dummies and SMOTE, adding the stratified=True argument and passing a Pool with categorical features instead of encoding them, but that did not work either: the roc_auc_score was about 0.508.
X = df_train.drop(['ID', 'MARKER'], axis=1)
y = df_train['MARKER']
answer = df_test['MARKER']
df_test.drop(['ID', 'MARKER'], axis=1, inplace=True)
pool = Pool(X, y,
            cat_features=['Sex', 'Region', 'Job_title', 'Education', 'Marriage',
                          'Children', 'Property', 'R', 'Employment_status',
                          'T', 'U', 'V', 'W', 'X'])
catboost_model_3 = CatBoostClassifier(eval_metric='AUC:hints=skip_train~false', logging_level='Silent')
grid_params = {'learning_rate': [0.01, 0.05, 0.1, 0.3, 0.5, 0.7],
               'iterations': [50, 100, 200, 400],
               'depth': [6, 8, 10]}
catboost_model_3.grid_search(grid_params, pool, stratified=True, plot=True, verbose=False, refit=True)
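If it helps, the same call can also be captured to check which parameters were chosen; as far as I understand, grid_search returns a dict with 'params' and 'cv_results' (that is an assumption on my part):
# Same call as above, just keeping the return value to inspect the chosen parameters
result = catboost_model_3.grid_search(grid_params, pool, stratified=True,
                                      plot=True, verbose=False, refit=True)
print(result['params'])  # best parameter combination according to the internal CV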