I'm trying to use the grid_search method from the CatBoost library to pick the best parameters and do cross-validation, but for some reason it doesn't work well: the roc_auc_score is awful and the confusion matrix shows that the second class is almost never predicted.
The dataset is imbalanced, so I use SMOTE for oversampling. I also use get_dummies to encode the categorical features so that I can make some visualisations of feature correlations.
I built a simple CatBoost model without grid_search and it did really well, but when I tried grid_search the roc_auc_score dropped to 0.5. Maybe there is a conflict between grid_search, SMOTE and get_dummies.
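To show what I mean, here is a minimal sketch of the fold-wise alternative I was considering: SMOTE applied only inside each CV fold via an imblearn Pipeline with scikit-learn's GridSearchCV instead of CatBoost's grid_search (the parameter values are just examples, and I haven't run this yet):
# Sketch only: resampling happens inside each training fold, so the validation folds stay untouched
from catboost import CatBoostClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ('smote', SMOTE(random_state=42)),   # applied to the training folds only
    ('model', CatBoostClassifier(eval_metric='AUC', verbose=False))
])
param_grid = {
    'model__learning_rate': [0.01, 0.1, 0.3],
    'model__iterations': [100, 200],
    'model__depth': [6, 8]
}
search = GridSearchCV(pipe, param_grid, scoring='roc_auc', cv=3)
# search.fit(X, y)  # X here would be the one-hot encoded features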
from catboost import CatBoostClassifier, Pool
from imblearn.over_sampling import SMOTE
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
import pandas as pd

# Split the training and test datasets into features and the target variable
X = df_train.drop(['ID', 'MARKER'], axis=1)
y = df_train['MARKER']
answer = df_test['MARKER']
df_test.drop(['ID', 'MARKER'], axis=1, inplace=True)
# Encoding
X = pd.get_dummies(X, drop_first=True)
df_test = pd.get_dummies(df_test, drop_first=True)
# Splitting the dataset into train and test parts and using SMOTE
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
catboost_model_1 = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=6,
    eval_metric='AUC:hints=skip_train~false'
)
catboost_model_1.fit(X_train_resampled, y_train_resampled,
                     eval_set=(X_test, y_test),
                     verbose=False,
                     plot=True)
roc_auc_score(answer, catboost_model_1.predict(df_test))
0.7308330531593068
confusion_matrix(answer, catboost_model_1.predict(df_test))
array([[34696, 3563],
[ 65, 81]], dtype=int64)
catboost_model_2 = CatBoostClassifier(eval_metric='AUC:hints=skip_train~false', logging_level='Silent')
grid_params = {'learning_rate': [0.01, 0.05, 0.1, 0.3, 0.5, 0.7],
               'iterations': [50, 100, 200, 400],
               'depth': [6, 8, 10]}
catboost_model_2.grid_search(grid_params, X_train_resampled, y_train_resampled, plot=True, verbose=False, refit=True)
roc_auc_score(answer, catboost_model_2.predict(df_test))
0.4994968694096601
confusion_matrix(answer, catboost_model_2.predict(df_test))
array([[38222, 37],
[ 145, 1]], dtype=int64)
I also tried to use grid_search without get_dummies and SMOTE, adding the stratified=True argument and passing a Pool with categorical features instead of encoding them, but that did not work either: the roc_auc_score was about 0.508.
X = df_train.drop(['ID', 'MARKER'], axis=1)
y = df_train['MARKER']
answer = df_test['MARKER']
df_test.drop(['ID', 'MARKER'], axis=1, inplace=True)
pool = Pool(X, y,
            cat_features=['Sex', 'Region', 'Job_title', 'Education', 'Marriage',
                          'Children', 'Property', 'R', 'Employment_status',
                          'T', 'U', 'V', 'W', 'X'])
catboost_model_3 = CatBoostClassifier(eval_metric='AUC:hints=skip_train~false', logging_level='Silent')
grid_params = {'learning_rate': [0.01, 0.05, 0.1, 0.3, 0.5, 0.7],
               'iterations': [50, 100, 200, 400],
               'depth': [6, 8, 10]}
catboost_model_3.grid_search(grid_params, pool, stratified=True, plot=True, verbose=False, refit=True)
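If it helps, the same call can also be captured to check which parameters were chosen; as far as I understand, grid_search returns a dict with 'params' and 'cv_results' (that is an assumption on my part):
# Same call as above, just keeping the return value to inspect the chosen parameters
result = catboost_model_3.grid_search(grid_params, pool, stratified=True,
                                      plot=True, verbose=False, refit=True)
print(result['params'])  # best parameter combination according to the internal CV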