I'm training a binary classification model in R with the xgboost package, and I use the ParBayesianOptimization package to tune some of the hyperparameters. Here is my code.
# dtrain is my input: 423 samples with 247 features (373 positive samples labeled 1 and 50 negative samples labeled 0)
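# NOTE: xgb.cv() below needs the labels, so dtrain is assumed to be an
# xgb.DMatrix rather than a plain matrix. A minimal sketch of how it could
# be built, where X (the 423 x 247 feature matrix) and y (the 0/1 label
# vector) are hypothetical names:
#   library(xgboost)
#   dtrain <- xgb.DMatrix(data = X, label = y)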
scoring_function <- function(eta, gamma, max_depth, colsample_bytree, alpha, nfold) {
  pars <- list(
    eta = eta,
    gamma = gamma,
    max_depth = max_depth,
    # min_child_weight = min_child_weight,
    colsample_bytree = colsample_bytree,
    alpha = alpha,
    # subsample = subsample,
    objective = "binary:logistic",
    eval_metric = "auc",
    verbosity = 0
  )
  xgbcv <- xgb.cv(
    params = pars,
    data = dtrain,
    nfold = nfold,
    nrounds = 100,
    prediction = TRUE,
    showsd = TRUE,
    early_stopping_rounds = 10,
    maximize = TRUE,
    stratified = TRUE
  )
  return(
    list(
      Score = max(xgbcv$evaluation_log$test_auc_mean),
      nrounds = xgbcv$best_iteration
    )
  )
}
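# Note on the return value: bayesOpt() maximizes the element named Score;
# any additional named elements (here nrounds) are not optimized but are
# recorded per iteration in opt_obj$scoreSummary, which is how the final
# nrounds is recovered further down.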
# set bounds for the parameters
bounds <- list(
  eta = c(0.1, 0.5),
  gamma = c(0, 1.9),
  max_depth = c(3L, 10L),          # integer bounds (the L suffix) are sampled as integers
  alpha = c(0, 0.7),
  colsample_bytree = c(0.1, 0.4),  # must be strictly positive, so the lower bound is raised from 0
  nfold = c(3L, 5L)
)
# Perform the search
library(ParBayesianOptimization)
set.seed(123)
time_noparallel <- system.time(
  opt_obj <- bayesOpt(
    FUN = scoring_function,
    bounds = bounds,
    initPoints = 7,
    iters.n = 8
  )
)
besttune <- getBestPars(opt_obj)
# use the best parameters to build the final model
params_1 <- list(
  eta = besttune$eta,
  gamma = besttune$gamma,
  max_depth = besttune$max_depth,
  objective = "binary:logistic",
  eval_metric = "auc",
  colsample_bytree = besttune$colsample_bytree,
  alpha = besttune$alpha  # alpha was tuned above, so it is passed on here as well
  # min_child_weight = besttune$min_child_weight,
  # subsample = besttune$subsample
)
# take the nrounds recorded at the best-scoring iteration(s); indexing one
# table keeps Score and nrounds aligned row by row
ss <- opt_obj$scoreSummary
best <- which(ss$Score == max(ss$Score, na.rm = TRUE))
nrounds <- max(ss$nrounds[best], na.rm = TRUE)
set.seed(123)
xgb2 <- xgb.train(
  params = params_1,
  data = dtrain,
  # dtest is a held-out validation xgb.DMatrix built the same way as dtrain
  watchlist = list(val = dtest, train = dtrain),
  nrounds = nrounds,
  print_every_n = 10,
  maximize = FALSE
)
But when I check the feature importance matrix of the model xgb2, I find that it contains only 20 features, while my classmate, who uses random forest on the same data, ends up with more than 30 features. Is there anything wrong with my procedure? If I want my model to pick features similar to those in the random forest model, how should I tune the hyperparameters? I would like to stick with the ParBayesianOptimization package if possible.
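For reference, this is how I check the importance matrix (a minimal sketch using xgboost's xgb.importance; xgb2 is the model trained above):

importance <- xgb.importance(model = xgb2)
nrow(importance)   # number of features used in at least one split
head(importance)   # per-feature Gain / Cover / Frequency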
In short, I want my xgboost model to use features similar to the ones the random forest model selects on the same dataset.