I designed a method that splits my data into 5 folds, and now I want to use those folds to perform 5-fold cross-validation.
from load_data import load_data
folds, test_samples, input_shape = load_data()
folds[0].keys()
# dict_keys(['train', 'val', 'test'])
To use these specific 5 folds to optimize a GBM model with any optimization method (e.g., RandomSearch, GridSearch, ...), I need to train 5 models per hyper-parameter configuration and then evaluate the model performance across folds.
One way to do that is to iterate over the folds and train a model on each one using
early_stopping = lgb.early_stopping(stopping_rounds=10)
model = lgb.LGBMClassifier()
model.fit(X, y, callbacks=[early_stopping],...)
Another way I found is lgb.cv, but it does not seem to let me plug my own folds in.
Does anyone have an idea how to use lgb.cv with my own splits instead of its internal splitting?
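For reference, recent LightGBM versions expose a folds argument on lgb.cv that accepts an iterable of (train_idx, valid_idx) index pairs. A minimal sketch, assuming the custom splits are kept as row indices in a hypothetical fold_indices list over combined arrays X_all / y_all:
import lightgbm as lgb

params = {'objective': 'binary', 'metric': ['auc', 'binary_logloss']}
# X_all, y_all and fold_indices are hypothetical names: the combined data
# and a list of (train_idx, valid_idx) tuples built from the custom folds
train_set = lgb.Dataset(X_all, label=y_all)
cv_results = lgb.cv(params,
                    train_set,
                    folds=fold_indices,
                    callbacks=[lgb.early_stopping(stopping_rounds=10)])
# cv_results is a dict of per-round metric lists (mean/std across folds);
# exact key names vary between LightGBM versions
Note this only covers the train/val pairs; the separate test split of each fold would still need to be evaluated outside lgb.cv.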
Here is a code snippet for one configuration:
import numpy as np
import lightgbm as lgb
from sklearn.metrics import roc_auc_score, accuracy_score
from timeit import default_timer as timer

for i, fold in enumerate(folds):
    print('Fold', i + 1)
    # Unpack the splits explicitly instead of relying on dict ordering
    train, val, test = folds[fold]['train'], folds[fold]['val'], folds[fold]['test']
    early_stopping = lgb.early_stopping(stopping_rounds=10)
    model = lgb.LGBMClassifier()
    start = timer()
    model.fit(train['x'], train['y'],
              callbacks=[early_stopping],
              eval_set=[
                  (train['x'], train['y']),
                  (val['x'], val['y']),
                  (test['x'], test['y'])],
              eval_names=['train', 'val', 'test'],
              eval_metric=['auc', 'binary_logloss'],
              feature_name=feat_names)  # feat_names is defined elsewhere
    train_time = timer() - start
    # Make predictions on the validation split
    predictions = model.predict_proba(val['x'])
    auc = roc_auc_score(val['y'], predictions[:, 1])
    acc = accuracy_score(val['y'], np.argmax(predictions, axis=1))
    print('The accuracy on the validation set is {:.4f}.'.format(acc))
    print('The AUC on the validation set is {:.4f}.'.format(auc))
    print('The training time is {:.4f} seconds.'.format(train_time))
How could I adapt this loop for hyper-parameter optimization (e.g., RandomSearch)?
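One hand-rolled sketch, as an illustration rather than a ready-made recipe: sample a configuration from a placeholder search space, average the validation AUC over the 5 folds, and keep the best configuration. All parameter ranges below are assumptions, not tuned recommendations.
import random
import numpy as np
import lightgbm as lgb
from sklearn.metrics import roc_auc_score

# Placeholder search space -- adjust ranges to your problem
param_space = {
    'num_leaves': [15, 31, 63, 127],
    'learning_rate': [0.01, 0.05, 0.1],
    'min_child_samples': [10, 20, 50],
}

best_auc, best_params = -np.inf, None
for trial in range(20):  # number of random configurations to try
    params = {k: random.choice(v) for k, v in param_space.items()}
    fold_aucs = []
    for fold in folds:
        train, val = folds[fold]['train'], folds[fold]['val']
        model = lgb.LGBMClassifier(**params)
        model.fit(train['x'], train['y'],
                  eval_set=[(val['x'], val['y'])],
                  eval_metric='auc',
                  callbacks=[lgb.early_stopping(stopping_rounds=10)])
        proba = model.predict_proba(val['x'])[:, 1]
        fold_aucs.append(roc_auc_score(val['y'], proba))
    mean_auc = np.mean(fold_aucs)
    if mean_auc > best_auc:
        best_auc, best_params = mean_auc, params

print('Best mean validation AUC: {:.4f} with {}'.format(best_auc, best_params))
Alternatively, sklearn's RandomizedSearchCV accepts an iterable of (train_idx, val_idx) index pairs via its cv argument, so the same precomputed folds could be reused there if the row indices are available.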