I am trying to build a model with two stages: an unsupervised and a supervised learning step. First, I want to perform dimensionality reduction with kernel PCA (KPCA) and extract the main components. Then I would like to fit an XGBoost classifier on the reduced features to model a target variable. As a first step, I try to determine the optimal hyperparameters for the KPCA using cross-validation. My initial approach was as follows:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

pipeline = Pipeline([('scaler', StandardScaler()),
                     ('kpca', KernelPCA()),
                     ('xgb', XGBClassifier())])
param_grid = {'kpca__n_components': [3, 4, 6, 8],
              'kpca__kernel': ['linear', 'rbf', 'poly'],
              'kpca__gamma': np.linspace(0.03, 0.05, 10)}
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc')
grid_search.fit(X, y)
My data set X consists of several metric variables (company key figures), which are to be standardized in the first step, and a few categorical variables that have already been encoded with pd.get_dummies. The target variable y is a binary indicator for the event whose probability the XGBoost model should later estimate.
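For illustration, X and y are assembled roughly like this (the file name and column names are just placeholders for the real data):

import pandas as pd

df = pd.read_csv('company_figures.csv')                    # placeholder file name
numeric_cols = ['revenue', 'ebit_margin', 'equity_ratio']  # metric key figures (placeholders)
categorical_cols = ['industry', 'legal_form']              # categorical variables (placeholders)

X = pd.concat([df[numeric_cols],
               pd.get_dummies(df[categorical_cols])], axis=1)
y = df['event_flag']                                        # binary target (placeholder)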
Does it make sense to define both procedures in a single pipeline, or should the KPCA be tuned separately from the downstream classifier? And if it should be tuned separately, what is the right way to define the scoring parameter of GridSearchCV, given that the KPCA step produces no predictions that could be scored against y? Does the scoring then have to be customized, as described in this thread:
https://github.com/ageron/handson-ml/issues/629
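For context, this is a minimal sketch of the kind of custom scorer I understand that thread to describe, i.e. scoring the KPCA by its reconstruction error in the original feature space. The scorer name and the use of fit_inverse_transform=True are my own additions, not part of my current code:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import GridSearchCV

def reconstruction_scorer(estimator, X, y=None):
    # Score the (scaler + KPCA) pipeline by how well it reconstructs X;
    # negated MSE so that GridSearchCV can maximize the score.
    X_back = estimator.inverse_transform(estimator.transform(X))
    return -np.mean((np.asarray(X) - X_back) ** 2)

kpca_pipe = Pipeline([('scaler', StandardScaler()),
                      ('kpca', KernelPCA(fit_inverse_transform=True))])
kpca_grid = {'kpca__n_components': [3, 4, 6, 8],
             'kpca__kernel': ['linear', 'rbf', 'poly'],
             'kpca__gamma': np.linspace(0.03, 0.05, 10)}

kpca_search = GridSearchCV(kpca_pipe, kpca_grid, cv=5, scoring=reconstruction_scorer)
kpca_search.fit(X)  # no y needed for this unsupervised search

Is something along these lines the intended way to do it, or is keeping everything in one pipeline scored with 'roc_auc' preferable?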