I'm having an issue with the preProc argument for the train function using the R caret package. I want to center and scale my predictors but ignore the factor columns. When I preProcess outside of train, it works fine but I'm hoping to pre-process within the train function. Am I missing something?
Below is an example where the factor predictor is ignored when using preProcess outside of train.
library(caret)

df <- data.frame(
  score = runif(1000, 80, 110),
  var1 = as.factor(sample(0:1, 1000, replace = TRUE)),
  var2 = runif(1000, 5, 25)
)
# Drop the outcome (column 1) and pre-process the predictors only
preProcess(df[-1], method = c('center', 'scale'))
Created from 1000 samples and 2 variables
Pre-processing:
- centered (1)
- ignored (1)
- scaled (1)
Here is what happens when I use preProc inside of train:
df <- data.frame(
  score = runif(1000, 80, 110),
  var1 = as.factor(sample(0:1, 1000, replace = TRUE)),
  var2 = runif(1000, 5, 25)
)
mod <- train(score ~ ., data = df,
             method = "lm",
             preProc = c("center", "scale"))
mod$preProcess
Created from 1000 samples and 2 variables
Pre-processing:
- centered (2)
- ignored (0)
- scaled (2)
Your call is dispatched to train.formula, where your data is converted to a matrix with the expression model.matrix(Terms, m, contrasts). Since your data are now in matrix form, and matrices are atomic, the values are all coerced to the same type, in this instance double. This also has the side effect of renaming var1 to var11, which you can see if you inspect the mod$preProcess output (e.g. mod$preProcess$mean). That comes from model.matrix naming each dummy column by appending the factor level to the variable name (var1 plus level 1), and it is not related to your question.
It appears the class information is captured before this matrix conversion and ultimately returned in the results via the ptype element, but it is not used for anything other than being output.
However, the model matrix is what gets passed to train.default, which then goes on to run preProcess(). By the time it reaches that step, the factor information is already stripped and that variable is of type double. As you noted, preProcess() does a series of checks and only operates on numeric data (numeric in the sense that it is of class "integer", "numeric", or "double"). So when preProcess() is called via train(), your values are already double, which is why they get centered and scaled.
By contrast, the same conversion to a matrix is not made when you call preProcess() directly, so the factor class is caught and the column is ignored before centering and scaling.
The preProcess argument documentation in ?train says that the pre-processing code is only designed to work when x is a simple matrix or data frame. I think this is what they are getting at: "simple" meaning all values are of the same class. If they are not of the same class, they will ultimately be coerced.
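The coercion described above can be reproduced with base R alone. This is a minimal sketch that calls model.matrix() directly on a small data frame shaped like the one in the question. It is not the exact code path inside train.formula, and the seed and the 10-row size are arbitrary choices for illustration.

```r
# Small data frame with the same column layout as in the question
set.seed(1)
df_small <- data.frame(
  score = runif(10, 80, 110),
  var1  = factor(rep(0:1, 5)),
  var2  = runif(10, 5, 25)
)

# Build the design matrix the way a formula interface typically does
mm <- model.matrix(score ~ ., data = df_small)

colnames(mm)   # the factor shows up as a dummy column named "var11"
typeof(mm)     # the whole matrix is stored as double
is.factor(mm[, "var11"])  # FALSE: the factor class is gone
```

Once the factor has become a 0/1 double column, a check that looks only at the storage type has no way to tell it apart from a genuinely continuous predictor.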
Long story short, I think you ought to either pass the preprocessed data to train() or create a recipe and pass that to train().
If you go the recipe route, you should read the documentation carefully to see whether factors are included in all_numeric_predictors(); I am not sure off the top of my head.
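Both workarounds can be sketched as follows. This is an illustration under assumptions, not the original answer's code: it assumes the caret and recipes packages are installed, uses an arbitrary seed, and the names pp, df_pp, rec, baked, mod_manual and mod_recipe are made up for the example. The str() calls are there so you can confirm for yourself that var1 stays a factor.

```r
library(caret)
library(recipes)

set.seed(42)
df <- data.frame(
  score = runif(1000, 80, 110),
  var1  = as.factor(sample(0:1, 1000, replace = TRUE)),
  var2  = runif(1000, 5, 25)
)

# Option 1: pre-process outside train(), then fit on the transformed data.
pp    <- preProcess(df[-1], method = c("center", "scale"))
df_pp <- predict(pp, df[-1])   # the factor column is left untouched
df_pp$score <- df$score
str(df_pp)
mod_manual <- train(score ~ ., data = df_pp, method = "lm")

# Option 2: describe the pre-processing as a recipe and hand it to train().
rec <- recipe(score ~ ., data = df) %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors())

# Inspect what the recipe does to the training data before modelling.
baked <- bake(prep(rec, training = df), new_data = NULL)
str(baked)

# With a recipe, train() takes the recipe first and the raw data second.
mod_recipe <- train(rec, data = df, method = "lm")
```

With option 1, keep the pp object so the same centering and scaling can be applied to new data with predict(pp, newdata) before predicting from the model.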