I'm a beginner in machine learning, and I'm trying to fit a logistic regression on the Titanic data set from Kaggle. I want to impute the Age variable using the titles (Mr, Master, Miss, etc.) contained in the Name variable, so that each missing Age value is replaced by the median age of passengers with the same title.
My problem is that I also want to use cross-validation with the caret R package, and I want to avoid information leakage, so I think the imputation has to be computed separately within each fold, using only that fold's training rows. I just can't figure out how to do it.
I can impute the training set the way I want to, but I don't know how to do it inside the preProcess function.
This is what I want to happen, except that here it is applied to the entire training data at once:
# Here's some sample data:
set.seed(1)  # make the sample reproducible
n <- 800
data <- data.frame(
  Pclass = as.factor(sample(1:3, n, replace = TRUE)),
  Survived = as.factor(sample(0:1, n, replace = TRUE)),
  Sex = as.factor(sample(c("male", "female"), n, replace = TRUE)),
  Age = sample(c(6, 9, 26, 40, 45, 58, NA), n, replace = TRUE),
  Fare = runif(n, min = 5, max = 40),
  Embarked = as.factor(sample(c("S", "C", "Q"), n, replace = TRUE)),
  Family = sample(0:2, n, replace = TRUE)
)
data$Title <- ifelse(data$Sex == "female", "Miss",
                     ifelse(data$Age > 14 & !is.na(data$Age), "Mr", "Master"))
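# Here's how I impute Age based on Title
# (this uses the group mean; median() can be swapped in the same way):
library(dplyr)
data <- data %>%
  group_by(Title) %>%
  mutate(Mean_age = round(mean(Age, na.rm = TRUE))) %>%
  mutate(Age = ifelse(is.na(Age), Mean_age, Age)) %>%
  ungroup() %>%  # without this, select() cannot drop the grouping column Title
  select(-c(Mean_age, Title))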
I want the imputation to happen inside the train function, so that each fold of the cross-validation gets its own imputation.
library(caret)
# trn: the training data frame (not defined in this snippet)
trctrl <- trainControl(method = "repeatedcv",
                       number = 5,
                       repeats = 5)
fit1 <- train(Survived ~ ., data = trn,
              trControl = trctrl,
              method = "glm",
              family = "binomial",
              preProc = c('center', 'scale', 'nzv'),
              na.action = na.pass
)
summary(fit1)
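To make the goal concrete, here is a minimal sketch of the per-fold behaviour done by hand in base R, outside caret. It is only an illustration of what "each fold gets its own imputation" means, not a caret solution. The data frame `toy` and the helpers `learn_title_ages` and `apply_title_ages` are hypothetical names made up for this example, the folds are assigned manually (plain 5-fold, no repeats), and the 0.5 cut-off is an arbitrary choice.

```r
set.seed(42)

# Hypothetical toy data shaped like the sample above (not the real Titanic set)
n <- 800
toy <- data.frame(
  Survived = factor(sample(0:1, n, replace = TRUE)),
  Sex      = factor(sample(c("male", "female"), n, replace = TRUE)),
  Age      = sample(c(6, 9, 26, 40, 45, 58, NA), n, replace = TRUE),
  Fare     = runif(n, min = 5, max = 40)
)
toy$Title <- ifelse(toy$Sex == "female", "Miss",
                    ifelse(toy$Age > 14 & !is.na(toy$Age), "Mr", "Master"))

# Learn one rounded mean age per Title from the rows passed in
learn_title_ages <- function(train) {
  sapply(split(train$Age, train$Title),
         function(a) round(mean(a, na.rm = TRUE)))
}

# Fill missing Age values using per-Title ages learned elsewhere
apply_title_ages <- function(df, title_ages) {
  missing <- is.na(df$Age)
  df$Age[missing] <- title_ages[df$Title[missing]]
  df
}

k <- 5
fold_id  <- sample(rep(seq_len(k), length.out = n))  # random fold assignment
accuracy <- numeric(k)

for (i in seq_len(k)) {
  train_rows <- toy[fold_id != i, ]
  test_rows  <- toy[fold_id == i, ]

  # The imputation values come from this fold's training rows only,
  # and are then applied to both the training and the held-out rows
  title_ages <- learn_title_ages(train_rows)
  train_imp  <- apply_title_ages(train_rows, title_ages)
  test_imp   <- apply_title_ages(test_rows, title_ages)

  fit  <- glm(Survived ~ Sex + Age + Fare, data = train_imp, family = binomial)
  prob <- predict(fit, newdata = test_imp, type = "response")
  pred <- ifelse(prob > 0.5, "1", "0")
  accuracy[i] <- mean(pred == as.character(test_imp$Survived))
}

mean(accuracy)  # cross-validated accuracy
```

The key point is that `learn_title_ages` never sees the held-out rows, so the per-title ages used to fill `test_rows` carry no information from them. What I'm after is the same behaviour inside `train`.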