Is there a proper way to apply median imputation by groups in caret?

Asked by maglorismyspiritanimal on 23 February 2024 at 12:19

I'm a beginner in machine learning, and I'm trying to fit a logistic regression on the Titanic data set from Kaggle. I want to impute the Age variable using the titles (Mr, Master, Miss, etc.) contained in the Name variable, so that each missing Age value is replaced by the median age of passengers with the same title.

My problem is that I also want to use cross-validation with the caret R package, and I want to avoid information leakage, so I think the imputation has to be computed separately within each fold. I just can't figure out how to do it.

I can impute the training set the way I want to, but I don't know how to do it inside the preProcess function.
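For context, here is a minimal sketch of what caret's built-in "medianImpute" method does. The toy data frames train_df and new_df are made up for illustration. The method learns one overall median per numeric column from the data passed to preProcess and has no grouping option, which is why it does not cover the per-title case on its own.

```r
library(caret)

# Toy data (hypothetical, for illustration only)
train_df <- data.frame(Age = c(20, NA, 40, 60), Fare = c(10, 20, 30, 40))
new_df   <- data.frame(Age = c(NA, 5),          Fare = c(15, 25))

# Learn the column medians from train_df only
pp <- preProcess(train_df, method = "medianImpute")

# Apply them to new data: the missing Age is filled with the
# overall median of train_df$Age, regardless of any group
imputed <- predict(pp, new_df)
imputed
```

When "medianImpute" is passed through the preProc argument of train, this estimation is repeated on the training portion of each resample, so it avoids leakage but still uses a single ungrouped median.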

This is what I want to happen, but here it is applied to the entire training data:

# Here's some sample data:
set.seed(123)  # make the random sample reproducible
n <- 800

data <- data.frame(
  Pclass = as.factor(sample(1:3, n, replace = TRUE)),
  Survived = as.factor(sample(0:1, n, replace = TRUE)),
  Sex = as.factor(sample(c("male", "female"), n, replace = TRUE)),
  Age = sample(c(6, 9, 26, 40, 45, 58, NA), n, replace = TRUE),
  Fare = runif(n, min = 5, max = 40),
  Embarked = as.factor(sample(c("S", "C", "Q"), n, replace = TRUE)),
  Family = sample(0:2, n, replace = TRUE)
)

data$Title <- ifelse(data$Sex == "female","Miss",
                     ifelse(data$Age > 14 & !is.na(data$Age), "Mr", "Master"))

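# Here's how I impute Age based on Title (median age per title):
library(dplyr)

data <- data %>%
  group_by(Title) %>%
  mutate(Median_age = median(Age, na.rm = TRUE)) %>%
  mutate(Age = ifelse(is.na(Age), Median_age, Age)) %>%
  ungroup() %>%  # drop the grouping so select() can remove Title
  select(-c(Median_age, Title))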

I want the imputation to happen inside the train function so that each fold of the cross-validation gets its own imputation.

library(caret)

# trn is my training data frame (its creation is not shown here)
trctrl <- trainControl(method = "repeatedcv",
                       number = 5,
                       repeats = 5)

fit1 <- train(Survived ~ ., data = trn,
              trControl = trctrl,
              method = "glm",
              family = "binomial",
              preProc = c('center', 'scale', 'nzv'),
              na.action = na.pass)
summary(fit1)
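One way to get the per-fold behaviour described above, without relying on train, is to write the resampling loop by hand. The following is only a sketch under stated assumptions: it uses a reduced version of the sample data, a hypothetical helper impute_age_by_title, caret's createFolds for the splits, and a plain glm in place of train. In each fold the title medians are learned from the training rows only and then applied to both the training and the held-out rows.

```r
library(caret)  # only createFolds() is used here

set.seed(42)
n <- 800
dat <- data.frame(
  Survived = factor(sample(0:1, n, replace = TRUE)),
  Sex      = factor(sample(c("male", "female"), n, replace = TRUE)),
  Age      = sample(c(6, 9, 26, 40, 45, 58, NA), n, replace = TRUE),
  Fare     = runif(n, min = 5, max = 40)
)
dat$Title <- ifelse(dat$Sex == "female", "Miss",
                    ifelse(!is.na(dat$Age) & dat$Age > 14, "Mr", "Master"))

# Hypothetical helper: fill missing Age in `target` with per-Title medians
# learned from `reference` only; fall back to the overall reference median
# when a title is unseen or has no observed ages.
impute_age_by_title <- function(target, reference) {
  med <- vapply(split(reference$Age, reference$Title),
                median, numeric(1), na.rm = TRUE)
  fill <- unname(med[target$Title])
  fill[is.na(fill)] <- median(reference$Age, na.rm = TRUE)
  miss <- is.na(target$Age)
  target$Age[miss] <- fill[miss]
  target
}

# Training-row indices for 5 folds, stratified on the outcome
folds <- createFolds(dat$Survived, k = 5, returnTrain = TRUE)

acc <- sapply(folds, function(train_idx) {
  train_fold <- dat[train_idx, ]
  test_fold  <- dat[-train_idx, ]

  # Medians come from the training rows only, so nothing leaks
  test_fold  <- impute_age_by_title(test_fold,  reference = train_fold)
  train_fold <- impute_age_by_title(train_fold, reference = train_fold)

  fit  <- glm(Survived ~ Sex + Age + Fare, data = train_fold,
              family = binomial)
  prob <- predict(fit, newdata = test_fold, type = "response")
  pred <- ifelse(prob > 0.5, "1", "0")
  mean(pred == as.character(test_fold$Survived))
})

mean(acc)  # cross-validated accuracy
```

Because the sample data are random, the accuracy itself is not meaningful; the point is only where the medians are computed. For repeated cross-validation the loop could be wrapped in an outer loop with different seeds.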
