Using "user defined weights" for an ensemble model


I want to create an ensemble model with "user defined weights". If I create multiple submodels using tidymodels, I want to produce a final model that puts equal weight on each submodel. The stacks package is great for estimating optimal weights, but sometimes I just want to put equal weight on each submodel. stacks is also great because I can use the "stacked" model object with the DALEXtra package to help explain the final ensemble model.

Here is an example of something I'm doing.

## load in packages
library(tidymodels)
library(stacks)
library(DALEXtra)

# get a sample of the ames dataset
set.seed(1)
df <- ames %>% 
  sample_n(500)

# some setup: resampling and a basic recipe
set.seed(1)
df_splits <- initial_split(df)
df_train <- training(df_splits)
df_test  <- testing(df_splits)

set.seed(1)
df_folds <- vfold_cv(df_train, v = 4)

rec_small <- recipe(Sale_Price ~ Gr_Liv_Area, data = df)
rec_big <- recipe(Sale_Price ~ BsmtFin_SF_1 + First_Flr_SF + Second_Flr_SF, data = df)

# setting up my one model type
rand_forest_ranger_spec <-
  rand_forest() %>%
  set_engine('ranger') %>%
  set_mode('regression')

# setting up my one workflow set of my two recipes and one model type
wf_rfs <- 
  workflow_set(
    preproc = list(rec_small,
                   rec_big), 
    models = list(rf = rand_forest_ranger_spec)
    )

# estimating my two random forest models
grid_ctrl <-
  control_grid(
    save_pred = TRUE,
    parallel_over = "everything",
    save_workflow = TRUE
  )

grid_results <-
  wf_rfs %>%
  workflow_map(
    seed = 1503,
    resamples = df_folds,
    control = grid_ctrl
  )

# setting up our stacking
stacks()

df_st <- 
  stacks() %>%
  add_candidates(grid_results)

set.seed(1)
df_model_st <-
  df_st %>%
  blend_predictions()

# looking at final estimated model
df_model_st$equations$numeric
#### I got
#### -42148.1667470673 + (recipe_1_rf_1_1 * 0.13109783287876) + (recipe_2_rf_1_1 * 1.08833216052151)
#### but what I want is something with user-defined values, like
#### 0 + (recipe_1_rf_1_1 * .5) + (recipe_2_rf_1_1 * .5)

I could go on with this stacks model and use DALEXtra to explain the stacked ensemble with some global model explanations, kind of like this:

# Fit an ensemble model using that stacks
df_model_st_fitted <-
  df_model_st %>% 
  fit_members()

# I want to be able to use the cool DALEX tools to explain a user-defined weighted ensemble model
vip_features <- c("Gr_Liv_Area", "BsmtFin_SF_1", "First_Flr_SF", "Second_Flr_SF")

vip_train <- 
  df %>% 
  select(all_of(vip_features))

# Setting up the explainer
explainer_blended_rf <- 
  explain_tidymodels(
    df_model_st_fitted, 
    data = vip_train, 
    y = df$Sale_Price,
    label = "Blended Random Forest",
    verbose = FALSE
  )

# using the explainer to produce a VIP
vip_example <- 
  explainer_blended_rf %>% 
  model_parts() 

plot(vip_example)

# using the explainer to produce accumulated-local (AL) plots
al_rf <- model_profile(
  explainer = explainer_blended_rf,
  type = "accumulated",
  variables = names(vip_train)
)

plot(al_rf) +
  ggtitle("Accumulated-local profiles")

In sum, I love stacks and its ability to both create weights and produce a model object that can be used later like any other tidymodels object. But I don't want the weights created by stacks; I want to supply my own. I don't know whether I should be doing something within stacks to get the weights I want, or whether I should skip stacks entirely because I already know the weights. In the latter case, I don't know how to create an ensemble model object like the one stacks produces, so that I can use it later like a tidymodel.

Answer by jrosell

One approach is to get the predictions from each model manually and average them. If the per-model prediction vectors are stored in a list column (here .pred) of a results tibble with one row per model, their element-wise mean is the equal-weight ensemble prediction.

Something like this:

reduce(results$.pred, \(x, y) x + y) / nrow(results)
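
The same idea extends from a plain mean to arbitrary user-defined weights. The following is a minimal base R sketch, not part of stacks or tidymodels: `weighted_blend()` is a hypothetical helper, and the toy `preds` list merely stands in for a list column such as `results$.pred`.

```r
# Hypothetical helper: combine per-model predictions with user-defined weights.
# `preds` is a list of equal-length numeric vectors, one per submodel.
# By default every submodel gets the same weight (1 / number of models).
weighted_blend <- function(preds,
                           weights = rep(1 / length(preds), length(preds))) {
  stopifnot(length(preds) == length(weights))
  # Scale each prediction vector by its weight, then sum element-wise.
  Reduce(`+`, Map(`*`, preds, weights))
}

# Toy predictions standing in for results$.pred (made-up numbers)
preds <- list(c(100, 200, 300), c(110, 190, 330))

equal_blend  <- weighted_blend(preds)                 # weights 0.5 / 0.5
custom_blend <- weighted_blend(preds, c(0.25, 0.75))  # user-defined weights
```

With equal weights this reproduces the reduce-and-divide line above; the weights are not forced to sum to 1, so that is left to the caller.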

To get variable importances for the ensemble, the vip package lets you supply a custom prediction wrapper.
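
As a rough sketch of what such a wrapper could look like: the code below bundles the submodels and their weights into a plain list and defines a prediction function with an `(object, newdata)` signature. The names `ensemble` and `predict_ensemble()` are hypothetical, and two `lm()` fits on `mtcars` stand in for the fitted workflows so that the example runs without tidymodels. The same function can also be passed as `predict_function` to `DALEX::explain()`, which is shown in a guarded call at the end; treating a plain list as the model object is an assumption of this sketch, not something stacks or DALEXtra provides.

```r
# Two toy submodels standing in for the fitted workflows (assumption: any
# objects with a predict() method could be used the same way).
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_big   <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Bundle the members and their user-defined weights into one object.
ensemble <- list(
  members = list(small = fit_small, big = fit_big),
  weights = c(small = 0.5, big = 0.5)
)

# Hypothetical wrapper: returns a numeric vector of weighted predictions.
predict_ensemble <- function(object, newdata) {
  preds <- lapply(object$members, function(m) {
    p <- predict(m, newdata)
    # tidymodels workflows return a tibble with a .pred column;
    # base R models return a numeric vector.
    if (is.data.frame(p)) p$.pred else as.numeric(p)
  })
  Reduce(`+`, Map(`*`, preds, object$weights))
}

ens_pred <- predict_ensemble(ensemble, mtcars)

# Optional: hand the wrapper to DALEX (skipped if DALEX is not installed).
if (requireNamespace("DALEX", quietly = TRUE)) {
  explainer_ens <- DALEX::explain(
    ensemble,
    data = mtcars[, c("wt", "hp", "disp")],
    y = mtcars$mpg,
    predict_function = predict_ensemble,
    label = "Equal-weight ensemble",
    verbose = FALSE
  )
}
```

In the question's setting, the members would presumably be the two workflows fitted on df_train, and the resulting explainer could then be used with model_parts() and model_profile() as in the question.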