Tuning of GBM model with offset column using h2o and R


I am trying to train two GBM models. The first takes the claim frequency as the response variable; the second takes the number of claims as the response and the exposure as an offset column. However, after hyperparameter tuning I see no difference between the two best models: I get the same RMSE for both.


DF=data[-extreme_ind, ] 
DF[,c(4:60)]<- lapply(DF[,c(4:60)], factor)


df=as.h2o(DF)
splits <- h2o.splitFrame(df, 0.8, seed=1234)  
train <- h2o.assign(splits[[1]], "train.hex")  
valid <- h2o.assign(splits[[2]], "valid.hex") 

MOD_1_v2 <- h2o.gbm(x = c(4:56, 58:60), y = 61,
                    training_frame = train, validation_frame = valid,
                    ntrees = 200) #100
summary(MOD_1_v2)

plot(MOD_1_v2,timestep="number_of_trees",metric="RMSE") 





gbm1_parameters <- list(learn_rate = c(0.01,0.05, 0.1),
                        max_depth = c(3, 5, 6),
                        sample_rate = c(0.7, 0.75, 0.8),  
                        col_sample_rate = c(0.2, 0.5, 1.0))



gbm1_grid <- h2o.grid("gbm", x = c(4:56, 58:60), y = 61,
                      grid_id = "gbm_grid",
                      training_frame = train,
                      validation_frame = valid,  
                      ntrees=20, #30
                      seed = 1,
                      hyper_params = gbm1_parameters)



gbm1_gridp<- h2o.getGrid(grid_id = "gbm_grid",
                         sort_by = "rmse",
                         decreasing  = FALSE)
print(gbm1_gridp)


best_MOD_1=h2o.getModel(gbm1_gridp@model_ids[[1]])

summary(best_MOD_1)




best_gbm_perf1 <- h2o.performance(model = best_MOD_1,newdata = valid)
best_gbm_perf1



plot(best_MOD_1,timestep="number_of_trees",metric="rmse")
h2o.varimp_plot(best_MOD_1)



MOD_2_v2 <- h2o.gbm(x = c(4:56, 58:60), y = 2, offset_column = "APVI",
                    training_frame = train, validation_frame = valid,
                    ntrees = 55)

summary(MOD_2_v2) # after removing outliers

plot(MOD_2_v2,timestep="number_of_trees",metric="RMSE")


gbm2_parameters <- list(learn_rate = c(0.01,0.05, 0.1),
                        max_depth = c(3, 5),
                        sample_rate = c(0.7, 0.75, 0.8),  
                        col_sample_rate = c(0.2, 0.5, 1.0))




gbm2_grid <- h2o.grid("gbm", x = c(4:56, 58:60), y = 2,
                      grid_id = "gbm_grid",
                      training_frame = train,
                      validation_frame = valid, 
                      ntrees=55, #10
                      seed = 123,
                      hyper_params = gbm2_parameters)


gbm2_gridp<- h2o.getGrid(grid_id = "gbm_grid",
                         sort_by = "rmse",
                         decreasing  = FALSE)
print(gbm2_gridp)



best_MOD_2=h2o.getModel(gbm2_gridp@model_ids[[1]])
summary(best_MOD_2)


best_gbm_perf2 <- h2o.performance(model = best_MOD_2,newdata = valid)
best_gbm_perf2

How can I fix this problem?

Answer from Maurever:

Could you also share the printed output, please?

My first guess is that the problem comes from using the same grid_id = "gbm_grid" for both grids; please give the second grid a different ID.

Also, the only difference between your two grid calls is the response column (y = 61 in the first grid, y = 2 in the second). The second h2o.grid call does not set offset_column, so the tuned models are trained without the offset.
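Putting the two suggestions together, the second grid search could look roughly like the sketch below. It assumes that `train`, `valid`, and `gbm2_parameters` already exist as defined in the question; the ID "gbm_grid_offset" is only an illustrative name, and any ID other than "gbm_grid" should do. Passing `offset_column` through `h2o.grid` relies on the grid function forwarding extra arguments to the GBM builder.

```r
library(h2o)
h2o.init()  # connects to the running cluster, or starts one

# Assumes `train`, `valid`, and `gbm2_parameters` exist as in the question.
gbm2_grid <- h2o.grid("gbm", x = c(4:56, 58:60), y = 2,
                      grid_id = "gbm_grid_offset",  # illustrative name, distinct from "gbm_grid"
                      offset_column = "APVI",       # offset now set for the tuned models too
                      training_frame = train,
                      validation_frame = valid,
                      ntrees = 55,
                      seed = 123,
                      hyper_params = gbm2_parameters)

# Fetch the second grid by its own ID, sorted by ascending RMSE.
gbm2_gridp <- h2o.getGrid(grid_id = "gbm_grid_offset",
                          sort_by = "rmse",
                          decreasing = FALSE)

best_MOD_2 <- h2o.getModel(gbm2_gridp@model_ids[[1]])
h2o.rmse(best_MOD_2, valid = TRUE)  # validation RMSE of the best offset model
```

With separate IDs, gbm1_gridp and gbm2_gridp each refer to their own set of models, so the two best models can be compared directly.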

I will also try this with my own generic data to see whether this is the issue.

Thanks!

Edit: I ran your code on different data and got models with different RMSEs, so please check that your data makes sense.