Scaling not all numeric columns in Train and Test data sets of a mixed data frame

179 Views Asked by At

The below code scales Train and Test sets. Since Col6 and Col7 must not be scaled, they were removed from the original data to scale Train and Test sets:

library(tidyverse)

Data_Frame <- data.frame(Col1 = c("A1", "A1", "A1", "A2", "A2", "A2", "A3", "A3", "A3"),
                         
                         Col2 = c("2011-03-11", "2014-08-21", "2016-01-17", "2017-06-30", "2018-07-11", "2018-11-28", "2019-09-04", "2020-02-29", "2020-07-12"),
                         
                         Col3 = c("2018-10-22", "2019-05-24", "2020-12-25", "2018-10-12", "2019-09-24", "2020-12-19", "2018-10-22", "2019-06-14", "2020-12-20"),
                         
                         Col4 = c(4, 12, 2, 1, 4, 4, 75, 4, 44),
                         
                         Col5 = c(7.81, 6.45, 3, 1, 3, 2, 5, 1, 2),
                         
                         Col6 = c(1, 1, 1, 1, 1, 1, 1, 1, 1),
                         
                         Col7 = c(2, 2, 2, 2, 2, 2, 2, 2, 2),
                         
                         Col8 = c(7.77, 6, 8.4, -11.23, 3.5, 7.2, 15, 100, 22.22))

# randomly split data in r
sample_size = floor(0.8*nrow(Data_Frame))
set.seed(777)
picked = sample(seq_len(nrow(Data_Frame)),size = sample_size)
Train_Set = Data_Frame[picked,]
Test_Set = Data_Frame[-picked,]

# Remove columns Col6 and Col7, which will not be scaled
Train <- Train_Set %>% dplyr::select(- c(Col6, Col7))
Test <- Test_Set %>% dplyr::select(- c(Col6, Col7))

# Scale Train, collect mean and sd to scale in Test
Train_Scale <- Train %>% dplyr::mutate_if(is.numeric, ~scale(.) %>% as.vector)
num_cols <- names(which(sapply(Train,is.numeric)))
scale_params <- attributes(scale(Train[,num_cols]))[c("scaled:center","scaled:scale")]

# Scale Test with the scales of Train
Test_Scale <- Test
Test_Scale[,num_cols] = scale(Test_Scale[,num_cols],center=scale_params[[1]],scale=scale_params[[2]]) 

Trying

varnames <- c('Col6', 'Col7')
index <- names(Train_Set) %in% varnames
Train_Scale_Check <- Train_Set[, !index] %>% dplyr::mutate_if(is.numeric, ~scale(.) %>% as.vector) 

works, but removes Col6 and Col7 from the data frame.

And,

Train_Scale_Check <- Train_Set %>% dplyr::mutate_if(is.numeric, !index, ~scale(.) %>% as.vector)

throws the following error:

Error: expecting a one sided formula, a function, or a function name.
Run `rlang::last_error()` to see where the error occurred.

rlang::last_error()
<error/rlang_error>
expecting a one sided formula, a function, or a function name.
Backtrace:
 1. dplyr::mutate_if(...)
 2. dplyr:::manip_if(...)
 3. dplyr:::as_fun_list(.funs, .env, ..., .caller = .caller)
 4. dplyr:::map(...)
 5. base::lapply(.x, .f, ...)
 6. dplyr:::FUN(X[[i]], ...)
Run `rlang::last_trace()` to see the full context.
> rlang::last_trace()
<error/rlang_error>
expecting a one sided formula, a function, or a function name.
Backtrace:
    x
 1. \-dplyr::mutate_if(...)
 2.   \-dplyr:::manip_if(...)
 3.     \-dplyr:::as_fun_list(.funs, .env, ..., .caller = .caller)
 4.       \-dplyr:::map(...)
 5.         \-base::lapply(.x, .f, ...)
 6.           \-dplyr:::FUN(X[[i]], ...)

Is there a simple way to preserve Col6 and Col7 in Train_Set and Test_Set data sets, but not scale them? The lengthy way being extracting the columns Col6 and Col7 as a separate data frame, use the code at the top and cbind the Col6, and Col7 data frames finally.

1

There are 1 best solutions below

0
Ray On

The following resolved the issue (thanks to the sugggestion of @27 ϕ 9)

Scale the training set only at required columns (negelecting Col6 and Col7)

varnames <- c('Col6', 'Col7')
index <- names(Train_Set) %in% varnames
Train_Scale <- Train_Set %>%  mutate(across(where(is.numeric) & -all_of(varnames), ~scale(.) %>% as.vector))

Pick up the scales:

num_cols <- names(which(sapply(subset(Train_Set, select=-c(Col6, Col7)), is.numeric)))
scale_params <- attributes(scale(Train_Set[,num_cols]))[c("scaled:center","scaled:scale")]

Use the scales in the test data:

Test_Scale <- Test_Set
Test_Scale[,num_cols] = scale(Test_Scale[,num_cols],center=scale_params[[1]],scale=scale_params[[2]])