Splitting data into training, test and validation sets by a grouping variable for machine learning


I am trying to split my data into training, test and validation sets. I have two treatment groups, Control and TP, and within each of them a secondary variable called Bio, which takes the values 1-4 in both groups.

I need the split to respect both the treatment group (Control or TP) and Bio as a grouping variable, so that if Control 1 is in the training set, all of the Control 1 samples and all of the TP 1 samples are in the training set as well. My example data below has equal numbers in each Bio grouping (3 samples), but that is not the case in the rest of the data, where the Bio groups have different sizes.

Please see a minimal data set below:

Sample    Treatment Bio  285.945846 286.9638976 288.1004758 288.8109355
Control1_A13   Control   1 0.003535191 0.001777255 0.004729780 0.002364995
Control1_A14   Control   1 0.005063256 0.000110063 0.006249624 0.001041584
Control1_A15   Control   1 0.004262099 0.000836256 0.004277461 0.002699177
Control2_B13   Control   2 0.002411720 0.000466887 0.001129674 0.001109870
Control2_B14   Control   2 0.003085647 0.001831629 0.002482230 0.000000000
Control2_B15   Control   2 0.001996473 0.001060616 0.003995243 0.001369387
Control3_C13   Control   3 0.000299744 0.000851944 0.002808119 0.004065315
Control3_C14   Control   3 0.003187073 0.000591202 0.006833653 0.001713096
Control3_C15   Control   3 0.003692511 0.000262144 0.004673039 0.000126174
Control4_D13   Control   4 0.003369294 0.001087459 0.005171894 0.000675702
Control4_D14   Control   4 0.003818057 0.000838719 0.005513885 0.000458708
Control4_D15   Control   4 0.002572840 0.000257058 0.003537029 0.000009040
LX2+TP1_E1          TP   1 0.003347067 0.001231945 0.008181087 0.004436654
LX2+TP1_E2          TP   1 0.001552547 0.001463769 0.008864838 0.002728083
LX2+TP1_E3          TP   1 0.003224648 0.000812735 0.008518836 0.004303950
LX2+TP2_F1          TP   2 0.001705551 0.000182659 0.000911028 0.000240785
LX2+TP2_F2          TP   2 0.000760944 0.000759464 0.002486596 0.002377735
LX2+TP2_F3          TP   2 0.001034440 0.000647382 0.008146538 0.001028800
LX2+TP3_G1          TP   3 0.003660741 0.001260433 0.008046637 0.003182006
LX2+TP3_G2          TP   3 0.001802459 0.000547580 0.004882082 0.004121552
LX2+TP3_G3          TP   3 0.003590003 0.000089100 0.002801237 0.000403527
LX2+TP4_H1          TP   4 0.002831592 0.001534135 0.009151124 0.003021942
LX2+TP4_H2          TP   4 0.001863099 0.000959953 0.008284829 0.005169246
LX2+TP4_H3          TP   4 0.005649448 0.001959382 0.011814467 0.004110110

I have tried 2 different methods to do this:

  • Method 1
library(caret)  # provides createDataPartition()
set.seed(1234)
inTraining <- createDataPartition(vis_data2$Treatment, p=0.6, list=FALSE)
training.set <- vis_data2[inTraining,]
Totalvalidation.set <- vis_data2[-inTraining,]
# This creates another partition of the remaining 40% of the data: 20% testing and 20% validation
inValidation <- createDataPartition(Totalvalidation.set$Treatment, p=0.5, list=FALSE)
testing.set <- Totalvalidation.set[inValidation,]
validation.set <- Totalvalidation.set[-inValidation,]

However, this doesn't take my second variable, the Bio grouping, into account.

  • Method 2
library(caTools)  # provides sample.split()
set.seed(1)
#Split into training and validation data sets
Y1 = vis_data2[,1] #defining treatment/ variable column 
g1 = vis_data2[,3] #defines group column
final_vis_data <- sample.split(Y1,SplitRatio = 0.5,group = g1)
table(Y1,final_vis_data) #get correct split ratios
split(final_vis_data,g1) #while keeping samples with the same group label together
full_train_set <- vis_data2[ final_vis_data,]
test.set <- vis_data2[!final_vis_data,]

#Split training data set into training and testing data sets
Y2 = full_train_set[,1] #defining treatment/ variable column 
g2 = full_train_set[,3] #defines group column
final_vis_data2 <- sample.split(Y2,SplitRatio = 0.5,group = g2)
table(Y2,final_vis_data2) #get correct split ratios
split(final_vis_data2,g2) #while keeping samples with the same group label together
test.set <- full_train_set[final_vis_data2,1:3]
validation.set <- full_train_set[!final_vis_data2,1:3]

However, when I run this I often get NA values in my validation.index, and when I check the split, the Bio groups have often not been kept together correctly.

How can I get this to work?

Answered by Seth

This answer uses functions from rsample rather than caret's partitioning function. It should help you create an initial split for model fitting.

To demonstrate splitting the held-out data into validation sets as you described, I needed to add some extra groups, so the example data frame df has Bio groups 1 to 10 rather than 1 to 4.
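The code that built df is not shown in this answer. As a minimal sketch, assuming the same Sample, Treatment and Bio columns as in the question, a stand-in could be generated as follows. The single Intensity column and its random values are placeholders invented for illustration, not the real measurements, so the group assignments you get from it will not necessarily match the output further down.

```r
# Hypothetical stand-in for `df`: 2 treatments x 10 Bio groups x 3 replicates.
# Column names follow the question; `Intensity` and its values are made up.
set.seed(42)
df <- expand.grid(
  Replicate = 1:3,
  Bio = 1:10,
  Treatment = c("Control", "TP"),
  stringsAsFactors = FALSE
)
df$Sample <- paste0(df$Treatment, df$Bio, "_", df$Replicate)
df$Intensity <- runif(nrow(df), min = 0, max = 0.006)  # placeholder values
df <- df[, c("Sample", "Treatment", "Bio", "Intensity")]

# Every Treatment/Bio combination should contain 3 rows.
table(df$Treatment, df$Bio)
```

Any data frame with a Bio column and more than a handful of distinct Bio values would serve the same purpose.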

set.seed(123)
library(rsample)

df_split <- group_initial_split(df, group = Bio, prop = 0.6)

df_training <- training(df_split)
df_testing <- testing(df_split)

df_validation <- group_validation_split(df_testing, group = Bio, prop = 0.5)

df_analysis <- analysis(df_validation$splits[[1]])
df_assessment <- assessment(df_validation$splits[[1]])

levels(factor(df_training$Bio))
#> [1] "2"  "3"  "6"  "8"  "9"  "10"
levels(factor(df_testing$Bio))
#> [1] "1" "4" "5" "7"
levels(factor(df_analysis$Bio))
#> [1] "1" "5"
levels(factor(df_assessment$Bio))
#> [1] "4" "7"

Created on 2023-08-17 with reprex v2.0.2
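If you would rather avoid a package dependency, the same idea can be sketched in base R: shuffle the Bio labels, assign whole labels to each set, and then subset the rows. This is only an illustration under stated assumptions, not what rsample does internally. The toy data frame and the 60/20/20 proportions are made up for the example, and with Bio groups of unequal size the row proportions will only be approximate.

```r
# Toy data shaped like the question: 2 treatments x 4 Bio groups x 3 replicates.
toy <- expand.grid(
  Replicate = 1:3,
  Bio = 1:4,
  Treatment = c("Control", "TP"),
  stringsAsFactors = FALSE
)

set.seed(1234)
bio_ids <- sample(unique(toy$Bio))        # shuffle the Bio group labels
n_train <- round(0.6 * length(bio_ids))   # share of groups used for training
n_test  <- round(0.2 * length(bio_ids))   # share of groups used for testing

train_ids <- bio_ids[seq_len(n_train)]
test_ids  <- bio_ids[n_train + seq_len(n_test)]
valid_ids <- setdiff(bio_ids, c(train_ids, test_ids))  # whatever is left

# Subsetting by Bio keeps Control and TP rows with the same Bio value together.
training.set   <- toy[toy$Bio %in% train_ids, ]
testing.set    <- toy[toy$Bio %in% test_ids, ]
validation.set <- toy[toy$Bio %in% valid_ids, ]

# Sanity check: no Bio label appears in more than one set.
stopifnot(
  length(intersect(training.set$Bio, testing.set$Bio)) == 0,
  length(intersect(training.set$Bio, validation.set$Bio)) == 0,
  length(intersect(testing.set$Bio, validation.set$Bio)) == 0
)
```

Because the assignment is made on Bio labels rather than on rows, each set contains both the Control and the TP samples for its Bio groups.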