I have created code that splits data into folds (7 in this case). In effect, I have a list of lists of 7 folds of data.
I now want to go through these and split into training and validation sets within each fold and store these as data frames.
As a newcomer, I have tried manual methods, GroupShuffleSplit.split() and so on, but can't get the output I need. The methods are as follows:
def k_folds(data, k):
    """function that returns a list of k folds of the data"""
    ############################
    len_folds = find_fold_sizes(data, k)
    ############################
    folds = []
    for i in range(k):
        data_ss = data.sample(n=len_folds[i], random_state=20)
        data = data.drop(data_ss.index)
        folds.append(data_ss)
    return folds
(len_folds holds the calculated length of each fold: around 42 or 43 rows in this case, as I am using 300 rows of data.)
k_folds returns a single list containing the 7 folds (indices 0-6).
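For context, find_fold_sizes is not shown here. A helper along the following lines would produce the 42/43 sizes mentioned above. This is only a hypothetical sketch of what such a function might do, not the actual implementation:

```python
def find_fold_sizes(data, k):
    """Hypothetical sketch: split len(data) rows into k near-equal fold sizes."""
    base, extra = divmod(len(data), k)
    # The first `extra` folds take one additional row each.
    return [base + 1 if i < extra else base for i in range(k)]


# Anything with a length works here; range(300) stands in for a 300-row frame.
sizes = find_fold_sizes(range(300), 7)
print(sizes)  # [43, 43, 43, 43, 43, 43, 42]
```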
I am then trying to use code such as
for i, fold in enumerate(folds):
    # Generate the training/testing visualizations for each CV split
    gss = GroupShuffleSplit(n_splits=7, test_size=0.3)
    train_dataset, test_dataset = next(gss.split(X=data, y=data['y'], groups=data.index.values))
to create training and validation sets for each fold. However, this gives me only one of the datasets, and if I try to output a single frame using, for example, train_dataset[1], I just get a number, say 3.
I am an absolute beginner and out of my depth, so please accept my apologies if this is a stupid question, but any advice would be most welcome. Thank you in advance.
On each iteration of the for-loop, the line that assigns train_dataset and test_dataset just redefines them with new sets of indices. So in the end you always get the data for the last fold. If you want to store the indices for each iteration, you can create 2 lists and append each new set of indices to them.
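Here is a minimal sketch of that idea. Note that gss.split() yields arrays of integer row positions rather than data frames, which is why train_dataset[1] prints a single number. The sketch splits each fold (not the full data) and uses .iloc to turn the positions back into data frames. The random data and the way folds is built are hypothetical stand-ins for the question's data and k_folds, and n_splits=1 and random_state=20 are my own choices, since only the first split is consumed by next():

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical stand-in for the question's data: 300 rows with a 'y' column.
rng = np.random.default_rng(0)
data = pd.DataFrame({"x": rng.normal(size=300), "y": rng.integers(0, 2, size=300)})

# Hypothetical stand-in for k_folds(data, 7): 7 disjoint folds of 42 or 43 rows.
shuffled = data.sample(frac=1, random_state=20)
folds = [shuffled.iloc[pos] for pos in np.array_split(np.arange(len(shuffled)), 7)]

train_sets, val_sets = [], []
for fold in folds:
    # One split per fold is enough because only the first one is used.
    gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=20)
    # split() yields positional indices into `fold`, not rows of data.
    train_idx, val_idx = next(
        gss.split(X=fold, y=fold["y"], groups=fold.index.values)
    )
    # Convert the positions back into data frames and keep them.
    train_sets.append(fold.iloc[train_idx])
    val_sets.append(fold.iloc[val_idx])
```

After the loop, train_sets[1] and val_sets[1] are the training and validation data frames for the second fold, and so on for the others.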