Splitting a list of folds into training and validation sets


I have created code that splits data into folds (7 in this case). In effect, I have a list containing 7 folds of the data.

I now want to go through these, split each fold into a training and a validation set, and store these as data frames.

As a newcomer, I have tried manual methods, GroupShuffleSplit's split(), and so on, but can't get the output I need. The code so far is as follows:

def k_folds(data, k):
    """function that returns a list of k folds of the data"""
    
    ############################
    len_folds = find_fold_sizes(data, k)
    ############################

    folds = []
    for i in range(k):
        # take a random sample of the required fold size, then drop those rows
        # so the same row cannot end up in more than one fold
        data_ss = data.sample(n=len_folds[i], random_state=20)
        data = data.drop(data_ss.index)
        folds.append(data_ss)

    return folds 

(len_folds is a calculation of the length of each fold - in this case around 42 or 43, as I am using 300 rows of data.)

This returns a list of 7 folds (0-6) in one big list.
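
For completeness, find_fold_sizes is not shown above; a minimal sketch of the idea (spreading the row count as evenly as possible, so 300 rows over 7 folds gives sizes of 43 or 42) would be something like:

def find_fold_sizes(data, k):
    """Sketch: spread len(data) rows as evenly as possible over k folds."""
    base, remainder = divmod(len(data), k)
    # the first `remainder` folds get one extra row,
    # e.g. 300 rows over 7 folds -> [43, 43, 43, 43, 43, 43, 42]
    return [base + 1 if i < remainder else base for i in range(k)]

folds = k_folds(data, 7)   # list of 7 data frames, each with ~42-43 rows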

I am then trying to use code such as

for i, fold in enumerate(folds):
    # Generate the training/testing split for each fold
    gss = GroupShuffleSplit(n_splits=7, test_size=0.3)
    train_dataset, test_dataset = next(gss.split(X=data, y=data['y'], groups=data.index.values))

to create training and validation sets for each fold. However, this only gives me output for one of the datasets, and if I try to output a single frame using train_dataset[1], for example, I just get a number, say 3.

I am an absolute beginner out of my depth, so please accept my apologies if this is a stupid question, but any advice would be most welcome. Thank you in advance.


There is 1 best solution below.

Maria K

In this line, on each iteration of the for-loop, train_dataset and test_dataset are simply redefined with new sets of indices. So in the end you are only left with the data for the last iteration.

train_dataset,test_dataset = next(gss.split(X=data, y=data['y'], groups=data.index.values))

If you want to store indices for each iteration, you can create 2 lists and append new sets of indices to them.

train_datasets, test_datasets = [], []

for i, fold in enumerate(folds):
    # Generate the training/testing split for each fold
    gss = GroupShuffleSplit(n_splits=7, test_size=0.3)
    train_dataset, test_dataset = next(gss.split(X=data, y=data['y'], groups=data.index.values))
    train_datasets.append(train_dataset)
    test_datasets.append(test_dataset)
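
One possible way to go further, as a rough sketch: gss.split returns arrays of positional indices (which is why indexing into train_dataset just gives a number), so each fold can be split on its own and the indices turned back into data frames with .iloc, giving per-fold training and validation frames as described in the question:

from sklearn.model_selection import GroupShuffleSplit

train_frames, val_frames = [], []

for fold in folds:
    gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=20)
    # split the fold itself; each row's original index value acts as its group
    train_idx, val_idx = next(gss.split(X=fold, y=fold['y'], groups=fold.index.values))
    # the returned values are positional indices, so .iloc maps them back to rows
    train_frames.append(fold.iloc[train_idx])
    val_frames.append(fold.iloc[val_idx])

Each element of train_frames and val_frames is then a data frame containing roughly 70% and 30% of the rows of the corresponding fold.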