Split Train Test Data sets keeping like values together

1k Views Asked by AlmostThere At 01 October 2020 at 21:19

I have a data set of animal types with IDs, and I want to split it into train and test data sets. I also want all rows for a given animal and ID to end up in the same data set, either train or test. Below is an example of the data after a random train/test split with an 80/20 ratio.

Animal  ID  Test/Train
CAT     1   TRAIN
CAT     1   TRAIN
CAT     2   TRAIN
CAT     2   TRAIN
CAT     3   TRAIN
CAT     3   TEST
CAT     4   TRAIN
CAT     4   TRAIN
CAT     5   TEST
CAT     5   TRAIN
DOG     1   TRAIN
DOG     1   TRAIN
DOG     2   TRAIN
DOG     2   TRAIN
DOG     3   TRAIN
DOG     3   TRAIN
DOG     4   TEST
DOG     4   TEST
DOG     5   TRAIN
DOG     5   TRAIN

Note how CAT with ID 3 and CAT with ID 5 each appear in both the train and test data sets. Is there an option in scikit-learn's train_test_split, or a similar function, that keeps all rows sharing the same values in a column within the same train/test data set while maintaining the test ratio? For example, if one CAT record with ID 3 is flagged as train data, every other record with CAT and ID 3 should also be flagged as train data.


There are 2 best solutions below

Aditya Jha On 01 October 2020 at 21:26

Did you set the stratify parameter? If so, try removing it and check again.

Davide Pietrasanta On 30 June 2022 at 08:40

I found a solution to your request, GroupShuffleSplit:

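from sklearn.model_selection import GroupShuffleSplit

# df is the DataFrame from the question. Rows that share a value in
# `groups` are always assigned to the same side of the split.
splitter = GroupShuffleSplit(test_size=0.2, n_splits=1, random_state=7)
split = splitter.split(df, groups=df['ID'])
train_inds, test_inds = next(split)  # take the single generated split

# Select the rows by position
train = df.iloc[train_inds]
test = df.iloc[test_inds]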
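
In the question's data the same ID appears under both CAT and DOG, so grouping on ID alone also ties CAT 1 and DOG 1 together. If each (Animal, ID) pair should be its own group, one option is a combined key. The sketch below is only an illustration: the sample DataFrame is rebuilt from the question's table, and the `groups` key and `overlap` check are names introduced here, not part of the original answer.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Sample data mirroring the question: 2 animals x 5 IDs x 2 rows each.
df = pd.DataFrame({
    "Animal": ["CAT"] * 10 + ["DOG"] * 10,
    "ID": [i for i in range(1, 6) for _ in range(2)] * 2,
})

# Assumed composite key: one group per (Animal, ID) pair,
# so CAT 1 and DOG 1 are treated as different groups.
groups = df["Animal"] + "_" + df["ID"].astype(str)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=7)
train_inds, test_inds = next(splitter.split(df, groups=groups))

train = df.iloc[train_inds]
test = df.iloc[test_inds]

# Sanity check: no group should appear on both sides of the split.
overlap = set(groups.iloc[train_inds]) & set(groups.iloc[test_inds])
print("groups in both sets:", overlap)
print("train rows:", len(train), "test rows:", len(test))
```

Note that with GroupShuffleSplit a float test_size is the proportion of groups, not rows, so the row-level ratio is only approximately 80/20 when groups differ in size.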
