How can 1:1 stratified sampling be performed in Python?
Assume the pandas DataFrame df is heavily imbalanced. It contains a binary group column and multiple columns of categorical sub-groups.
import pandas as pd
from IPython.display import display  # already available inside a notebook

df = pd.DataFrame({'id':[1,2,3,4,5], 'group':[0,1,0,1,0], 'sub_category_1':[1,2,2,1,1], 'sub_category_2':[1,2,2,1,1], 'value':[1,2,3,1,2]})
display(df)
display(df[df.group == 1])
display(df[df.group == 0])
df.group.value_counts()
For each member of the main group==1 I need to find a single match from group==0 with the same sub-category values.
A StratifiedShuffleSplit from scikit-learn will only return a random portion of data, not a 1:1 match.
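If I understood correctly, you could use np.random.permutation to shuffle the group 0 rows within each sub-category combination and keep as many of them as there are group 1 rows.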
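The following is a minimal sketch of how the permutation idea could look, not necessarily the code originally posted. The function name match_one_to_one and the keys list are hypothetical names introduced for illustration, and the explicit size check is an added assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'group': [0, 1, 0, 1, 0],
                   'sub_category_1': [1, 2, 2, 1, 1],
                   'sub_category_2': [1, 2, 2, 1, 1],
                   'value': [1, 2, 3, 1, 2]})

keys = ['sub_category_1', 'sub_category_2']  # columns that define a stratum


def match_one_to_one(df, keys):
    """Hypothetical helper: pair every group 1 row with a distinct group 0 row."""
    cases = df[df['group'] == 1]
    controls = df[df['group'] == 0]
    # Map each sub-category combination to its group 0 rows.
    pools = {key: rows for key, rows in controls.groupby(keys)}

    picked = []
    for key, case_rows in cases.groupby(keys):
        pool = pools[key]  # KeyError if the stratum has no group 0 rows
        if len(pool) < len(case_rows):
            raise ValueError(f'not enough group 0 rows for stratum {key}')
        # Shuffle the positions of the pool and keep one per group 1 row.
        order = np.random.permutation(len(pool))[:len(case_rows)]
        picked.append(pool.iloc[order])

    return pd.concat([cases] + picked)


matched = match_one_to_one(df, keys)
print(matched)
```

Because the match for stratum (1, 1) is drawn at random from ids 1 and 5, the printed rows can differ between runs unless np.random.seed is set first.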
Note that this solution assumes that, for each possible sub_category combination, the number of rows in group 1 is no larger than the number of rows in the corresponding sub-group of group 0.
A more robust version uses np.random.choice with replacement. The version with choice does not rely on that assumption, although it still requires at least one group 0 element for each sub-category combination.
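A sketch of the np.random.choice variant, again under assumptions: the function name match_with_replacement and the small frame df_small are hypothetical and exist only to show a stratum where group 1 outnumbers group 0.

```python
import numpy as np
import pandas as pd

# Hypothetical data: stratum (2, 2) has two group 1 rows but one group 0 row.
df_small = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                         'group': [1, 1, 0, 1, 0],
                         'sub_category_1': [2, 2, 2, 1, 1],
                         'sub_category_2': [2, 2, 2, 1, 1]})

keys = ['sub_category_1', 'sub_category_2']


def match_with_replacement(df, keys):
    """Hypothetical helper: draw one group 0 row per group 1 row, reuse allowed."""
    cases = df[df['group'] == 1]
    controls = df[df['group'] == 0]
    pools = {key: rows for key, rows in controls.groupby(keys)}

    picked = []
    for key, case_rows in cases.groupby(keys):
        pool = pools[key]  # still needs at least one group 0 row per stratum
        # Sample positions with replacement, so a small pool can be reused.
        idx = np.random.choice(len(pool), size=len(case_rows), replace=True)
        picked.append(pool.iloc[idx])

    return pd.concat([cases] + picked)


matched_small = match_with_replacement(df_small, keys)
print(matched_small)
```

In this example the single group 0 row of stratum (2, 2) is matched twice, so the result is balanced in size but contains a repeated row. Whether that is acceptable depends on how the matched sample is used afterwards.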