Drawing a random sub-sample from a df proportionally to categories


I have a dataframe like this

import pandas as pd

names = ["Patient 1", "Patient 2", "Patient 3", "Patient 4", "Patient 5", "Patient 6", "Patient 7"]
categories = ["Internal medicine, Gastroenterology", "Internal medicine, General Med, Endocrinology", "Pediatrics, Medical genetics, Laboratory medicine", "Internal medicine", "Endocrinology", "Pediatrics", "General Med, Laboratory medicine"]

zippedList = list(zip(names, categories))
df = pd.DataFrame(zippedList, columns=['names', 'categories'])

yielding:

print(df)
       names                                         categories
0  Patient 1                Internal medicine, Gastroenterology
1  Patient 2      Internal medicine, General Med, Endocrinology
2  Patient 3  Pediatrics, Medical genetics, Laboratory medicine
3  Patient 4                                  Internal medicine
4  Patient 5                                      Endocrinology
5  Patient 6                                         Pediatrics
6  Patient 7                   General Med, Laboratory medicine

(The real data-frame has >1000 rows)

and counting the categories yields:

print(df['categories'].str.split(", ").explode().value_counts())

Internal medicine      3
General Med            2
Endocrinology          2
Laboratory medicine    2
Pediatrics             2
Gastroenterology       1
Medical genetics       1

I would like to draw a random sub-sample of n rows so that each medical category is proportionally represented. For example, 3 of the 13 category entries (~23%) are "Internal medicine", so roughly 23% of the category entries in the sub-sample should be "Internal medicine" as well. This wouldn't be too hard if each patient had exactly 1 category, but unfortunately they can have several (e.g. Patient 3 has 3 categories). How can I do this?


There is 1 best solution below

Ludovic H On

The fact that your patients have several categories doesn't affect the sub-sampling process. When you draw n rows uniformly at random out of len(df) rows, every row is equally likely to be picked, so the category weights are preserved on average. Any single draw will over- or under-represent some categories by chance, but that deviation shrinks towards 0 as n gets larger.
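The claim that the deviation shrinks with n can be checked empirically. The following is only a rough sketch: the data frame is a hypothetical stand-in built from random category combinations (it is not the asker's data), and the sample sizes and the number of repetitions are arbitrary choices.

```python
import numpy as np
import pandas as pd

def category_shares(frame):
    """Share of each category among all category entries in `frame`."""
    return frame["categories"].str.split(", ").explode().value_counts(normalize=True)

# Hypothetical stand-in data: 1000 patients with 1 to 3 categories each.
rng = np.random.default_rng(42)
pool = ["Internal medicine", "General Med", "Endocrinology", "Laboratory medicine",
        "Pediatrics", "Gastroenterology", "Medical genetics"]
cats = [", ".join(rng.choice(pool, size=rng.integers(1, 4), replace=False))
        for _ in range(1000)]
df = pd.DataFrame({"categories": cats})
full = category_shares(df)

# For each sample size, average the largest share deviation over 30 draws.
mean_dev = {}
for n in (50, 200, 800):
    devs = []
    for seed in range(30):
        sub = category_shares(df.sample(n=n, random_state=seed))
        # reindex so that categories absent from the sub-sample count as share 0
        devs.append((sub.reindex(full.index, fill_value=0) - full).abs().max())
    mean_dev[n] = float(np.mean(devs))

print(mean_dev)
```

The printed dictionary maps each sample size to the average worst-case gap between the sub-sample shares and the full-data shares; larger n should give a smaller gap.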

Typically,

n = 2000  # must not exceed len(df) unless you pass replace=True
df2 = df.sample(n).copy(deep=True)
print(df2['categories'].str.split(", ").explode().value_counts())

should work the way you want.
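To verify a particular draw, one option is to put the category shares of the full data and of the sub-sample side by side. This is a minimal sketch on made-up data; the helper `category_shares`, the 1200-row frame, and the sample size of 300 are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def category_shares(frame, column="categories", sep=", "):
    """Share of each category among all category entries in `frame`."""
    return frame[column].str.split(sep).explode().value_counts(normalize=True)

# Hypothetical stand-in for the real data: 1200 patients, 1 to 3 categories each.
rng = np.random.default_rng(0)
pool = ["Internal medicine", "General Med", "Endocrinology", "Laboratory medicine",
        "Pediatrics", "Gastroenterology", "Medical genetics"]
df = pd.DataFrame({
    "names": [f"Patient {i + 1}" for i in range(1200)],
    "categories": [", ".join(rng.choice(pool, size=rng.integers(1, 4), replace=False))
                   for _ in range(1200)],
})

# Plain uniform sub-sample, fixed seed so the draw is reproducible.
sample = df.sample(n=300, random_state=0)

# Align both share tables on the category name; a category missing from
# the sub-sample gets share 0 instead of NaN.
comparison = pd.DataFrame({
    "full": category_shares(df),
    "sample": category_shares(sample),
}).fillna(0.0)
comparison["abs_diff"] = (comparison["full"] - comparison["sample"]).abs()

print(comparison.sort_values("full", ascending=False))
```

If `abs_diff` is larger than you can tolerate for some category, drawing again with a different seed or increasing n are the simplest remedies.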

I also read that your real data has more than 1000 rows. If it also contains many distinct categories, check them before relying on the sub-sample: rare categories may not be drawn at all and would then disappear from it.
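A quick way to see whether any category dropped out is to compare the sets of categories before and after sampling. The sketch below reuses the example data from the question; the helper name `category_set` and the sample size of 3 are illustrative assumptions.

```python
import pandas as pd

names = ["Patient 1", "Patient 2", "Patient 3", "Patient 4",
         "Patient 5", "Patient 6", "Patient 7"]
categories = ["Internal medicine, Gastroenterology",
              "Internal medicine, General Med, Endocrinology",
              "Pediatrics, Medical genetics, Laboratory medicine",
              "Internal medicine", "Endocrinology", "Pediatrics",
              "General Med, Laboratory medicine"]
df = pd.DataFrame({"names": names, "categories": categories})

def category_set(frame):
    """All distinct categories that occur in `frame`."""
    return set(frame["categories"].str.split(", ").explode())

sample = df.sample(n=3, random_state=0)

# Categories present in the full data but absent from the sub-sample.
missing = category_set(df) - category_set(sample)
print(sorted(missing))
```

An empty list means every category survived the draw; otherwise the listed categories are the ones to watch out for.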