import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})
grouped = df.groupby('b')
Now I want to sample from each group at a different rate, e.g., 30% from group b = 1 and 20% from group b = 0. How should I do that?
If I want 150% for some group, can I do that?
You can return a random sample dataframe with a different percentage of rows for each group. This works for percentages below 100% (see example 1) and, by passing `replace=True`, above 100% (see example 2):

1. Using `np.select`, create a new column `c` that holds the fraction of rows to sample from each group, according to the 20%, 40%, etc. percentage that you set.
2. Sample rows per group based on these fractions. From the sampled rows, take the `.index` and filter the original dataframe with `.loc`, keeping columns `'a'` and `'b'`. The code `grouped.apply(lambda x: x['c'].sample(frac=x['c'].iloc[0]))` creates a multiindex series of the output you are looking for, but it requires some cleanup. That is why I find it easier to grab the `.index` and filter the original dataframe with `.loc` than to clean up the messy multiindex series.

If you would like a larger random sample that uses duplicates of the existing rows, simply pass `replace=True`, then do the same cleanup to get the output.
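
The steps described here can be sketched roughly as follows. This is a minimal illustration, not the exact code from the answer: the helper name `sample_by_group`, its `fracs` and `seed` parameters, and the specific fractions are assumptions chosen so the row counts come out whole. It groups the fraction column `c` directly (rather than the whole dataframe) so that the result is always a series whose last index level holds the original row labels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})


def sample_by_group(df, fracs, seed=0):
    """Hypothetical helper: sample each group of column 'b' at its own fraction.

    fracs maps a group value to a fraction, e.g. {1: 0.5, 0: 1.5}.
    """
    keys = list(fracs)
    # Step 1: build column c, the sampling fraction for each row's group.
    c = pd.Series(
        np.select([df['b'].eq(k) for k in keys],
                  [float(fracs[k]) for k in keys],
                  default=0.0),
        index=df.index, name='c')
    # Step 2: sample c within each group; duplicates are only allowed
    # (replace=True) when the requested fraction exceeds 100%.
    picked = c.groupby(df['b']).apply(
        lambda s: s.sample(frac=s.iloc[0],
                           replace=bool(s.iloc[0] > 1),
                           random_state=seed))
    # Step 3: the last index level carries the original row labels,
    # so use it to filter the original dataframe with .loc.
    rows = picked.index.get_level_values(-1)
    return df.loc[rows, ['a', 'b']]


# Below 100%: all of group b=1 (3 rows) and half of group b=0 (2 rows).
under = sample_by_group(df, {1: 1.0, 0: 0.5})

# Above 100%: 200% of group b=1 and 150% of group b=0 (6 rows each).
over = sample_by_group(df, {1: 2.0, 0: 1.5})
```

Note that `sample` turns `frac` into a whole number of rows, so with groups this small a request such as 30% of three rows yields a single row. Passing a fixed `seed` is only there to make the sketch reproducible; drop it for a fresh sample on each call.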