Pandas sample different fractions for each group after groupby

2.6k Views Asked by double At 21 December 2020 at 03:00

import pandas as pd

df = pd.DataFrame({'a': [1,2,3,4,5,6,7],
                   'b': [1,1,1,0,0,0,0],
})

grouped = df.groupby('b')

Now I want to sample from each group with a different fraction, e.g., 30% from group b = 1 and 20% from group b = 0. How should I do that? And if I want 150% for some group, can I do that?


There are 2 best solutions below

David Erickson On 21 December 2020 at 04:30 BEST ANSWER

You can return a random sample with a different percentage of rows for each group. This works for percentages below 100% (example 1) and, by passing replace=True, for percentages above 100% (example 2):

Using np.select, create a new column c that holds, for each row, the sampling setting for its group: the fraction to sample (20%, 40%, etc.) in example 1, and the number of rows to sample in example 2.
From there, sample each group according to its value in c. Take the .index of the sampled rows and use .loc to select those rows and the columns 'a' and 'b' from the original dataframe. On its own, grouped.apply(lambda x: x['c'].sample(frac=x['c'].iloc[0])) returns a Series of c values that needs cleanup (a MultiIndex Series when the group keys are kept), so I find it easier to grab the .index and filter the original dataframe with .loc.

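import numpy as np
import pandas as pd

# Assumed reconstruction: the outputs below imply a 10-row frame
# (five rows per group) rather than the question's 7-row frame.
df = pd.DataFrame({'a': range(1, 11), 'b': [1]*5 + [0]*5})
# Fraction to sample for each row's group: 40% of b == 0, 20% of b == 1.
df['c'] = np.select([df['b'].eq(0), df['b'].eq(1)], [0.4, 0.2])
grouped = df.groupby('b', group_keys=False)
# Sample within each group, then use the sampled index to filter the original frame.
df.loc[grouped.apply(lambda x: x['c'].sample(frac=x['c'].iloc[0])).index, ['a','b']]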
Out[1]: 
   a  b
6  7  0
8  9  0
3  4  1

If you would like a larger random sample that repeats existing rows, simply pass replace=True. Then do some cleanup to get the output.

v = df['b'].value_counts()
# sample(frac=...) with frac > 1 did not work here, so calculate the
# integer number of rows to sample per group instead.
df['c'] = np.select([df['b'].eq(0), df['b'].eq(1)],
                    [int(v.loc[0] * 1.2), int(v.loc[1] * 2)])
grouped = df.groupby('b', group_keys=False)
# Sample index labels with replacement, then look the rows up in the original frame.
idx = grouped.apply(lambda x: x['c'].sample(n=x['c'].iloc[0], replace=True)).index
df.loc[idx, ['a', 'b']].reset_index(drop=True)
Out[2]: 
     a  b
0    8  0
1    9  0
2   10  0
3    8  0
4    8  0
5    9  0
6    2  1
7    4  1
8    4  1
9    2  1
10   1  1
11   1  1
12   5  1
13   3  1
14   4  1
15   1  1
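
The same idea can also be written without the helper column, by keeping the per-group fractions in a dict and looping over the groups. This is only a minimal sketch, not the method used above: the function name sample_by_group, the fracs dict and the fixed seed are made up for illustration. It converts each fraction to a row count itself and samples with replacement whenever the fraction is above 1, so it does not rely on how a particular pandas version treats frac > 1.

```python
import pandas as pd

def sample_by_group(df, key, fracs, random_state=None):
    """Sample each group of df[key] with its own fraction from fracs.

    Hypothetical helper, not part of pandas. Fractions above 1 are
    drawn with replacement, so rows repeat in the result.
    """
    parts = []
    for value, group in df.groupby(key):
        frac = fracs[value]
        n = round(frac * len(group))  # number of rows to draw from this group
        parts.append(group.sample(n=n, replace=frac > 1, random_state=random_state))
    return pd.concat(parts)

# The frame from the question.
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

# 30% of group b == 1 and 150% of group b == 0.
out = sample_by_group(df, 'b', {1: 0.3, 0: 1.5}, random_state=0)
print(out)
```

With the question's frame this draws 1 row from group b = 1 (0.3 * 3 rounds to 1) and 6 rows from group b = 0 (1.5 * 4 = 6), the latter with repeats. Which rows come back depends on the seed.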

BrenBarn On 21 December 2020 at 04:05

You can get a DataFrame from the GroupBy object with, e.g. grouped.get_group(0). If you want to sample from that you can use the .sample method. For instance grouped.get_group(0).sample(frac=0.2) gives:

   a
5  6

For the example you give, both samples will return only one row, because the groups have 4 and 3 rows and 0.2*4 = 0.8 and 0.3*3 = 0.9 both round to 1.
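
The rounding can be checked directly on the frame from the question. A small sketch, assuming sample turns frac into a row count by rounding frac times the group size, which is the behaviour described above:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})
grouped = df.groupby('b')

# Group 0 has 4 rows and group 1 has 3 rows.
n0 = len(grouped.get_group(0).sample(frac=0.2))  # 0.2 * 4 = 0.8
n1 = len(grouped.get_group(1).sample(frac=0.3))  # 0.3 * 3 = 0.9
print(n0, n1)
```

So on groups this small, 20% and 30% give the same sample size. The fractions only start to differ in effect once the groups are larger.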
