How to Create a Python Cluster Sample by Group ID Without a Lambda Function?


I have a stratified clustered sample with a different number of observations in each stratum. I want a random cluster sample, drawn with replacement from each stratum, where the number sampled is n-1 and n is the number of clusters (stores) in the stratum. I'm bootstrapping, and the lambda-based approach is incredibly slow, so ideally the solution wouldn't use a lambda function.

I have tried using a lambda function, which works, but I'm worried it's too slow to be viable for the bootstrap procedure I am creating. For example:

import pandas as pd

example_data = [
    ['1','A','1',33,23,3,2], ['1','A','2',37,20,3,2], ['1','A','3',30,27,3,2],
    ['1','A','4',36,21,3,2], ['1','A','5',33,23,3,2], ['1','B','1',38,20,3,2],
    ['1','B','2',39,20,3,2], ['1','B','3',33,20,3,2], ['1','B','4',33,23,3,2],
    ['1','C','1',27,25,3,2], ['1','C','2',28,26,3,2], ['2','E','1',38,21,2,1],
    ['2','E','2',39,22,2,1], ['2','F','1',37,21,2,1], ['2','F','2',40,21,2,1],
    ['3','G','1',32,26,4,3], ['3','G','2',32,27,4,3], ['3','H','1',38,28,4,3],
    ['3','H','2',41,28,4,3], ['3','H','3',46,22,4,3], ['3','H','4',44,23,4,3],
    ['3','H','5',44,28,4,3], ['3','H','6',45,30,4,3], ['3','I','1',34,29,4,3],
    ['3','I','2',32,24,4,3], ['3','J','1',25,23,4,3], ['3','J','2',21,26,4,3],
    ['3','J','3',22,27,4,3], ['4','K','1',20,21,1,1], ['4','K','2',24,27,1,1],
    ['4','K','3',20,20,1,1],
]
df_ex = pd.DataFrame(example_data, columns=['strata','store_id','product','weight','size','stores_in_strata','number_stores_to_sample_from_strata'])
df_ex

Then I sample clusters within each stratum with replacement, and merge with the original data to get all product observations for each sampled cluster:

# How many stores to draw from each stratum (n - 1)
samp_dict = dict(zip(df_ex.strata, df_ex.number_stores_to_sample_from_strata))

# Sample within each stratum with replacement -- this groupby/apply is the slow step
sampdf1 = df_ex[['strata', 'store_id', 'stores_in_strata', 'number_stores_to_sample_from_strata']]
sampdf2 = sampdf1.groupby('strata').apply(lambda group: group.sample(samp_dict[group.name], replace=True)).reset_index(drop=True)

# Keep the sampled cluster keys and pull every product row for each sampled cluster
sampdf4 = sampdf2[['strata', 'store_id']]
rth_samp = pd.merge(df_ex, sampdf4, on=['strata', 'store_id'], how='inner')

The real dataset is very large (3-4 million observations), and as currently written each bootstrap loop takes 3-5 minutes. I'd love to hear ideas for speeding this up significantly.
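One direction worth sketching: the `groupby().apply(lambda ...)` can be replaced with a single vectorized NumPy draw over the table of unique clusters, followed by the same inner merge. This is a minimal sketch, not the original method: it assumes a cluster is identified by the `(strata, store_id)` pair and draws uniformly over clusters (n-1 draws per stratum), and it uses a compact stand-in DataFrame in place of `df_ex`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Compact stand-in for df_ex, with only the columns that matter here
df_ex = pd.DataFrame({
    'strata':   ['1','1','1','1','1','2','2','2','3','3','3','3'],
    'store_id': ['A','A','B','B','C','E','F','F','G','H','H','I'],
    'weight':   [33, 37, 38, 39, 27, 38, 37, 40, 32, 38, 41, 34],
})

# One row per cluster, sorted so each stratum forms a contiguous block
stores = (df_ex[['strata', 'store_id']]
          .drop_duplicates()
          .sort_values('strata', kind='stable')
          .reset_index(drop=True))

sizes = stores.groupby('strata').size().to_numpy()     # clusters per stratum
starts = np.concatenate(([0], np.cumsum(sizes)[:-1]))  # block start positions
n_draws = sizes - 1                                    # sample n-1 per stratum

# For every draw, pick a random offset inside its stratum's block (with replacement)
offsets = rng.integers(0, np.repeat(sizes, n_draws))
picks = np.repeat(starts, n_draws) + offsets
sampled = stores.iloc[picks]

# Inner merge pulls every product row for each sampled cluster; a cluster
# drawn twice contributes its rows twice, as a bootstrap requires
rth_samp = df_ex.merge(sampled, on=['strata', 'store_id'], how='inner')
```

The idea is that one `rng.integers` call with an array-valued upper bound replaces the per-group `sample` calls, so the cost no longer scales with the number of strata times Python-level apply overhead.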

Edit: provided minimal working example.
