Bootstrapping multiple random samples with Polars in Python


I have generated a large simulated population as a Polars DataFrame built from NumPy arrays, and I want to draw many random samples from it. However, when I do that, every sample is exactly the same. I suspect the repeat function is the cause. Is there an easy fix, or another way to draw multiple independent random samples?

Here's my code:

from itertools import repeat

import numpy as np
import polars as pl

N = 1000000 # population size
samples = 1000 # number of samples
num_obs = 100 # size of each sample

# Generate population data
a = np.random.gamma(2, 2, N)
b = np.random.binomial(1, 0.6, N)
x = 0.2 * a + 0.5 * b + np.random.normal(0, 10, N)
z = 0.9 * a * b + np.random.normal(0, 10, N)
y = 0.6 * x + 0.9 * z + np.random.normal(0, 10, N)
# Store this in a population dataframe
pop_data_frame = pl.DataFrame({
    'A':a,
    'B':b,
    'X':x,
    'Z':z,
    'Y':y,
    'id':range(1, N+1)
})

# Get 1000 samples from this pop_data_frame...
#... with 100 observations each sample.
sample_list = list(
    repeat(
        pop_data_frame.sample(n=num_obs), samples
    )
)

jqurious On BEST ANSWER

With repeat(), .sample() is called only once, and that single result is then repeated 1000 times.
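
To see why, here is a minimal sketch using only the standard library (the small `population` list and `random.sample` are stand-ins for the dataframe and its `.sample()` method, not part of the original code). The argument to `repeat()` is evaluated before `repeat()` ever runs, so every element is the very same object:

```python
import random
from itertools import repeat

population = list(range(1_000))  # stand-in for the population dataframe

# The argument expression runs once; repeat() then yields that same object.
repeated = list(repeat(random.sample(population, 5), 3))

# A comprehension re-evaluates the expression on every iteration.
independent = [random.sample(population, 5) for _ in range(3)]

all_same_object = all(s is repeated[0] for s in repeated)
distinct_objects = len({id(s) for s in independent})

print(all_same_object)   # every repeated element is one and the same list
print(distinct_objects)  # number of separate list objects from the comprehension
```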

You want to call .sample() 1000 times:

sample_list = [ pop_data_frame.sample(num_obs) for _ in range(samples) ]

Or you could use the Polars lazy API: build a list of LazyFrames and pass it to pl.collect_all(), which should be faster because Polars can run the queries in parallel:

sample_list = pl.collect_all(
    [
        pop_data_frame.lazy().select(
            row = pl.struct(pl.all()).sample(num_obs)
        ).unnest("row")
        for _ in range(samples)
    ]
)