Producing filtered random samples which can be replicated using the same seed


I have $10$ random variables $X_1, X_2, \ldots, X_{10} \sim \mathcal{N}(0,1)$, and I want to generate $5000$ samples of them such that $X_1 < a$ and $X_i > a$ for all $i \in \{2, 3, \ldots, 10\}$.

The problem is that I want this to be replicable given the same initial seed: if the seed supplied is the same, the run should produce identical samples.

I don't know in advance how many samples to generate so that, after filtering, $5000$ good simulations remain (i.e. meet the filtering criteria). I tried running it in blocks of, say, $100,000$, setting the initial seed to some fixed value $s$; if the first block fails to produce enough samples, I have to rerun the simulation for another block, but obviously I can't use the same seed again. I am also not sure whether setting the seed to some other value for each block might introduce unwanted correlations between the samples in different blocks.

Is there a way to achieve this? I also want to point out that performance is important, so if possible please suggest something amenable to parallelization, or something that can be sped up by using numba.

I am doing this in Python.

Thanks a lot.

Answer by DataSciRookie:

To address your requirements, here is a systematic approach that ensures replicability, meets your filtering criteria, and keeps performance efficient.

The key is to manage the random-seed progression so that results are reproducible across runs, while still allowing additional samples to be generated if needed.

First, the seed. To ensure replicability and avoid unwanted correlations, we use a single base seed and derive an independent, deterministic stream from it for each new block of generation:

base_seed = 12345
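The mechanism NumPy (1.17+) provides for exactly this is `SeedSequence`, which deterministically derives independent child streams from one base seed. A minimal sketch of the reproducibility and independence guarantees:

```python
import numpy as np

base_seed = 12345

# The same base seed always spawns the same children, so every block
# is reproducible; different children give independent streams.
children = np.random.SeedSequence(base_seed).spawn(2)
block0 = np.random.default_rng(children[0]).standard_normal(5)
block1 = np.random.default_rng(children[1]).standard_normal(5)

# Re-spawning from the same base seed replays block0 exactly
replay = np.random.default_rng(np.random.SeedSequence(base_seed).spawn(2)[0])
assert np.array_equal(block0, replay.standard_normal(5))
```

This answers the correlation worry directly: `SeedSequence` hashes the (base seed, child index) pair, so sequential block indices do not produce correlated streams the way sequential raw seeds might.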

Next, efficient filtering and sampling. Since we do not know in advance how many samples to generate, we use a loop that generates a block of samples, filters it according to your criteria, and checks whether the desired number has been reached; if not, it continues with the next block's stream.

import numpy as np

def generate_and_filter_samples(seed, a, num_samples_needed, block_size=100_000):
    collected = []
    n_collected = 0
    block = 0

    while n_collected < num_samples_needed:
        # Each block gets its own independent, reproducible stream:
        # default_rng hashes the (seed, block) pair through SeedSequence,
        # so no two blocks ever reuse the same stream.
        rng = np.random.default_rng([seed, block])

        # Generate a block of samples for X_1 to X_10
        samples = rng.standard_normal((block_size, 10))

        # Keep rows with X_1 < a and X_i > a for i = 2, ..., 10
        mask = (samples[:, 0] < a) & np.all(samples[:, 1:] > a, axis=1)
        collected.append(samples[mask])
        n_collected += collected[-1].shape[0]
        block += 1

    return np.concatenate(collected)[:num_samples_needed]

Finally, parallelization and performance. With such a large number of samples and the need for filtering, NumPy is crucial thanks to its efficient vectorized array operations.

While NumPy does not itself run your code on multiple processes or threads, it is highly optimized for vectorized operations, making it very fast at generating and filtering large arrays of random numbers. For true parallelization across blocks of generation, you can manage separate processes with multiprocessing, giving each process its own independent stream. Numba's just-in-time compilation can also speed up the filtering step, but note that its nopython mode does not support the `axis` argument of `np.all`, so the vectorized filter has to be rewritten as an explicit loop before it can be compiled.
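As a sketch of the multiprocessing idea (the names `worker`, `parallel_samples`, and `n_blocks_per_round` are illustrative, not from any library), each round spawns fresh child seeds from one `SeedSequence`, so the whole run is reproducible from the base seed alone:

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def worker(child_seed, a, block_size=100_000):
    # Each worker draws from its own independent, reproducible stream
    rng = np.random.default_rng(child_seed)
    samples = rng.standard_normal((block_size, 10))
    mask = (samples[:, 0] < a) & np.all(samples[:, 1:] > a, axis=1)
    return samples[mask]

def parallel_samples(base_seed, a, num_needed, n_blocks_per_round=8):
    ss = np.random.SeedSequence(base_seed)
    results, n = [], 0
    while n < num_needed:
        # spawn() is stateful: each call deterministically yields the
        # next batch of children, so rounds never reuse a stream.
        children = ss.spawn(n_blocks_per_round)
        with ProcessPoolExecutor() as ex:
            for block in ex.map(worker, children, [a] * n_blocks_per_round):
                results.append(block)
                n += block.shape[0]
    return np.concatenate(results)[:num_needed]
```

Note that the results here depend on `n_blocks_per_round` (it changes which rows land in the first `num_needed`), so treat it as part of the reproducibility contract along with the base seed.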

Here is the final code:
import numpy as np

# Base seed for reproducibility
base_seed = 12345

# Function to generate and filter samples, block by block
def generate_and_filter_samples(seed, a, num_samples_needed, block_size=100_000):
    collected = []
    n_collected = 0
    block = 0

    while n_collected < num_samples_needed:
        # Independent, reproducible stream per block
        rng = np.random.default_rng([seed, block])

        # Generate a block of samples for X_1 to X_10
        samples = rng.standard_normal((block_size, 10))

        # Filter: X_1 < a and X_i > a for i = 2, ..., 10
        mask = (samples[:, 0] < a) & np.all(samples[:, 1:] > a, axis=1)
        collected.append(samples[mask])
        n_collected += collected[-1].shape[0]
        block += 1

    return np.concatenate(collected)[:num_samples_needed]

# Criterion
a = 0.5
num_samples_needed = 5000

# Generate our samples
final_samples = generate_and_filter_samples(base_seed, a, num_samples_needed)
print(final_samples.shape)