I want to code a Monte Carlo simulation to find the optimal batch size m for a given probability p


I have a total number of samples equal to 10^6, but I do not want to test all of these samples individually; rather, I want to determine an optimal batch size m using a Monte Carlo simulation for a given probability value. The probability values range between 10^-1 and 10^-4. The batch-testing scheme works as follows: if a pooled batch of m samples tests negative (no one has COVID), that single test gives the result for all m individuals in the batch; if the batch tests positive (someone has COVID), all m samples in the batch need to be tested individually to find out who is infected. For example, with m = 10, a negative pooled test clears all 10 people at the cost of one test, while a positive pooled test costs 1 + 10 = 11 tests in total. (The batch size is called m here and k in the code below.)

import numpy as np

def simulate_batch_testing(N, p, k):
    infected = np.random.choice([0, 1], size=N, p=[1-p, p])
    
    # Mix k samples in a batch
    mixed_sample = np.sum(np.random.choice(infected, size=k))
    
    if mixed_sample == 0:
        # If mixed sample tests negative, all individuals in the batch are clear
        return k
    else:
        # If mixed sample tests positive, retest all individuals in the batch
        return N

def monte_carlo_simulation(N, p, num_simulations):
    results = []
    
    for k in range(1, N + 1):
        total_tests = 0
        for _ in range(num_simulations):
            total_tests += simulate_batch_testing(N, p, k)
        avg_tests = total_tests / num_simulations
        results.append((k, avg_tests))
    
    return results

# Parameters
N = 10**6  # Number of samples
num_simulations = 1000  # Number of Monte Carlo simulations

# Probability values
p_values = [10**-1, 10**-2, 10**-3, 10**-4]

# Perform simulations for different p values
for p in p_values:
    simulation_results = monte_carlo_simulation(N, p, num_simulations)
    min_avg_tests = min(simulation_results, key=lambda x: x[1])
    
    print(f"For p = {p}, optimal batch size k is {min_avg_tests[0]} with an average of {min_avg_tests[1]:.2f} tests.")


I am not sure about the implementation of this code, as it never seems to terminate, and I am also not sure about the Monte Carlo simulation itself.

Answer from Malcolm:

As discussed in the comments, the primary problem here is that, as written, the simulate_batch_testing function is going to get called len(p_values) * N * num_simulations times, which with the given parameters comes to 4 * 10^6 * 1000 = 4e+09 calls. So even though a single simulate_batch_testing call runs in roughly 0.0126 seconds, the total running time works out to about 5e+07 seconds, which is over a year and a half.

Unfortunately, the simulate_batch_testing function also isn't quite simulating the batch testing correctly. As written, it draws a single random sample of size k (with replacement) out of the population of N and then acts as if every batch would need the same retesting or not, returning either k or N instead of counting the tests batch by batch.

To properly simulate batching the N samples into k sized samples you want something more like

def simulate_batch_testing(N, p, k):
    # create the total population
    population = np.random.choice([0, 1], size=N, p=[1-p, p])
    n_tests = 0
    # check each k sized batch
    for j in range(0, N, k):
        n_tests += 1
        if population[j:j+k].sum() > 0:
            n_tests += k
    return n_tests

which actually ends up being about two times slower than the original version.

I was able to get that back down to around the original time by writing it as

def simulate_batch_testing(N, p, k):
    # pad the population out to an even multiple of k with uninfected people
    padding = -N % k  # zeros needed to reach the next multiple of k (0 if N % k == 0)
    N_padded = N + padding
    n_batches = N_padded // k
    population = np.pad(np.random.choice([0, 1], size=N, p=[1-p, p]), (0, padding))
    # use reshape to group into the batches
    groups = population.reshape(n_batches, k)
    # identify the groups that need retests
    retests = groups.any(axis=1)
    # one pooled test per batch, plus k retests for every positive batch
    return n_batches + retests.sum() * k
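
As a quick sanity check that the reshape version counts the same tests as the loop version, you can seed the RNG so both draw the same population (here I have renamed the loop version to simulate_batch_testing_loop purely for the comparison; the numbers are arbitrary):

np.random.seed(42)
loop_tests = simulate_batch_testing_loop(1000, 0.01, 32)  # the for-loop version above
np.random.seed(42)
vec_tests = simulate_batch_testing(1000, 0.01, 32)        # the reshape version
assert loop_tests == vec_tests, (loop_tests, vec_tests)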

Perhaps some NumPy guru can provide further optimization, but this is the best I have so far, and it still leaves a total simulation that will take far too long. So I would suggest reducing num_simulations to 100 and limiting k to only go up to N // 2, since it doesn't make much sense to consider batches larger than half the entire population. This brings the total calculation down to about a week per p value. The simplest parallelization (assuming you have at least 4 cores to run on) is then to run each of your p values as a separate run, with no multiprocessing code required; see the sketch below. That way you can get your complete simulated answer in a week.
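
A minimal sketch of that one-process-per-p setup, assuming the vectorized simulate_batch_testing above (the simulate.py script name and the command-line handling are illustrative, not from the original code):

import sys

def monte_carlo_simulation(N, p, num_simulations):
    results = []
    for k in range(1, N // 2 + 1):  # batches beyond N // 2 are not worth testing
        total_tests = sum(simulate_batch_testing(N, p, k) for _ in range(num_simulations))
        results.append((k, total_tests / num_simulations))
    return results

# run as e.g. `python simulate.py 0.01`, launching one OS process per p value
p = float(sys.argv[1])
results = monte_carlo_simulation(10**6, p, 100)
k_best, avg_best = min(results, key=lambda r: r[1])
print(f"For p = {p}, optimal batch size k is {k_best} with an average of {avg_best:.2f} tests.")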

If you do have a lot more cores at your disposal, then setting up a multiprocessing.Pool to run a large number of simulate_batch_testing calls in parallel will get your answer even faster.
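
If you go that route, a minimal multiprocessing.Pool sketch might look like the following (average_tests is a hypothetical helper name, and it assumes simulate_batch_testing is defined at module level so the worker processes can pickle it):

from functools import partial
from multiprocessing import Pool

def average_tests(k, N, p, num_simulations):
    # average number of tests over num_simulations runs for one batch size k
    total = sum(simulate_batch_testing(N, p, k) for _ in range(num_simulations))
    return k, total / num_simulations

if __name__ == "__main__":
    N, p, num_simulations = 10**6, 10**-2, 100
    worker = partial(average_tests, N=N, p=p, num_simulations=num_simulations)
    with Pool() as pool:  # defaults to one worker process per CPU core
        results = pool.map(worker, range(1, N // 2 + 1), chunksize=1000)
    k_best, avg_best = min(results, key=lambda r: r[1])
    print(f"For p = {p}, optimal batch size k is {k_best} with an average of {avg_best:.2f} tests.")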