I have a total of 10^6 samples, but I do not want to test all of them individually; instead I want to determine an optimal batch size m using a Monte Carlo simulation, for a given probability value. The probability values are between 10^-1 and 10^-4. Batch testing works as follows: if a batch of m samples tests negative (no one has covid), that single test gives the result for all m individuals in the batch. If the batch tests positive (someone has covid), all samples in the batch need to be tested individually to find out who is infected.
```python
import numpy as np

def simulate_batch_testing(N, p, k):
    infected = np.random.choice([0, 1], size=N, p=[1-p, p])
    # Mix k samples in a batch
    mixed_sample = np.sum(np.random.choice(infected, size=k))
    if mixed_sample == 0:
        # If mixed sample tests negative, all individuals in the batch are clear
        return k
    else:
        # If mixed sample tests positive, retest all individuals in the batch
        return N

def monte_carlo_simulation(N, p, num_simulations):
    results = []
    for k in range(1, N + 1):
        total_tests = 0
        for _ in range(num_simulations):
            total_tests += simulate_batch_testing(N, p, k)
        avg_tests = total_tests / num_simulations
        results.append((k, avg_tests))
    return results

# Parameters
N = 10**6               # Number of samples
num_simulations = 1000  # Number of Monte Carlo simulations

# Probability values
p_values = [10**-1, 10**-2, 10**-3, 10**-4]

# Perform simulations for different p values
for p in p_values:
    simulation_results = monte_carlo_simulation(N, p, num_simulations)
    min_avg_tests = min(simulation_results, key=lambda x: x[1])
    print(f"For p = {p}, optimal batch size k is {min_avg_tests[0]} with an average of {min_avg_tests[1]:.2f} tests.")
```
I am not sure about the implementation of this code, as it never seems to terminate, and I am also not sure about the Monte Carlo simulation itself.
As discussed in the comments, the primary problem here is that, as written, the `simulate_batch_testing` function is going to get called `len(p_values) * N * num_simulations` times, which with the given parameters comes to 4e+09. So even though `simulate_batch_testing` runs in roughly 0.0126 seconds, the overall running time will be over a year and a half.

Unfortunately, the `simulate_batch_testing` function also isn't quite simulating the batch testing correctly. As written, it takes one random `k`-sized sample out of the population `N` and then acts as if all of the `k`-sized batches will require the same retesting or not. To properly simulate batching the `N` samples into `k`-sized batches, you want to loop over every batch and count the tests each one needs, which actually ends up being even two times slower than the original version.
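The answer's code block appears to have been lost in formatting; a per-batch simulation along those lines might look like the following (a sketch of my own, assuming one pooled test per batch plus individual retests whenever a pool is positive — not the answer's original code):

```python
import numpy as np

def simulate_batch_testing(N, p, k):
    # Draw an infection status for every one of the N samples
    infected = np.random.choice([0, 1], size=N, p=[1 - p, p])
    total_tests = 0
    # Partition the N samples into consecutive batches of (at most) k
    for start in range(0, N, k):
        batch = infected[start:start + k]
        total_tests += 1  # one pooled test per batch
        if batch.sum() > 0:
            # Positive pool: every individual in this batch is retested
            total_tests += len(batch)
    return total_tests
```

Unlike the original, this counts the pooled tests themselves and evaluates every batch independently rather than extrapolating from a single random draw.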
I was able to get that back down to around the original running time by rewriting it.
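The faster rewrite was also lost; one plausible way to get there (again a sketch under my own assumptions, not the answer's original code) is to vectorize the whole batch accounting with NumPy, drawing every batch at once as a 2-D array:

```python
import numpy as np

def simulate_batch_testing_fast(N, p, k):
    # Draw all full batches in one (n_batches, k) array of infection flags
    n_batches = N // k  # the remainder batch is ignored for simplicity
    infected = np.random.random((n_batches, k)) < p
    # A batch's pooled test is positive if any of its members is infected
    positive = infected.any(axis=1)
    # One pooled test per batch, plus k retests for each positive batch
    return n_batches + positive.sum() * k
```

The Python-level loop over batches disappears entirely, which is typically where the speedup comes from.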
Perhaps some NumPy guru can provide some further optimization, but this is the best I have so far, and it still leaves a total simulation that will take way too long. So I would suggest reducing `num_simulations` to 100, and limiting `k` to only go up to `N // 2`, as it doesn't make a lot of sense to consider batches larger than half the entire population. This brings the total calculation down to about a week per `p_value`. So the simplest parallelization (assuming you have at least 4 cores to run on) is to run each of your `p_values` as a separate run; no multiprocessing code required. Then you can get your complete simulated answer in a week.

If you do have a lot more cores at your disposal, then setting up a `multiprocessing.Pool` to run a large number of `simulate_batch_testing` calls in parallel will get your answer even faster.
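As a rough illustration of that approach, here is a minimal self-contained sketch using `multiprocessing.Pool` with one worker per probability value (the parameters are deliberately shrunk and the simulation function is a vectorized stand-in, so treat this as a template rather than the full run):

```python
import multiprocessing as mp
import numpy as np

N = 10_000            # reduced from 10**6 so this sketch runs quickly
num_simulations = 10  # reduced from 1000 for the same reason

def simulate_batch_testing(N, p, k):
    # Stand-in batch simulation: one pooled test per batch of k,
    # plus k retests for every batch whose pool is positive.
    n_batches = N // k
    infected = np.random.random((n_batches, k)) < p
    return n_batches + infected.any(axis=1).sum() * k

def run_for_p(p):
    # Sweep a coarse grid of batch sizes for a single probability value
    results = []
    for k in range(1, N // 2 + 1, 100):
        avg = np.mean([simulate_batch_testing(N, p, k)
                       for _ in range(num_simulations)])
        results.append((k, avg))
    return p, min(results, key=lambda x: x[1])

if __name__ == "__main__":
    p_values = [10**-1, 10**-2, 10**-3, 10**-4]
    # Each worker handles one p value; the sweeps are fully independent
    with mp.Pool(processes=len(p_values)) as pool:
        for p, (k_opt, avg) in pool.map(run_for_p, p_values):
            print(f"For p = {p}, optimal batch size k is {k_opt} "
                  f"with an average of {avg:.2f} tests.")
```

Because each `run_for_p` call shares nothing with the others, `pool.map` parallelizes it with no locking or communication beyond collecting the results.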