I'm writing a script that does the following:
- Gets the available VMs and puts them on a stack
- Discovers test files and puts them in a queue
- Runs the tests concurrently, each on its own VM
To achieve the concurrency, I'm using an executor from concurrent.futures, where the maximum number of workers is the initial number of VMs:
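    import concurrent.futures

    # tests is assumed to be a collections.deque of test paths; vms is a list used as a stack
    with concurrent.futures.ProcessPoolExecutor(max_workers=len(vms)) as executor:
        while tests:  # a deque has no .empty(); it is falsy when empty
            test_path = tests.popleft()
            vm = vms.pop()
            executor.submit(run_test, test_path, vm).add_done_callback(test_finished)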
The test_finished callback checks the test result and returns the VM to the pool of available VMs. However, if a test fails, I don't want to return that VM to the pool.
My problem is that the number of VMs is then lower than the executor's maximum number of workers, so the next test that gets queued won't get an available VM. In that case, I want to lower the number of workers, spin up a new VM in the background (to replace the failed one), and then raise the number of workers again.
I think one solution is to dynamically control the number of workers and decrease it when a test fails, so that the next queued test waits for the next available worker.
Another solution is to make the test fetch an available VM itself. If it gets None, it spins up a new VM and waits for the new VM to become available (which takes about 100 seconds).
However, with this approach, I'm not sure how to do the waiting.
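To make the second approach concrete, here is a rough sketch of one possible way to do the waiting. It assumes the VM pool is a thread-safe queue.Queue, whose get() blocks until a VM is put back, and that the workers are threads (a queue.Queue is not shared between processes, so this would not carry over to ProcessPoolExecutor as written). The names worker, spin_up_vm, VM_BOOT_SECONDS, and the stand-in run_test are hypothetical.

```python
import concurrent.futures
import queue
import threading
import time

VM_BOOT_SECONDS = 0.1  # stands in for the ~100 second boot time

vm_pool = queue.Queue()  # thread-safe pool of available VMs
for name in ("vm-1", "vm-2"):
    vm_pool.put(name)

results = {}


def run_test(test_path, vm):
    # Hypothetical stand-in: a test "passes" unless its name contains "fail".
    return "fail" not in test_path


def spin_up_vm():
    # Hypothetical replacement routine: boot a VM, then add it to the pool.
    time.sleep(VM_BOOT_SECONDS)
    vm_pool.put("vm-replacement")


def worker(test_path):
    vm = vm_pool.get()  # this is the wait: blocks while the pool is empty
    passed = run_test(test_path, vm)
    results[test_path] = passed
    if passed:
        vm_pool.put(vm)  # return the healthy VM
    else:
        # Drop the failed VM and boot a replacement in the background.
        threading.Thread(target=spin_up_vm, daemon=True).start()


tests = ["test_a.py", "test_fail.py", "test_b.py", "test_c.py"]
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    list(executor.map(worker, tests))  # consume the iterator to surface errors
```

In this shape the number of workers stays fixed; a worker that finds the pool empty simply blocks in get() until a returned or replacement VM arrives.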
What do you think is my best solution to achieve this?