I have an embarrassingly parallel list of tasks I want to perform. Currently I'm importing the configuration for these tasks as a module.
An example one-line configuration.py:
result_folder = "aFolder"
Up until now I have been calling my function in series instead of in parallel:
def embarassing(x, conf):
    print(x)
    print(conf.result_folder)
    # ... do complicated things and return a value

if __name__ == "__main__":
    import configuration as conf

    x = 1
    y = embarassing(x, conf)
Now I've updated the code to take advantage of running these tasks in parallel:
from dask.distributed import Client

# ...

if __name__ == "__main__":
    import configuration as conf

    client = Client(n_workers=1)
    x = 1
    future = client.submit(embarassing, x, conf)
    y = future.result()
This all works fine. The problem is that sometimes I want to run an ad-hoc set of cases, and until now I could always add
import configuration as conf
x = 2
conf.result_folder = "newFold"
and the code would print
2
newFold
but under the parallel code, it prints
2
aFolder
Why can't I pass this module as an argument any more?
distributed uses pickle (cloudpickle, specifically) to send values to the workers. A module is pickled essentially as just its name, so the worker does a fresh import rather than receiving a snapshot of the module's current state. Your conf.result_folder = "newFold" assignment only mutates the copy in the client process; the worker re-imports configuration.py and sees the original "aFolder".
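You can see this by pickling the module yourself. A minimal sketch, assuming cloudpickle is installed (dask depends on it) and uses its usual pickle-by-reference behaviour for importable modules:

import cloudpickle
import configuration as conf

conf.result_folder = "newFold"
payload = cloudpickle.dumps(conf)

# The payload stores the module by reference (its import name),
# not by value, so the mutated attribute never appears in it.
print(b"configuration" in payload)  # True: the module's name is stored
print(b"newFold" in payload)        # False: the changed attribute is not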
If you want to send information to the worker tasks this way, you will need to pass an ordinary variable (a class instance, a dict, etc.) rather than a module object. Alternatively, you could make a closure, or run the conf.result_folder = "newFold" line on the worker, or ... probably lots of options. Two of these are sketched below.
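A plain object travels through pickle with its state intact. This sketch replaces the module with a hypothetical Config dataclass (Config and the comments are my names, not part of your code):

from dataclasses import dataclass
from dask.distributed import Client

@dataclass
class Config:
    result_folder: str

def embarassing(x, conf):
    print(x)
    print(conf.result_folder)

if __name__ == "__main__":
    client = Client(n_workers=1)
    conf = Config(result_folder="newFold")  # instance state is pickled by value
    future = client.submit(embarassing, 2, conf)
    future.result()  # the worker-side prints show 2 and newFold

To mutate the module on the worker instead, Client.run executes a function once on every worker process; a sketch, with set_result_folder as a hypothetical helper:

def set_result_folder(path):
    import configuration as conf  # the worker's own copy of the module
    conf.result_folder = path

client.run(set_result_folder, "newFold")
# Tasks submitted afterwards that import configuration on the worker
# (including via a module argument) will see result_folder == "newFold".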