Why aren't modifications made to a python module's variables propagating to new parallel processes?


I have an embarrassingly parallel list of tasks I want to perform. Currently I'm importing the configuration for these tasks as a module.

example one-line configuration.py:

result_folder = "aFolder"

Up until now I have been calling my function in series rather than in parallel:

def embarrassing(x, conf):
    print(x)
    print(conf.result_folder)
    # ... do complicated things and return a value

if __name__ == "__main__":
    import configuration as conf
    x = 1
    y = embarrassing(x, conf)

Now I've updated the code to run these tasks in parallel:

from dask.distributed import Client
# ...
if __name__ == "__main__":
    import configuration as conf
    client = Client(n_workers=1)
    x = 1
    future = client.submit(embarrassing, x, conf)
    y = future.result()

This all works fine. The problem is that sometimes I want to run an ad-hoc set of cases, and until now I could simply add

import configuration as conf
x = 2
conf.result_folder = "newFold"

and the code would print

2
newFold

but under the parallel code, it prints

2
aFolder

Why can't I pass this module as an argument any more?

1 Answer

mdurant:

distributed uses pickle to send values to the workers. For a module, the pickled payload is essentially just the module's name, so the worker does a fresh import rather than receiving the current state of the module object from your client process.
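A quick way to see this (assuming cloudpickle is installed, which is what dask uses for serialization) is to inspect the pickled bytes: they contain the module's name but not the values of its attributes.

import cloudpickle
import configuration as conf

conf.result_folder = "newFold"       # change the module in the client process
blob = cloudpickle.dumps(conf)       # modules are pickled "by reference", i.e. by name
print(b"configuration" in blob)      # True  - the module name is in the payload
print(b"newFold" in blob)            # False - the modified attribute is not

When the worker unpickles the argument it simply re-imports configuration, so it sees whatever is in the file on disk, i.e. result_folder = "aFolder".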

If you want to send information to the worker tasks this way, you will need to pass an ordinary object (a class instance, a dict, etc.) rather than a module. Alternatively, you could build a closure over the values you need, or run the conf.result_folder = "newFold" line on the workers themselves; there are probably lots of options.
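As a minimal sketch of the first option (assuming configuration.py only holds simple values, and using types.SimpleNamespace purely for illustration; a dict or a small class would work just as well), copy the module's settings into a plain object so the ad-hoc override travels with the pickled argument:

from types import SimpleNamespace
from dask.distributed import Client
import configuration

def embarrassing(x, conf):
    print(x)
    print(conf.result_folder)
    # ... do complicated things and return a value

if __name__ == "__main__":
    # Copy the module's public attributes into an ordinary object,
    # which pickles by value instead of by module name.
    conf = SimpleNamespace(**{k: v for k, v in vars(configuration).items()
                              if not k.startswith("_")})
    conf.result_folder = "newFold"   # the ad-hoc override now lives on the object

    client = Client(n_workers=1)
    future = client.submit(embarrassing, 2, conf)
    future.result()                  # the worker now sees result_folder == "newFold"

The "run it on the worker" option would instead use client.run to execute a small function on each worker that imports configuration and sets result_folder there, before submitting the tasks.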