I have an ML task with a CPU-bound preprocessing step that can take over an hour. Previously I used the pandarallel library to parallelize this work; its documentation says it uses all available CPUs. Now that I'm running with PyTorch DDP on SLURM, there are multiple (4) processes. I can either have a single process do the preprocessing as before, or split the CSV into 4 partitions and have each DDP process handle one partition in parallel.
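For reference, the splitting I have in mind looks roughly like this (just a sketch; I'm assuming the `RANK`/`WORLD_SIZE` environment variables that torchrun/srun set, and `data.csv` is a placeholder path):

```python
import os

import pandas as pd

# Each DDP process reads the full CSV but keeps only its own shard.
rank = int(os.environ["RANK"])              # set by the torchrun/SLURM launcher
world_size = int(os.environ["WORLD_SIZE"])  # 4 in my case

df = pd.read_csv("data.csv")                # placeholder path
shard = df.iloc[rank::world_size]           # round-robin split into 4 partitions
# ...the CPU-heavy preprocessing then runs on `shard` in each process...
```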
Is this second approach faster, or am I just redundantly parallelizing?
I know that DDP is multiprocess, but I'm not sure how pandarallel interacts with it. I do notice that I need to specify fewer pandarallel workers per DDP process when running under DDP, otherwise I get an OOM error.
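For context, this is roughly how I reduce the per-process worker count (a sketch; `preprocess_row` and `part_0.csv` are placeholders for my actual per-row work and this rank's shard):

```python
import os

import pandas as pd
from pandarallel import pandarallel

NUM_DDP_PROCS = 4  # number of DDP ranks on the node

def preprocess_row(row):
    # placeholder for the actual CPU-heavy per-row work
    return row

# Without DDP, pandarallel defaults to one worker per core. With 4 DDP
# processes on the same node, each would spawn that many workers, so I
# divide the cores among the ranks to avoid the OOM.
n_cores = os.cpu_count() or 1
pandarallel.initialize(nb_workers=max(1, n_cores // NUM_DDP_PROCS),
                       progress_bar=False)

df = pd.read_csv("part_0.csv")  # this rank's partition
out = df.parallel_apply(preprocess_row, axis=1)
```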