Data sharding doesn't seem to work properly with tf.distribute.MultiWorkerMirroredStrategy()

I have a model trained under tf.distribute.MultiWorkerMirroredStrategy() that runs without errors. However, the training time doesn't decrease as expected compared with single-worker training.
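
For reference, here is a minimal sketch of the kind of setup I mean; the file pattern, parse_example, and build_model are placeholders rather than my actual code:

```python
import tensorflow as tf

# TF_CONFIG (cluster spec + this worker's task index) is set in the
# environment on each worker before the strategy is created.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

# File-based input pipeline; tf.distribute is supposed to autoshard it
# across workers (by FILE when the pipeline starts from list_files).
dataset = tf.data.Dataset.list_files("/path/to/data/*.tfrecord", shuffle=False)
dataset = dataset.interleave(tf.data.TFRecordDataset)
dataset = dataset.map(parse_example)  # parse_example: my parsing fn (placeholder)
dataset = dataset.batch(64)

with strategy.scope():
    model = build_model()  # build_model: my model definition (placeholder)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

model.fit(dataset, epochs=10)
```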

I checked some details, and there are two things that make me suspect something is wrong with the autoshard (see the shard-policy sketch after the list):

  1. Each worker is caching all of the data from my data source, instead of only its own shard.

  2. The per-epoch output shows a strange accuracy value of 1.9, which is exactly the sum of the accuracies on the two workers. (Checked with 3 workers, and the accuracy is then close to 3.)
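
My understanding from the distributed input docs is that the shard policy can also be pinned explicitly instead of relying on AUTO. This is a sketch of that configuration (path is a placeholder), not something I've confirmed fixes the issue:

```python
import tensorflow as tf

dataset = tf.data.Dataset.list_files("/path/to/data/*.tfrecord", shuffle=False)

# Pin the autoshard policy explicitly instead of relying on AUTO.
# FILE sharding needs at least as many input files as workers;
# otherwise DATA (shard by element) is the safer choice.
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = (
    tf.data.experimental.AutoShardPolicy.FILE
)
dataset = dataset.with_options(options)
```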

I turned off shuffling in tf.data.Dataset.list_files, as suggested in this tutorial, but the problem remains.
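
Concretely, the tutorial's point is that the filename order must be identical on every worker so that autosharding can hand out disjoint shards; either shuffling is disabled or the seed is fixed. A sketch of both variants (glob pattern is a placeholder):

```python
import tensorflow as tf

# Option A: no filename shuffling, so every worker sees the same order.
files = tf.data.Dataset.list_files("/path/to/data/*.tfrecord", shuffle=False)

# Option B: keep shuffling but fix the seed, which also makes the
# order identical across workers.
files = tf.data.Dataset.list_files("/path/to/data/*.tfrecord",
                                   shuffle=True, seed=42)
```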
