I have a model trained under tf.distribute.MultiWorkerMirroredStrategy(), which runs without errors. However, the training time doesn't decrease as expected compared with single-worker training.
I checked some details and found two main symptoms that make me suspect something is wrong with the autosharding:
Each worker is caching all of the data from my data source.
The per-epoch output shows a strange accuracy value of 1.9, which is exactly the sum of the accuracies on the two workers. (I checked with 3 workers and the accuracy is then close to 3.)
I turned off shuffling when using tf.data.Dataset.list_files, as suggested in this tutorial, but the problem remains.
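For context, here is a minimal sketch of the kind of input pipeline and training setup I'm describing (the file pattern, TFRecord feature spec, and model are placeholders, not my actual code):

```python
import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()

def parse_example(serialized):
    # Placeholder parser: assumes TFRecords with a flat image vector and an int label.
    features = tf.io.parse_single_example(serialized, {
        "image": tf.io.FixedLenFeature([28 * 28], tf.float32),
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    return features["image"], features["label"]

def make_dataset(batch_size):
    # shuffle=False as the tutorial suggests, so every worker lists the files in the
    # same order and autoshard can split them deterministically across workers.
    files = tf.data.Dataset.list_files("/path/to/data/*.tfrecord", shuffle=False)
    ds = tf.data.TFRecordDataset(files)
    ds = ds.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
    # Make the sharding policy explicit instead of relying on AUTO.
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = (
        tf.data.experimental.AutoShardPolicy.FILE
    )
    return ds.with_options(options)

with strategy.scope():
    # Placeholder model just to show where the strategy scope applies.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(28 * 28,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

model.fit(make_dataset(batch_size=64), epochs=10)
```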