I am trying to run the DLRM preprocessing pipeline with Apache Beam (https://github.com/tensorflow/models/tree/master/official/recommendation/ranking/preprocessing). The dataset is the 10 GB Criteo Kaggle dataset, and I used the script shard_balancer.py to split it into 512 shards.
The problem is that when I run the pipeline on my local machine (DirectRunner), performance actually gets worse as I increase direct_num_workers (processing a single one of the 512 shards); the rough setup is sketched below.
Is it necessary to run Apache Beam on Google Cloud Dataflow? Are there optimizations I am missing? Or is this because the pipeline has to read/write the disk multiple times? Thanks.
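For reference, this is a minimal sketch of how I configure the local DirectRunner (the real work is done by the preprocessing script from the repo linked above; the shard filename and the toy transforms here are just placeholders, and `direct_num_workers` is the value I vary):

```python
# Minimal sketch of the local DirectRunner setup (placeholders, not the real pipeline).
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DirectRunner",
    "--direct_num_workers=8",                  # value I vary: 1, 8, 16
    "--direct_running_mode=multi_threading",   # one of: in_memory, multi_threading, multi_processing
])

with beam.Pipeline(options=options) as pipeline:
    # Placeholder transforms; the actual pipeline applies the tf.Transform-based
    # preprocessing from the linked script to one Criteo shard.
    _ = (
        pipeline
        | "Read" >> beam.io.ReadFromText("criteo_shard_000.csv")  # hypothetical shard name
        | "Count" >> beam.combiners.Count.Globally()
    )
```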
I have tried this on both an AMD EPYC 7313 16-Core Processor and an Intel(R) Xeon(R) Gold 6248 CPU. With a single one of the 512 shards as input, the AMD machine takes 65 s (1 thread), 159 s (8 threads), and 357 s (16 threads); the Intel machine takes 93 s (1 thread), 318 s (8 threads), and 626 s (16 threads).
I expected the runtime to improve as the number of threads increases, but adding more threads only makes performance worse.