I have created a Monte-Carlo simulation model in TensorFlow 2.5. The model mostly consists of vector multiplications inside a tf.while_loop. I am benchmarking performance on a Linux machine with 8 virtual CPUs. When I run the model in graph mode (without XLA optimization), it fully utilizes all 8 CPUs (I can see %CPU close to 800% in the top command). However, when I run the model after compiling it with XLA (by passing jit_compile=True to the @tf.function decorator), the %CPU utilization stays close to 250%. Is there a way to force TensorFlow to utilize all available CPU capacity with XLA?
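A stripped-down sketch of what the setup looks like (the real loop body does much more vector math, and the sizes here are made up):

```python
import tensorflow as tf

PATHS = 1_000_000
STEPS = 100

@tf.function(jit_compile=True)  # drop jit_compile=True for the plain graph-mode run
def simulate():
    # Toy Monte-Carlo loop: vector math inside tf.while_loop, as described above.
    def body(i, acc):
        z = tf.random.stateless_normal([PATHS], seed=[i, 0])
        return i + 1, acc + z * z

    _, acc = tf.while_loop(
        lambda i, acc: i < STEPS,
        body,
        (tf.constant(0), tf.zeros([PATHS])),
    )
    return acc / float(STEPS)

print(simulate().numpy().mean())
```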
I have experimented with changing the inter_op_parallelism and intra_op_parallelism settings. While setting both thread settings to 1 reduces the CPU utilization from 250% to 100%, increasing them to 8 doesn't increase the utilization beyond 250%.
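Roughly, this is how the thread settings were changed (a sketch using the tf.config.threading API):

```python
import tensorflow as tf

# Thread-pool sizes must be set before TensorFlow executes any ops;
# changing them later raises a RuntimeError in TF 2.x.
tf.config.threading.set_inter_op_parallelism_threads(8)
tf.config.threading.set_intra_op_parallelism_threads(8)

print(tf.config.threading.get_inter_op_parallelism_threads(),
      tf.config.threading.get_intra_op_parallelism_threads())
```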
Any help or suggestions on what might be going on?
I had the same question. Using the suggestions found here: https://www.tensorflow.org/xla, I modified the JIT compile sequence for my ML model to something like the following.
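A minimal sketch of that sequence, assuming the --xla_dump_to flag described on that page (the model body below is only a placeholder, not my actual model):

```python
import os

# Ask XLA to dump its compilation artifacts; set this before importing
# TensorFlow so the flag is picked up when XLA initializes.
os.environ["XLA_FLAGS"] = "--xla_dump_to=/tmp/dump"

import tensorflow as tf

@tf.function(jit_compile=True)
def model(x):
    # Placeholder computation standing in for the real ML model.
    return tf.reduce_sum(x * x)

# Calling the compiled function triggers the XLA compilation and the dump.
model(tf.random.normal([1024]))
```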
This produces an object (*.o) file in /tmp/dump, which I disassembled with objdump -d. Looking at the disassembly, it appears that the compiler has generated straight-line code for the model and computational kernels rather than calling out to libraries that might support parallel execution. I don't see anything that suggests the possibility of parallel execution of this JIT-ted model, although, like you, I do observe parallel execution when I simply call the model.

However, for me the best performance for this particular model comes from using @tf.function() with jit_compile=False. In this case I observe 'intra_op' parallelism happening, but no 'inter_op' parallelism, which is also what I observe when simply calling the model.
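In other words, the pattern that performs best for me looks like this (again, only a placeholder body):

```python
import tensorflow as tf

@tf.function(jit_compile=False)  # plain graph mode, no XLA
def model(x):
    # Placeholder body; the point is only the decorator settings.
    return tf.reduce_sum(x * x)

model(tf.random.normal([1 << 20]))  # large enough tensors let intra-op parallelism kick in
```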