Description
I’m trying to train a model on a TPU VM v3-32 with TensorFlow 2.14.1. I have already trained this model on a TPU v3-8, and I want to compare the training speed between the v3-8 and the v3-32. However, training on the v3-32 takes longer (about 3h per epoch) than on the v3-8 (about 2h per epoch).
The speed difference persists after several epochs. I've tried different values for the steps_per_execution argument of the compile method, but I don't see any improvement.
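To make explicit what I expected steps_per_execution to change, here is a minimal sketch of the arithmetic. The dataset size, the batch size, and the helper name `host_calls_per_epoch` are hypothetical and not taken from my training job. The only assumption about Keras is that steps_per_execution=N runs N batches in each call issued from the host.

```python
import math


def host_calls_per_epoch(num_examples, global_batch_size, steps_per_execution):
    """Return (steps_per_epoch, host_calls) for one epoch.

    Hypothetical helper: assumes the last partial batch is kept and that each
    execution runs up to `steps_per_execution` training steps.
    """
    steps_per_epoch = math.ceil(num_examples / global_batch_size)
    host_calls = math.ceil(steps_per_epoch / steps_per_execution)
    return steps_per_epoch, host_calls


if __name__ == "__main__":
    # Illustrative numbers only, not my real dataset or batch size.
    for spe in (1, 10, 100):
        steps, calls = host_calls_per_epoch(1_000_000, 1024, spe)
        print(f"steps_per_execution={spe}: {steps} steps, {calls} host calls")
```

If the time per epoch does not change at all while the number of host calls drops this much, the per-call overhead is probably not the bottleneck, which is why I went on to look at the profiler.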
I ran the TensorBoard profiler to investigate what the issue could be and saw that all of the compute is done on the host rather than on the devices (TPU cores). Below are some screenshots from the TensorBoard profiler.
Questions
- Does anyone know if there is anything else I need to configure on the TPU VM?
- How do I make sure I'm actually using all 32 cores of the TPU VM?
Context
Below I describe the steps I followed for setting up the TPU VM v3-32:
- Create the TPU VM v3-32:
gcloud alpha compute tpus tpu-vm create $TPU_NAME --node-id $TPU_NAME --project $PROJECT_ID --zone $ZONE --accelerator-type v3-32 --runtime-version tpu-vm-tf-2.14.1-pod
- Once created, ssh into the TPU VM and set the environment variables:
export TPU_NAME="your-tpu-name"
export TPU_LIBRARY_PATH=/lib/libtpu.so
export TPU_LOAD_LIBRARY=0
export PROJECT_ID="specified-project-id"
export ZONE="specified-zone"
- Install the tensorflow build provided in the TPU VM:
pip install /usr/share/tpu/tensorflow-2.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Install the remaining dependencies from requirements.txt.
- Run the code sample from the documentation to test that the TPU is working. I got the same output, but I also see an additional info log message indicating that no TPU was found, even though the TPUStrategy correctly reports the number of devices:
I tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:28] FindAndLoadTpuLibrary failed with FAILED_PRECONDITION: TPU_LOAD_LIBRARY=0, not loading libtpu. This is expected if TPU is not used.
I tensorflow/compiler/xla/stream_executor/tpu/tpu_platform_interface.cc:73] No TPU platform found.
All TPU devices: []
2024-02-06 10:06:28.415115: I tensorflow/compiler/xla/stream_executor/tpu/tpu_initializer_helper.cc:242] Libtpu path is: /lib/libtpu.so
2024-02-06 10:06:28.455424: I tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:28] FindAndLoadTpuLibrary failed with FAILED_PRECONDITION: TPU_LOAD_LIBRARY=0, not loading libtpu. This is expected if TPU is not used.
2024-02-06 10:06:28.458234: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Tensorflow version 2.14.1
2024-02-06 10:06:32.318797: I tensorflow/compiler/xla/stream_executor/tpu/tpu_platform_interface.cc:76] No TPU platform registered. Waiting 1 second and trying again... (4 tries left)
2024-02-06 10:06:33.318914: I tensorflow/compiler/xla/stream_executor/tpu/tpu_platform_interface.cc:76] No TPU platform registered. Waiting 1 second and trying again... (3 tries left)
2024-02-06 10:06:34.319049: I tensorflow/compiler/xla/stream_executor/tpu/tpu_platform_interface.cc:76] No TPU platform registered. Waiting 1 second and trying again... (2 tries left)
2024-02-06 10:06:35.319189: I tensorflow/compiler/xla/stream_executor/tpu/tpu_platform_interface.cc:76] No TPU platform registered. Waiting 1 second and trying again... (1 tries left)
2024-02-06 10:06:36.319326: I tensorflow/compiler/xla/stream_executor/tpu/tpu_platform_interface.cc:73] No TPU platform found.
All TPU devices: []
2024-02-06 10:06:36.730063: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:457] Started server with target: grpc://localhost:56786
Number of devices: 32
PerReplica:{
0: tf.Tensor(2.0, shape=(), dtype=float32),
1: tf.Tensor(2.0, shape=(), dtype=float32),
2: tf.Tensor(2.0, shape=(), dtype=float32),
3: tf.Tensor(2.0, shape=(), dtype=float32),
4: tf.Tensor(2.0, shape=(), dtype=float32),
5: tf.Tensor(2.0, shape=(), dtype=float32),
6: tf.Tensor(2.0, shape=(), dtype=float32),
7: tf.Tensor(2.0, shape=(), dtype=float32),
8: tf.Tensor(2.0, shape=(), dtype=float32),
9: tf.Tensor(2.0, shape=(), dtype=float32),
10: tf.Tensor(2.0, shape=(), dtype=float32),
11: tf.Tensor(2.0, shape=(), dtype=float32),
12: tf.Tensor(2.0, shape=(), dtype=float32),
13: tf.Tensor(2.0, shape=(), dtype=float32),
14: tf.Tensor(2.0, shape=(), dtype=float32),
15: tf.Tensor(2.0, shape=(), dtype=float32),
16: tf.Tensor(2.0, shape=(), dtype=float32),
17: tf.Tensor(2.0, shape=(), dtype=float32),
18: tf.Tensor(2.0, shape=(), dtype=float32),
19: tf.Tensor(2.0, shape=(), dtype=float32),
20: tf.Tensor(2.0, shape=(), dtype=float32),
21: tf.Tensor(2.0, shape=(), dtype=float32),
22: tf.Tensor(2.0, shape=(), dtype=float32),
23: tf.Tensor(2.0, shape=(), dtype=float32),
24: tf.Tensor(2.0, shape=(), dtype=float32),
25: tf.Tensor(2.0, shape=(), dtype=float32),
26: tf.Tensor(2.0, shape=(), dtype=float32),
27: tf.Tensor(2.0, shape=(), dtype=float32),
28: tf.Tensor(2.0, shape=(), dtype=float32),
29: tf.Tensor(2.0, shape=(), dtype=float32),
30: tf.Tensor(2.0, shape=(), dtype=float32),
31: tf.Tensor(2.0, shape=(), dtype=float32)
}
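The mismatch in this output (no local TPU platform found, yet 32 devices reported) is easy to miss in a long log. Below is a minimal sketch for flagging it, assuming the output has been saved as text. The function name `summarize_log` and the dictionary keys are hypothetical; the patterns are copied from the messages above. It only reports which messages are present and does not say whether they indicate a problem.

```python
import re

# Message fragments copied verbatim from the log output above.
PATTERNS = {
    "libtpu_not_loaded": re.compile(r"TPU_LOAD_LIBRARY=0, not loading libtpu"),
    "no_tpu_platform": re.compile(r"No TPU platform found"),
    "empty_device_list": re.compile(r"All TPU devices: \[\]"),
}
NUM_DEVICES = re.compile(r"Number of devices: (\d+)")


def summarize_log(text):
    """Return which known messages appear and the reported device count."""
    summary = {name: bool(p.search(text)) for name, p in PATTERNS.items()}
    match = NUM_DEVICES.search(text)
    # None means the "Number of devices" line was not found at all.
    summary["num_devices"] = int(match.group(1)) if match else None
    return summary


if __name__ == "__main__":
    # Shortened sample built from lines of the output above.
    sample = (
        "FindAndLoadTpuLibrary failed with FAILED_PRECONDITION: "
        "TPU_LOAD_LIBRARY=0, not loading libtpu.\n"
        "No TPU platform found.\n"
        "All TPU devices: []\n"
        "Number of devices: 32\n"
    )
    print(summarize_log(sample))
```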