All compute done on the host rather than on the devices when training on TPU v3-32 with TensorFlow 2.14.1

Asked by Ricardo on 06 February 2024 at 10:24 (41 views)

Description

I'm trying to train a model on a TPU VM v3-32 with TensorFlow 2.14.1. I have already trained this model on a TPU v3-8, and I want to compare the training speed between the two. However, training on the v3-32 takes longer (3h per epoch) than on the v3-8 (~2h per epoch).
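To put the two epoch times in perspective, here is a minimal sketch of the scaling arithmetic. It assumes that linear scaling with the number of cores is the ideal case and treats "~2h" as exactly 2 hours; the function name scaling_efficiency is made up for illustration.

```python
def scaling_efficiency(cores_small, hours_small, cores_large, hours_large):
    """Return (observed speedup, ideal speedup, efficiency).

    Assumption: the ideal speedup is linear in the number of cores.
    """
    observed = hours_small / hours_large  # > 1 means the larger slice is faster
    ideal = cores_large / cores_small     # 32 / 8 for v3-32 vs v3-8
    return observed, ideal, observed / ideal


# Epoch times reported above: ~2h on v3-8, 3h on v3-32.
observed, ideal, efficiency = scaling_efficiency(8, 2.0, 32, 3.0)
print(f"observed speedup: {observed:.2f}x, ideal: {ideal:.0f}x, "
      f"efficiency: {efficiency:.0%}")
```

An observed speedup below 1 means the larger slice is slower in absolute terms, not merely scaling poorly.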

The speed difference persists after several epochs. I've tried different values for the steps_per_execution argument of the compile method, but I don't see any improvement.

I ran the TensorBoard profiler to investigate and saw that all compute is done on the host rather than on the devices (TPU cores). Below are some screenshots from the TensorBoard profiler.

Questions

Does anyone know if there is anything else I need to configure on the TPU VM?
How do I make sure I'm using all 32 cores of the TPU VM?

Context

Below I describe the steps I followed for setting up the TPU VM v3-32:

Create the TPU VM v3-32:

gcloud alpha compute tpus tpu-vm create $TPU_NAME --node-id $TPU_NAME --project $PROJECT_ID --zone $ZONE --accelerator-type v3-32 --runtime-version tpu-vm-tf-2.14.1-pod

Once created, ssh into the TPU VM and set the environment variables:

export TPU_NAME="your-tpu-name"
export TPU_LIBRARY_PATH=/lib/libtpu.so
export TPU_LOAD_LIBRARY=0
export PROJECT_ID="specified-project-id"
export ZONE="specified-zone"
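To confirm that the Python process actually sees these settings, something like the following could be run on the TPU VM. This is only a sketch: the helper name tpu_env_report is hypothetical, and the note about TPU_LOAD_LIBRARY simply mirrors the "not loading libtpu" log line shown further down. It does not establish whether that setting is related to the slowdown.

```python
import os

# Variable names taken from the export commands above.
TPU_ENV_VARS = ("TPU_NAME", "TPU_LIBRARY_PATH", "TPU_LOAD_LIBRARY",
                "PROJECT_ID", "ZONE")


def tpu_env_report(env=None):
    """Return (settings, notes) for the TPU-related environment variables."""
    env = os.environ if env is None else env
    settings = {name: env.get(name) for name in TPU_ENV_VARS}
    notes = [f"{name} is not set in this process"
             for name, value in settings.items() if value is None]
    if settings["TPU_LOAD_LIBRARY"] == "0":
        # Matches the log message: "TPU_LOAD_LIBRARY=0, not loading libtpu".
        notes.append("TPU_LOAD_LIBRARY=0: libtpu is not loaded in this process")
    return settings, notes


# Example with the placeholder values from the export commands above.
example_env = {
    "TPU_NAME": "your-tpu-name",
    "TPU_LIBRARY_PATH": "/lib/libtpu.so",
    "TPU_LOAD_LIBRARY": "0",
    "PROJECT_ID": "specified-project-id",
    "ZONE": "specified-zone",
}
settings, notes = tpu_env_report(example_env)
for note in notes:
    print(note)
```

Calling tpu_env_report() with no argument inspects the real environment of the current process instead of the example dictionary.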

Install the TensorFlow build provided on the TPU VM:

pip install /usr/share/tpu/tensorflow-2.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

Install remaining dependencies from requirements.txt.
Run the code sample from the documentation to test that the TPU is working. I got the same output, but I also see an additional info log message indicating that no TPU was found, although the TPUStrategy correctly reports the number of devices.

I tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:28] FindAndLoadTpuLibrary failed with FAILED_PRECONDITION: TPU_LOAD_LIBRARY=0, not loading libtpu. This is expected if TPU is not used.
I tensorflow/compiler/xla/stream_executor/tpu/tpu_platform_interface.cc:73] No TPU platform found.
All TPU devices: []

2024-02-06 10:06:28.415115: I tensorflow/compiler/xla/stream_executor/tpu/tpu_initializer_helper.cc:242] Libtpu path is: /lib/libtpu.so
2024-02-06 10:06:28.455424: I tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:28] FindAndLoadTpuLibrary failed with FAILED_PRECONDITION: TPU_LOAD_LIBRARY=0, not loading libtpu. This is expected if TPU is not used.
2024-02-06 10:06:28.458234: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Tensorflow version 2.14.1
2024-02-06 10:06:32.318797: I tensorflow/compiler/xla/stream_executor/tpu/tpu_platform_interface.cc:76] No TPU platform registered. Waiting 1 second and trying again... (4 tries left)
2024-02-06 10:06:33.318914: I tensorflow/compiler/xla/stream_executor/tpu/tpu_platform_interface.cc:76] No TPU platform registered. Waiting 1 second and trying again... (3 tries left)
2024-02-06 10:06:34.319049: I tensorflow/compiler/xla/stream_executor/tpu/tpu_platform_interface.cc:76] No TPU platform registered. Waiting 1 second and trying again... (2 tries left)
2024-02-06 10:06:35.319189: I tensorflow/compiler/xla/stream_executor/tpu/tpu_platform_interface.cc:76] No TPU platform registered. Waiting 1 second and trying again... (1 tries left)
2024-02-06 10:06:36.319326: I tensorflow/compiler/xla/stream_executor/tpu/tpu_platform_interface.cc:73] No TPU platform found.
All TPU devices: []
2024-02-06 10:06:36.730063: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:457] Started server with target: grpc://localhost:56786
Number of devices: 32
PerReplica:{
  0: tf.Tensor(2.0, shape=(), dtype=float32),
  1: tf.Tensor(2.0, shape=(), dtype=float32),
  2: tf.Tensor(2.0, shape=(), dtype=float32),
  3: tf.Tensor(2.0, shape=(), dtype=float32),
  4: tf.Tensor(2.0, shape=(), dtype=float32),
  5: tf.Tensor(2.0, shape=(), dtype=float32),
  6: tf.Tensor(2.0, shape=(), dtype=float32),
  7: tf.Tensor(2.0, shape=(), dtype=float32),
  8: tf.Tensor(2.0, shape=(), dtype=float32),
  9: tf.Tensor(2.0, shape=(), dtype=float32),
  10: tf.Tensor(2.0, shape=(), dtype=float32),
  11: tf.Tensor(2.0, shape=(), dtype=float32),
  12: tf.Tensor(2.0, shape=(), dtype=float32),
  13: tf.Tensor(2.0, shape=(), dtype=float32),
  14: tf.Tensor(2.0, shape=(), dtype=float32),
  15: tf.Tensor(2.0, shape=(), dtype=float32),
  16: tf.Tensor(2.0, shape=(), dtype=float32),
  17: tf.Tensor(2.0, shape=(), dtype=float32),
  18: tf.Tensor(2.0, shape=(), dtype=float32),
  19: tf.Tensor(2.0, shape=(), dtype=float32),
  20: tf.Tensor(2.0, shape=(), dtype=float32),
  21: tf.Tensor(2.0, shape=(), dtype=float32),
  22: tf.Tensor(2.0, shape=(), dtype=float32),
  23: tf.Tensor(2.0, shape=(), dtype=float32),
  24: tf.Tensor(2.0, shape=(), dtype=float32),
  25: tf.Tensor(2.0, shape=(), dtype=float32),
  26: tf.Tensor(2.0, shape=(), dtype=float32),
  27: tf.Tensor(2.0, shape=(), dtype=float32),
  28: tf.Tensor(2.0, shape=(), dtype=float32),
  29: tf.Tensor(2.0, shape=(), dtype=float32),
  30: tf.Tensor(2.0, shape=(), dtype=float32),
  31: tf.Tensor(2.0, shape=(), dtype=float32)
}
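The mismatch in this output (an empty TPU device list next to 32 reported devices) can be pulled out of a saved log with a few lines of Python. This is a rough sketch: the function name summarize_tpu_log is hypothetical, and the patterns are tied to the exact wording of the lines above, so they may not match other TensorFlow versions.

```python
import re


def summarize_tpu_log(log_text):
    """Extract the device-related facts from a log like the one above."""
    listed = re.findall(r"^All TPU devices: \[(.*)\]\s*$", log_text,
                        flags=re.MULTILINE)
    count = re.search(r"^Number of devices: (\d+)\s*$", log_text,
                      flags=re.MULTILINE)
    return {
        # True only if the last "All TPU devices" list is non-empty.
        "tpu_devices_listed": bool(listed and listed[-1].strip()),
        # Device count as printed by the test script, or None if absent.
        "num_devices": int(count.group(1)) if count else None,
        "libtpu_not_loaded": "not loading libtpu" in log_text,
    }


# Excerpt of the output above.
sample_log = """\
I tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:28] FindAndLoadTpuLibrary failed with FAILED_PRECONDITION: TPU_LOAD_LIBRARY=0, not loading libtpu. This is expected if TPU is not used.
All TPU devices: []
Number of devices: 32
"""
summary = summarize_tpu_log(sample_log)
print(summary)
```

For the excerpt, the summary shows that no TPU devices were listed while the device count is 32, which is the inconsistency described above.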
