My TensorFlow training is slow. How do I profile to find the hotspots?


Sometimes jobs run slowly, and it would be nice to be able to profile them as they run to find the hotspots. How can that be done in TensorFlow, and especially on Google Cloud Machine Learning Engine?

1 Answer

rhaertel80:

ProfilerHook will output timeline traces that can be visualized in Chrome's built-in trace viewer.

First, add a ProfilerHook when you instantiate Experiment:

import tensorflow as tf
from tensorflow.contrib import hooks

# Write a timeline file every 100 steps into the job directory
# (args.job_dir comes from your trainer's command-line arguments).
profiler_hook = hooks.ProfilerHook(save_steps=100, output_dir=args.job_dir)
experiment = tf.contrib.learn.Experiment(
    estimator=estimator,
    ...,  # your other Experiment arguments
    train_monitors=[profiler_hook])
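
If you are using a plain Estimator rather than Experiment, the same hook (also available as tf.train.ProfilerHook) can be attached through the hooks argument of train(). A minimal sketch, assuming estimator and train_input_fn are defined elsewhere in your trainer:

import tensorflow as tf

# Write a Chrome-trace timeline every 100 steps to the job directory.
profiler_hook = tf.train.ProfilerHook(save_steps=100, output_dir=args.job_dir)

# estimator and train_input_fn are assumed to exist in your trainer code.
estimator.train(input_fn=train_input_fn, hooks=[profiler_hook], max_steps=10000)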

Next, run your job as normal. While the job is running or after it is complete, copy the timelines to your local disk, e.g.

mkdir /tmp/timelines
gsutil -m cp gs://my-bucket/my-job/timeline*.json /tmp/timelines
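
The gs://my-bucket/my-job path above is simply the job directory the hook writes to; on Cloud ML Engine that value usually comes from the --job-dir flag you pass when submitting the job, which the service forwards to your trainer (and ends up in args.job_dir). A hedged example submission, with the job name, module/package paths, and region as placeholders:

gcloud ml-engine jobs submit training my_profiled_job \
    --module-name trainer.task \
    --package-path trainer/ \
    --job-dir gs://my-bucket/my-job \
    --region us-central1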

Now, open Chrome and enter chrome://tracing in the address bar.

Hit the Load button and select one of the timeline JSON files you copied.

Look for "bars" on the graph that take a long time. Click on them to get more information.
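
If you want a quick textual summary of the hotspots before digging into the UI, the timelines are ordinary Chrome trace JSON, so a few lines of Python can rank the ops by total time. A rough sketch (the file name is a placeholder; durations are in microseconds):

import json

# Load one of the timeline files copied above (file name is a placeholder).
with open("/tmp/timelines/timeline-100.json") as f:
    trace = json.load(f)

# Events with a "dur" field are completed ops; sum durations per op name.
totals = {}
for event in trace.get("traceEvents", []):
    if "dur" in event:
        name = event.get("name", "?")
        totals[name] = totals.get(name, 0) + event["dur"]

# Print the ten most expensive ops.
for name, dur in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print("%10d us  %s" % (dur, name))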