I have a frozen model and 4 GPUs. I would like to run inference on as much data as possible, as fast as possible. Essentially I want data parallelism: the same model performs inference on 4 batches at once, one batch per GPU.
This is roughly what I am trying to do:
import numpy as np
import tensorflow as tf

def return_ops():
    # load the frozen graph
    with tf.Graph().as_default() as graph:
        od_graph_def = tf.GraphDef()
        with tf.gfile.GFile(model_path, 'rb') as fid:
            serialized_graph = fid.read()
            od_graph_def.ParseFromString(serialized_graph)
            tf.import_graph_def(od_graph_def, name='')
    # fetch the same input/output tensors once per GPU
    inputs = []
    outputs = []
    with graph.as_default() as g:
        for gpu in ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']:
            with tf.device(gpu):
                image_tensor = g.get_tensor_by_name('input:0')
                get_embeddings = g.get_tensor_by_name('embeddings:0')
                inputs.append(image_tensor)
                outputs.append(get_embeddings)
    return inputs, outputs, g
However, when I run
# sample batch
x = np.ones((100, 160, 160, 3))
# get ops
image_tensor_list, emb_list, graph = return_ops()
# construct feed dict
feed_dict = {it: x for it in image_tensor_list}
# run the ops
with tf.Session(graph=graph, config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    inf = sess.run(emb_list, feed_dict=feed_dict)
everything runs on /gpu:0 when I inspect with nvidia-smi.
I can, however, run
with tf.device("/gpu:1"):
    t = tf.range(1000)

with tf.Session() as sess:
    sess.run(t)
and there is activity on the second GPU.
How can I implement this data parallelism task properly?
I learned that device placement has to happen when the graph_def is imported; fetching tensors from an already-imported graph does not move them to another device. The code below returns ops that I can then run with sess.run([output1, ..., outputk], feed_dict). It places all operations on the GPU, which is not ideal, so I pass allow_soft_placement=True in the session config so that ops without a GPU kernel fall back to the CPU.
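For reference, here is a minimal sketch of that approach, assuming TensorFlow 1.x, that the frozen graph exposes input:0 and embeddings:0 as above, and that it lives at a hypothetical model_path; the tower_* prefixes are only illustrative. The graph_def is imported once per device so that the tf.device context is in effect while the ops are created:

import numpy as np
import tensorflow as tf

model_path = 'frozen_model.pb'  # hypothetical path to the frozen graph

def return_ops_per_gpu(num_gpus=4):
    # Read the serialized graph_def once; it will be imported once per GPU.
    od_graph_def = tf.GraphDef()
    with tf.gfile.GFile(model_path, 'rb') as fid:
        od_graph_def.ParseFromString(fid.read())

    inputs, outputs = [], []
    graph = tf.Graph()
    with graph.as_default():
        for i in range(num_gpus):
            # The device context is active while the ops are created,
            # so each imported copy is pinned to its own GPU.
            with tf.device('/gpu:%d' % i):
                tf.import_graph_def(od_graph_def, name='tower_%d' % i)
            inputs.append(graph.get_tensor_by_name('tower_%d/input:0' % i))
            outputs.append(graph.get_tensor_by_name('tower_%d/embeddings:0' % i))
    return inputs, outputs, graph

# Feed each tower its own batch so a single sess.run drives all GPUs.
inputs, outputs, graph = return_ops_per_gpu()
batches = [np.ones((25, 160, 160, 3), dtype=np.float32) for _ in inputs]
feed_dict = dict(zip(inputs, batches))

config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(graph=graph, config=config) as sess:
    embeddings = sess.run(outputs, feed_dict=feed_dict)

Because everything is frozen into constants, the four copies share no variables, so a single sess.run over all four output tensors lets the towers execute in parallel, and allow_soft_placement=True lets any op without a GPU kernel fall back to the CPU.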