Can somebody explain to me the shape and value of TensorFlow's GradientTape output when the target is not a scalar value? For example, I had the following code:
import tensorflow as tf
a = tf.Variable([[-1.], [0.], [1.]])
b = tf.Variable([[1.,2.,3.],[4.,5.,6.]])
with tf.GradientTape() as g:
    c = b @ a
grads = g.gradient(c, a)
print(c)
print(grads)
The value of c is [[2.],[2.]]. The value of grads is [[5.],[7.],[9.]].
I expected grads to have shape (3, 2) or (2, 3) and to contain the partial derivatives of each entry of c with respect to each entry of a. I am not sure what the values 5, 7, and 9 represent (interestingly, they seem to be the gradients as if c had been tf.reduce_sum(b @ a) instead).
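Here is what I mean by the reduce_sum comparison (just checking my guess with a second tape, reusing a and b from above):
with tf.GradientTape() as g2:
    c_sum = tf.reduce_sum(b @ a)  # scalar target this time
print(g2.gradient(c_sum, a))  # prints [[5.], [7.], [9.]] -- the same as grads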
The documentation that I found doesn't really explain the output.
Because your target is non-scalar, the underlying object is really the Jacobian matrix of partial derivatives, not a single gradient vector.
GradientTape.gradient accumulates (i.e. sums) these partial derivatives over the output dimensions, so the result has the same shape as the variable itself. That way, when we apply the gradients, we can subtract them from the variable's values directly.
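For example, a plain gradient-descent step is then just an element-wise subtraction (a minimal sketch with an arbitrary learning rate, reusing a and grads from the question):
learning_rate = 0.1
a.assign_sub(learning_rate * grads)  # a <- a - 0.1 * [[5.], [7.], [9.]]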
If you don't want this to happen, or you just want to see how the gradients are accumulated, you can ask for the full Jacobian instead (in TF 2.7 and up).
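A minimal sketch, assuming GradientTape.jacobian is the intended call here (it returns every partial derivative before any summing):
import tensorflow as tf
a = tf.Variable([[-1.], [0.], [1.]])
b = tf.Variable([[1., 2., 3.], [4., 5., 6.]])
with tf.GradientTape() as g:
    c = b @ a  # shape (2, 1)
# Full Jacobian: one partial derivative per (output, input) pair,
# so its shape is c.shape + a.shape = (2, 1, 3, 1).
j = g.jacobian(c, a)
print(tf.squeeze(j))  # [[1. 2. 3.], [4. 5. 6.]]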
Here you end up with a gradient tensor like:
[[1, 2, 3], [4, 5, 6]] (I dropped the singleton dimensions), which is more like what you expected. What the gradient tape then does by default is sum over the output dimensions, which gives us [5, 7, 9].
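You can check that relationship directly by summing the Jacobian from the sketch above over its output axes:
print(tf.reduce_sum(j, axis=[0, 1]))  # [[5.], [7.], [9.]] -- same as g.gradient(c, a)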