In a standard ANN, fully connected layers use the following formula: `tf.matmul(X, weight) + bias`. That is clear to me: we use matrix multiplication to connect the input with the hidden layer.
But the GloVe implementation (https://nlp.stanford.edu/projects/glove/) uses the following formula to multiply the embeddings: `tf.matmul(W, tf.transpose(U))`. What confuses me is the `tf.transpose(U)` part.
Why do we use tf.matmul(W, tf.transpose(U)) instead of tf.matmul(W, U)?
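To make the shapes concrete, here is a minimal sketch of both calls; the sizes (10 inputs, 20 outputs, a single example) are placeholders I picked, not the actual GloVe dimensions:

```python
import tensorflow as tf

# Standard fully connected layer: input times weights, plus bias.
X = tf.random.normal([1, 10])
weight = tf.random.normal([10, 20])
bias = tf.zeros([20])
hidden = tf.matmul(X, weight) + bias      # shape (1, 20)

# GloVe-style multiplication: why the transpose on U?
W = tf.random.normal([20, 10])
U = tf.random.normal([1, 10])
scores = tf.matmul(W, tf.transpose(U))    # shape (20, 1)
```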
It has to do with the choice of column vs row orientation for the vectors.
Note that `weight` is the second parameter here:

`tf.matmul(X, weight) + bias`

But `W` is the first parameter here:

`tf.matmul(W, tf.transpose(U))`

So what you are seeing is a practical application of the following matrix transpose identity:

(AB)^T = B^T A^T

In other words, X·W equals (W^T·X^T)^T, so reversing the order of the multiplication transposes both the weights and the input.
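A quick numerical check of that identity in TensorFlow, with shapes chosen purely for illustration:

```python
# Verify (A B)^T == B^T A^T numerically; the shapes are arbitrary.
import tensorflow as tf

A = tf.random.normal([1, 10])
B = tf.random.normal([10, 20])

lhs = tf.transpose(tf.matmul(A, B))                # (20, 1)
rhs = tf.matmul(tf.transpose(B), tf.transpose(A))  # (20, 1)

print(bool(tf.reduce_all(tf.abs(lhs - rhs) < 1e-5)))  # True
```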
To bring it back to your example, let's assume 10 inputs and 20 outputs.
The first approach uses row vectors. A single input `X` would be a `1x10` matrix, called a row vector because it has a single row. To match, the `weight` matrix needs to be `10x20` to produce an output of size 20.
But in the second approach the multiplication is reversed. That is a hint that everything is using column vectors: if the multiplication is reversed, then everything gets a transpose. So this example is using column vectors, so named because they have a single column.
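As a shape check for the row-vector convention, assuming the 10-input/20-output sizes from above:

```python
# Row-vector convention: a 1x10 input needs a 10x20 weight matrix
# to produce 20 outputs. Sizes follow the example in the text.
import tensorflow as tf

X = tf.random.normal([1, 10])
weight = tf.random.normal([10, 20])

print(tf.matmul(X, weight).shape)  # (1, 20)

# A 20x10 weight matrix would not line up in this orientation:
# tf.matmul(X, tf.random.normal([20, 10]))  # raises InvalidArgumentError
```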
That's why the transpose is there. Given how the GloVe authors have written their notation, with the multiplication reversed, the weight matrix `W` must already be transposed to `20x10` instead of `10x20`, and they must be expecting a `20x1` column vector for the output.
So if the input vector `U` is naturally a `1x10` row vector, it also has to be transposed, to a `10x1` column vector, to fit in with everything else.
Basically, you should pick row vectors or column vectors and stick with that choice throughout; the order of the multiplications and the transposition of the weights is then determined for you.
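Here is a sketch of the column-vector version, showing it produces the same numbers as the row-vector version, just transposed. The sizes are again the illustrative 10-in/20-out ones, not taken from the GloVe code:

```python
# Column-vector convention, as in the GloVe-style call: the weight
# matrix is stored pre-transposed (20x10) and the 1x10 input U is
# transposed on the fly.
import tensorflow as tf

U = tf.random.normal([1, 10])        # input as a 1x10 row vector
weight = tf.random.normal([10, 20])  # row-vector-convention weights

row_out = tf.matmul(U, weight)           # 1x20 row-vector output
W = tf.transpose(weight)                 # 20x10, stored pre-transposed
col_out = tf.matmul(W, tf.transpose(U))  # 20x1 column-vector output

# Same numbers, different orientation:
print(bool(tf.reduce_all(tf.abs(col_out - tf.transpose(row_out)) < 1e-5)))
```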
Personally I think that column vectors, as used by GloVe, are awkward and unnatural compared to row vectors. It's better to have the multiplication ordering follow the data flow ordering.