What's the exact input size in MultiHead-Attention of BERT?


I recently started learning about BERT.

Some tutorials show that after embedding a sentence, a matrix X of shape [seq_len, 768] is formed, and X is fed into multi-head attention, i.e., multiple self-attention heads.

But in FasterTransformer, why is the input shaped [seq_len, head_num, size_per_head]? It looks as if the matrix X is split evenly across the heads, with each head receiving only its slice rather than the complete matrix X (see the sketch below).
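
To make the shapes concrete, here is a small PyTorch sketch of what I think is happening. The head_num and size_per_head values are the ones for BERT-base (768 = 12 × 64), and the projection weight W_Q is just a random placeholder standing in for BERT's learned weights, not the real ones:

```python
import torch

seq_len, hidden_size = 128, 768
head_num, size_per_head = 12, 64   # hidden_size == head_num * size_per_head

X = torch.randn(seq_len, hidden_size)   # output of the embedding layer, [seq_len, 768]

# Project X to queries (the same idea applies to keys and values);
# W_Q here is a random stand-in for BERT's learned query projection.
W_Q = torch.randn(hidden_size, hidden_size)
Q = X @ W_Q                              # [seq_len, 768]

# The per-head layout appears to be just a reshaped view of the same numbers:
Q_heads = Q.view(seq_len, head_num, size_per_head)   # [seq_len, 12, 64]

print(Q.shape, Q_heads.shape)
```

So my question is whether the [seq_len, head_num, size_per_head] input in FasterTransformer is really just this kind of reshape of the full X, or something different.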

So what is the real input?
