I have a question about word embeddings from transformer encoder models. Let's create word embeddings using the BERT model:
Word 1: "cat" (em1)
Word 2: "dog" (em2)
Word 3: "driver" (em3)
Word 4: "lion" (em4)
Now let's take the cosine similarity scores (the scores below are not real, just for the sake of an example):
cs(em1, em2) = 0.90
cs(em1, em3) = 0.70
cs(em1, em4) = 0.73
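Here is a minimal sketch of how I obtain these embeddings and scores. It assumes `bert-base-uncased` and mean pooling over the last hidden state; other pooling choices give me the same behavior.

```python
# Minimal sketch: BERT word embeddings + cosine similarity.
# Assumes bert-base-uncased and mean pooling over the last hidden state.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(word: str) -> torch.Tensor:
    # Tokenize the single word and mean-pool its 768-dim token vectors.
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)            # (768,)

em1, em2, em3, em4 = (embed(w) for w in ["cat", "dog", "driver", "lion"])

def cs(a: torch.Tensor, b: torch.Tensor) -> float:
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

print(cs(em1, em2), cs(em1, em3), cs(em1, em4))
```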
The cosine scores always lie between 0.70 and 1.0. They never go below 0.70, not even for unrelated pairs such as "cat" and "driver".
After applying PCA and reducing the dimensionality from 768 to 50 (which removes some of the variance in the vectors), the scores do drop below 0.70.
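This is roughly the PCA step I apply. It reuses the hypothetical `embed` helper from the sketch above, and `vocabulary` stands in for whatever list of words I embed (PCA needs at least 50 rows to produce 50 components).

```python
# Rough sketch of the PCA step: 768 -> 50 dimensions, then cosine similarity
# in the reduced space. `vocabulary` is a placeholder list of words.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

embeddings = np.stack([embed(w).numpy() for w in vocabulary])  # (n_words, 768)
reduced = PCA(n_components=50).fit_transform(embeddings)       # (n_words, 50)

# Compare "cat" (row 0) against the other words in the reduced space.
print(cosine_similarity(reduced[[0]], reduced[1:]))
```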
My question is: high variance should preserve a lot of information about the word, right? Yet before PCA the cosine scores always lie between 0.70 and 1.0. Can anyone please tell me why this happens, and why the scores drop below 0.70 once I reduce the dimensionality? This occurs not only with BERT, but with every transformer-based model I have tried.
Thanks in advance for your help!