The input to transformers is essentially a sequence of tokens, each represented as a one-hot vector. These vectors are then multiplied by an embedding matrix E to produce the input embeddings X. The embedding matrix is a parameter learned during training. In mathematical terms, this can be written as X = E * I, where I stands for the input one-hot vectors.
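For intuition, here is a tiny, self-contained sketch (toy sizes, not taken from any real model) showing why multiplying a one-hot vector by E amounts to looking up a single row of E:

```python
import torch

# toy sizes, purely illustrative
vocab_size, d_model = 6, 4
E = torch.randn(vocab_size, d_model)   # embedding matrix, learned during training

token_id = 2
one_hot = torch.zeros(vocab_size)
one_hot[token_id] = 1.0

# the matrix product zeroes out every row of E except row `token_id`,
# which is why the embedding layer can be implemented as a plain look-up table
# (row-vector convention here; X = E * I with column one-hots is the transposed view)
print(torch.equal(one_hot @ E, E[token_id]))   # True
```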
So if the embedding layer just acts as a look-up table that grabs a learned vector representation of each token, how can the embedding for the word "left" have two different representations in the embedding space for the sentence below?
"I left my phone on the left side of the table."
The general dataflow of a transformer is as follows:
A string is passed to a tokenizer, which converts it to a numerical sequence that, in the context of the transformers library, we call input_ids. These are passed to the embedding layer of the respective model to retrieve the token embeddings (or word embeddings), which are afterward used inside the attention layers to compute the contextualized token embeddings. Let's use your given example with an actual transformer called DistilBert (most transformer architectures follow the same principle).
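Below is a minimal sketch, assuming the distilbert-base-uncased checkpoint and the Hugging Face transformers API; the exact printed output may vary slightly between library versions:

```python
from transformers import AutoModel, AutoTokenizer

# load the tokenizer and the model (distilbert-base-uncased assumed here)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

# inspect the embedding part of the model
print(model.embeddings)
```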
Output:
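```
Embeddings(
  (word_embeddings): Embedding(30522, 768, padding_idx=0)
  (position_embeddings): Embedding(512, 768)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)
```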
You can see that the
Embeddings module contains 4 layers. The word_embeddings layer is a look-up table for the token embeddings: it maps the respective id from the sequence produced by the tokenizer to its embedding vector. The position_embeddings layer is similar but contains vectors that encode the position (i.e. the first position has a different vector than the 230th position). It is required because the attention mechanism does not have a sense of the order of the tokens. The remaining layers are a
LayerNorm and a Dropout layer, but they are not relevant to answering OP's question and will be skipped in the following. In the following code, you can see that the token_embeddings of the two "left"s (index 2 and 7) are identical, but that the position_embeddings differ.
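A sketch of that comparison, reusing the tokenizer and model loaded above (the token indices 2 and 7 assume the distilbert-base-uncased tokenizer, which prepends a [CLS] token):

```python
import torch

inputs = tokenizer("I left my phone on the left side of the table.", return_tensors="pt")
input_ids = inputs["input_ids"]          # shape: (1, sequence_length)

# token embeddings: a pure look-up, independent of position
token_embeddings = model.embeddings.word_embeddings(input_ids)

# position embeddings: one vector per position in the sequence
position_ids = torch.arange(input_ids.shape[1])
position_embeddings = model.embeddings.position_embeddings(position_ids)

# the two "left" tokens sit at indices 2 and 7: [CLS] i left my phone on the left ...
print(torch.equal(token_embeddings[0, 2], token_embeddings[0, 7]))
print(torch.equal(position_embeddings[2], position_embeddings[7]))
```

Output:

```
True
False
```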
Summary:
It's the positional encoding that differentiates the two occurrences, effectively creating two distinct representations in the embedding space. Also, due to the attention mechanism, while processing the first "left", the model might attend more to words like "phone" and "I", whereas for the second "left" it might focus on "side" and "table". This mechanism effectively creates different contextual representations for the same word based on its surroundings.
Therefore, while the token embedding layer itself might simply be a lookup table, the combination of positional encodings and the attention mechanism allows contextual variations in word representations within a single sentence. This enables the model to capture the nuanced meaning of words based on their specific usage.
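To see the combined effect, one can also compare the contextualized embeddings that the full model produces for the two occurrences of "left" (again reusing the model and inputs from above):

```python
# contextualized embeddings from the full model differ for the two "left"s
with torch.no_grad():
    last_hidden_state = model(**inputs).last_hidden_state   # shape: (1, seq_len, 768)

print(torch.equal(last_hidden_state[0, 2], last_hidden_state[0, 7]))   # expected: False
```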