The input to transformers is essentially a sequence of tokens, each represented as a one-hot vector. These vectors are then multiplied by an embedding matrix E to produce the input embeddings X. The embedding matrix is a parameter learned during training. In mathematical terms, this can be written as X = E * I, where I stands for the input one-hot vectors.
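For intuition, here is a tiny, self-contained sketch (toy sizes, not taken from any real model) showing why multiplying a one-hot vector by E amounts to looking up a single row of E:

```python
import torch

# toy sizes, purely illustrative
vocab_size, d_model = 6, 4
E = torch.randn(vocab_size, d_model)   # embedding matrix, learned during training

token_id = 2
one_hot = torch.zeros(vocab_size)
one_hot[token_id] = 1.0

# the matrix product zeroes out every row of E except row `token_id`,
# which is why the embedding layer can be implemented as a plain look-up table
# (row-vector convention here; X = E * I with column one-hots is the transposed view)
print(torch.equal(one_hot @ E, E[token_id]))   # True
```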
So if the embedding layer just acts as a look-up table that grabs a learned vector representation of each token, how can the embedding for the word "left" have two different representations in the embedding space for the sentence below?
"I left my phone on the left side of the table."
The general dataflow of a transformer is as follows:
A string is passed to a tokenizer, which converts it to a numerical sequence that, in the context of the transformers library, we call input_ids. These are passed to the embedding layer of the respective model to retrieve the token embeddings (or word embeddings), which are afterward used inside the attention layers to compute the contextualized token embeddings. Let's use your given example with an actual transformer called DistilBert (most transformer architectures follow the same principle).
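Below is a minimal sketch, assuming the distilbert-base-uncased checkpoint and the Hugging Face transformers API; the exact printed output may vary slightly between library versions:

```python
from transformers import AutoModel, AutoTokenizer

# load the tokenizer and the model (distilbert-base-uncased assumed here)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

# inspect the embedding part of the model
print(model.embeddings)
```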
Output:
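```
Embeddings(
  (word_embeddings): Embedding(30522, 768, padding_idx=0)
  (position_embeddings): Embedding(512, 768)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)
```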
You can see that the
Embeddings module contains 4 layers. The word_embeddings layer is a look-up table for the token embeddings: it maps the respective id from the sequence produced by the tokenizer to its embedding vector. The position_embeddings layer is similar but contains vectors that encode the position (i.e. the first position has a different vector than the 230th position). It is required because the attention mechanism does not have a sense of the order of the tokens. The remaining layers are a
LayerNorm and a Dropout layer, but they are not relevant to answering OP's question and will be skipped in the following. In the following code, you can see that the token_embeddings of the two "left"s (index 2 and 7) are identical, but that the position_embeddings differ.
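A sketch of that comparison, reusing the tokenizer and model loaded above (the token indices 2 and 7 assume the distilbert-base-uncased tokenizer, which prepends a [CLS] token):

```python
import torch

inputs = tokenizer("I left my phone on the left side of the table.", return_tensors="pt")
input_ids = inputs["input_ids"]          # shape: (1, sequence_length)

# token embeddings: a pure look-up, independent of position
token_embeddings = model.embeddings.word_embeddings(input_ids)

# position embeddings: one vector per position in the sequence
position_ids = torch.arange(input_ids.shape[1])
position_embeddings = model.embeddings.position_embeddings(position_ids)

# the two "left" tokens sit at indices 2 and 7: [CLS] i left my phone on the left ...
print(torch.equal(token_embeddings[0, 2], token_embeddings[0, 7]))
print(torch.equal(position_embeddings[2], position_embeddings[7]))
```

Output:

```
True
False
```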
Summary:
It's the positional encoding that differentiates the two occurrences, effectively creating two distinct representations in the embedding space. Also, due to the attention mechanism, while processing the first "left", the model might attend more to words like "phone" and "I", whereas for the second "left" it might focus on "side" and "table". This mechanism effectively creates different contextual representations for the same word based on its surroundings.
Therefore, while the token embedding layer itself might simply be a lookup table, the combination of positional encodings and the attention mechanism allows contextual variations in word representations within a single sentence. This enables the model to capture the nuanced meaning of words based on their specific usage.
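To see the combined effect, one can also compare the contextualized embeddings that the full model produces for the two occurrences of "left" (again reusing the model and inputs from above):

```python
# contextualized embeddings from the full model differ for the two "left"s
with torch.no_grad():
    last_hidden_state = model(**inputs).last_hidden_state   # shape: (1, seq_len, 768)

print(torch.equal(last_hidden_state[0, 2], last_hidden_state[0, 7]))   # expected: False
```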