I am new to Natural Language Processing and am currently working on machine translation with the ALMA-7B model from Hugging Face. I want to create a custom tokenizer based on the tokens in my Word2Vec vocabulary, and I also have their corresponding embeddings (weights). I am adding the tokens to the tokenizer with the following code:
```python
alma_tokenizer.add_tokens(word_chunks)
```
where `alma_tokenizer` is the tokenizer for the ALMA-7B model and `word_chunks` is a list of words I want to add. I also want to update the model with the corresponding word embeddings, and I was advised to use the `resize_token_embeddings()` function of `AutoModelForCausalLM`:

```python
embeddings = model.resize_token_embeddings(len(tokenizer))
```

When I used it, it did create new embeddings for the tokens I had added, and I confirmed this. But my question is: how are these embeddings created? Are they initialized randomly (they are not a tensor of zeros)? Can I insert my own embeddings instead of the ones created by `resize_token_embeddings()`?

Any kind of help will be appreciated!
`transformers.modeling_utils.PreTrainedModel.resize_token_embeddings` (https://github.com/huggingface/transformers/blob/38611086d293ea4a5809bcd7fadd8081d55cb74e/src/transformers/modeling_utils.py#L1855C14-L1855C27) eventually calls `_get_resized_embeddings`, and `Model._init_weights` is used to initialize the new embedding matrix. Then `new_embeddings.weight.data[:n, :] = old_embeddings.weight.data[:n, :]` makes sure the old token embeddings remain the same. As far as I know, ALMA shares the same architecture as Llama, so the relevant `_init_weights` is the one in `transformers.models.llama.modeling_llama`: for ALMA, the new token embeddings will be initialized from a normal distribution with mean 0 and standard deviation equal to `config.initializer_range`.
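For reference, that `_init_weights` looks roughly like this (copied from recent `transformers` releases; the exact code may differ slightly between versions):

```python
import torch.nn as nn

# From transformers.models.llama.modeling_llama (LlamaPreTrainedModel);
# exact code may vary between transformers versions.
def _init_weights(self, module):
    std = self.config.initializer_range
    if isinstance(module, nn.Linear):
        module.weight.data.normal_(mean=0.0, std=std)
        if module.bias is not None:
            module.bias.data.zero_()
    elif isinstance(module, nn.Embedding):
        module.weight.data.normal_(mean=0.0, std=std)
        if module.padding_idx is not None:
            module.weight.data[module.padding_idx].zero_()
```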
Of course you can insert your own embeddings. There are two ways to do it:
Method 1: override `model._init_weights`
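Below is a minimal sketch of Method 1, assuming the `model`, `alma_tokenizer`, and `word_chunks` names from the question plus a hypothetical `w2v` lookup that returns the Word2Vec vector for a token (the vector size must match the model's hidden size, and none of `word_chunks` may already be in the vocabulary). It relies on the fact, described above, that `_init_weights` is called on the freshly created embedding matrix before the old rows are copied back, so a temporary patch can drop your vectors into the appended rows:

```python
import torch

# Hypothetical input: w2v[token] returns a vector of size model.config.hidden_size.
old_vocab_size = model.get_input_embeddings().weight.shape[0]
custom_vectors = torch.tensor([w2v[tok] for tok in word_chunks], dtype=torch.float32)

original_init_weights = model._init_weights

def patched_init_weights(module):
    original_init_weights(module)  # keep the default normal(0, initializer_range) init
    # Only touch the resized input-embedding matrix, not lm_head or other modules.
    if isinstance(module, torch.nn.Embedding) and module.weight.shape[0] == old_vocab_size + len(word_chunks):
        # New tokens are appended at the end, in the order they were added to the tokenizer.
        module.weight.data[old_vocab_size:] = custom_vectors

model._init_weights = patched_init_weights
model.resize_token_embeddings(len(alma_tokenizer))
model._init_weights = original_init_weights  # restore the default initializer
```

Note that this only fills the input embeddings; the new rows of `lm_head` keep their random initialization unless you set them separately.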
Method 2: do it manually
If you do it manually, don't forget to resize `lm_head` as well, and you may need to update parameters in `model.config` (such as `vocab_size`) too.
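And a minimal sketch of Method 2, under the same assumptions (hypothetical `w2v` lookup, matching dimensionality). The easiest manual route is to let `resize_token_embeddings` grow both the input embeddings and `lm_head` (it also updates `model.config.vocab_size`) and then overwrite the freshly initialized rows; if you instead swap in a new `nn.Embedding` yourself, that is when you must resize `lm_head` and update the config by hand, as noted above:

```python
import torch

# Grows the input embedding matrix and lm_head, and updates model.config.vocab_size.
model.resize_token_embeddings(len(alma_tokenizer))

embedding_matrix = model.get_input_embeddings().weight
new_token_ids = alma_tokenizer.convert_tokens_to_ids(word_chunks)

with torch.no_grad():
    for token, token_id in zip(word_chunks, new_token_ids):
        # Hypothetical w2v lookup; the vector size must equal model.config.hidden_size.
        embedding_matrix[token_id] = torch.tensor(w2v[token], dtype=embedding_matrix.dtype)
```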