How to use a biomedical model from Hugging Face to get text embeddings?

I have biomedical text that I'm trying to get the embeddings for using a biomedical transformer:

my_text = ["Chocolate has a history of human consumption tracing back to 400 AD and is rich in polyphenols such as catechins, anthocyanidins, and pro anthocyanidins. As chocolate and cocoa product consumption, along with interest in them as functional foods, increases worldwide, there is a need to systematically and critically appraise the available clinical evidence on their health effects. A systematic search was conducted on electronic databases such as MEDLINE, EMBASE, and Cochrane Central Register of Controlled Trials (CENTRAL) using a search strategy and keywords. Among the many health effects assessed on several outcomes (including skin, cardiovascular, anthropometric, cognitive, and quality of life), we found that compared to controls, chocolate or cocoa product consumption significantly improved lipid profiles (triglycerides), while the effects of chocolate on all other outcome parameters were not significantly different. In conclusion, low-to-moderate-quality evidence with short duration of research (majority 4-6 weeks) showed no significant difference between the effects of chocolate and control groups on parameters related to skin, blood pressure, lipid profile, cognitive function, anthropometry, blood glucose, and quality of life regardless of form, dose, and duration among healthy individuals. It was generally well accepted by study subjects, with gastrointestinal disturbances and unpalatability being the most reported concerns."]

I found that I can use sentence-transformers to get embeddings for text pretty easily (I assume I can just average the sentence embeddings over all sentences). I found this SO answer that uses the same framework and seems applicable to any (unless I'm wrong) biomedical model (e.g., this):

import pandas as pd
from sentence_transformers import SentenceTransformer

sbert_model = SentenceTransformer('microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext')
document_embeddings = sbert_model.encode(pd.Series(['hello', 'cell type', 'protein']))
document_embeddings
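
(For the averaging I had in mind, this is roughly what I'd do — split the document into sentences, encode each one, and take the mean; the sentence splitting here is deliberately naive, just for illustration:)

import numpy as np

sentences = my_text[0].split('. ')  # naive split; a real pipeline would use a proper sentence tokenizer
sentence_embeddings = sbert_model.encode(sentences)   # one vector per sentence
doc_embedding = np.mean(sentence_embeddings, axis=0)  # one vector for the whole document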

But when I run the code, I get:

No sentence-transformers model found with name /home/user/.cache/torch/sentence_transformers/microsoft_BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext. Creating a new one with MEAN pooling.
Some weights of the model checkpoint at /home/user/.cache/torch/sentence_transformers/microsoft_BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

If I understand correctly, this means that some of the model's weights are either not used or randomly initialized, which makes me think I can't trust the generated embeddings.
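
(Reading the sentence-transformers docs, my understanding — though I may be wrong — is that the "Creating a new one with MEAN pooling" message means the library wraps the raw BERT checkpoint in an explicit mean-pooling module, i.e., roughly the equivalent of building the model by hand like this:)

from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer('microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode='mean')
sbert_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])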

What is the correct way to do this if, say, I want to use that PubMedBERT model or another one like BioBERT?

1 Answer

Answer from viboognesh:
# Load the base encoder directly (AutoModel, not AutoModelForMaskedLM,
# since we want hidden states rather than masked-LM logits)
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext")
model = AutoModel.from_pretrained("microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext")

Load the model on your system, then convert your text to embeddings with the code below.

import torch

# my_text is the same list of abstracts defined in the question

# Tokenize the text
encoded_input = tokenizer(my_text, padding=True, truncation=True, return_tensors='pt')

# Pass the tokenized input through the model (no gradients needed for inference)
model.eval()
with torch.no_grad():
    model_output = model(**encoded_input)

# Extract the token-level embeddings: shape (batch_size, num_tokens, hidden_size)
embeddings = model_output.last_hidden_state

# If you want to convert the embeddings to a numpy array
embeddings_np = embeddings.numpy()

# Now embeddings_np contains one vector per token for each text
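
Note that last_hidden_state gives one vector per token, not per document. To get a single vector per document, a common recipe (not the only one) is to mean-pool the token vectors using the attention mask, so that padding tokens don't dilute the average. A minimal sketch continuing from the code above:

# Mean-pool token embeddings into one vector per document
mask = encoded_input['attention_mask'].unsqueeze(-1).float()  # (batch, tokens, 1)
summed = (embeddings * mask).sum(dim=1)                       # sum over real tokens only
counts = mask.sum(dim=1).clamp(min=1e-9)                      # token counts, guarded against zero
doc_embeddings = summed / counts                              # (batch, hidden_size)
doc_embeddings_np = doc_embeddings.numpy()

Also keep in mind that truncation=True silently drops everything past the model's 512-token limit, so very long documents lose their tails; one workaround is to chunk the text, embed each chunk, and average the chunk vectors.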