create custom embedding function in chromadb for semantic search


I have the Python 3 code below. I have a ChromaDB vector database, and I'm trying to create embeddings for chunks of text like the example below using a custom embedding function. My end goal is to do semantic search over a collection I create from these text chunks. So I'm upserting the text chunks along with embeddings and metadata into the ChromaDB collection and then querying the collection. What I'm wondering is whether I'm creating the custom embedding function correctly.
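For context on what the query step is doing: semantic search ultimately ranks stored embeddings by vector similarity to the query embedding (Chroma uses a distance metric internally; cosine similarity is one common choice). A minimal, stdlib-only sketch of the idea, with made-up 3-dimensional vectors standing in for real sentence embeddings:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# toy vectors standing in for sentence embeddings
query_vec = [0.1, 0.9, 0.2]
doc_vecs = {
    "data scientist posting": [0.1, 0.8, 0.3],
    "unrelated text": [0.9, 0.0, 0.1],
}

# rank documents by similarity to the query, highest first
ranked = sorted(doc_vecs,
                key=lambda d: cosine_similarity(query_vec, doc_vecs[d]),
                reverse=True)
```

Here `ranked[0]` comes out as the posting whose toy vector points in roughly the same direction as the query vector, which is the behavior the collection query relies on.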

The documentation for creating a custom embedding function can be found here:

https://docs.trychroma.com/embeddings

I want to use the Hugging Face sentence transformer model named all-MiniLM-L12-v2. Its documentation can be found here:

https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2

(I know it's possible to switch to all-MiniLM-L12-v2 in the SentenceTransformerEmbeddingFunction, but I'm building a function with it to better understand how you create custom embedding functions.)

What I'm wondering is whether I need to split the text up into a list of sentences and pass that list to model.encode in MyEmbeddingFunction. I created the custom function below by combining the two code sources mentioned in the documentation listed above. In the all-MiniLM-L12-v2 documentation page's example code, the input passed to model.encode is a list of sentences. But in my code I'm passing it entire paragraphs like the one below. Would querying the collection work better if I split the example below into a list of sentences and passed that list to model.encode?

I also have my code and results of a query below.

example:

"An email with title: Urgent || Data Scientist/Engineer || Location - Las Vegas, NV was sent to job seeker Jerome Powell on Tuesday, August 22, 2023 at 06:54 AM PDT. It was for the position of Data Scientist/Engineer. It's location was Las Vegas, NV. The employment type was contract. It had the required skills: statistical programming languages, R, Python, sql, hive, pig, scala, java, C++, statistics, statistical tests, distributions, regression, maximum likelihood estimators, machine learning,k-Nearest Neighbors, Naive Bayes, SVM, Decision Forests, Data Wrangling, Data Visualization, matplotlib, ggplot, d3.js., Tableau, Communication Skills, Software Engineering, Problem-solving, analytical, degree."
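If I did want to try sentence-level encoding, the split itself could be a simple regex over sentence-ending punctuation, and the resulting list of strings would be what gets passed to model.encode. A rough, stdlib-only sketch of just the splitting step (using shortened sentences, since the regex here is naive):

```python
import re

text = ("An email was sent to job seeker Jerome Powell. "
        "It was for the position of Data Scientist/Engineer. "
        "The employment type was contract.")

# naive split: break after ., !, or ? followed by whitespace;
# real chunks with abbreviations (e.g. "d3.js.") would need a smarter splitter
sentences = re.split(r'(?<=[.!?])\s+', text.strip())

# `sentences` is now a list of strings, the same shape the
# all-MiniLM-L12-v2 example code passes to model.encode
```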

code:

# creating custom embeddings with non-default embedding model

from chromadb import Documents, EmbeddingFunction, Embeddings
from sentence_transformers import SentenceTransformer

class MyEmbeddingFunction(EmbeddingFunction):
    def __init__(self):
        # load the model once here instead of on every __call__
        self.model = SentenceTransformer(
            'sentence-transformers/all-MiniLM-L12-v2',
            device='cuda',
            use_auth_token=hf_auth,  # Hugging Face token, defined elsewhere
            cache_folder='/home/username/stuff/username_storage/LLM/weights/huggingface/hub/')

    def __call__(self, input: Documents) -> Embeddings:
        # embed the documents
        embeddings = self.model.encode(input)

        # convert the numpy arrays to a list of lists
        return [embedding.tolist() for embedding in embeddings]
        
        
        
custom_embeddings = MyEmbeddingFunction()

test_collection = chroma_client.get_or_create_collection(
    name="test_custom_embeddings",
    embedding_function=custom_embeddings
)
                         
                         
# inserting data

test_collection.upsert(
    ids=[f"{x}" for x in summary_df['id'].tolist()],
    documents=summary_df['summary'].tolist(),
    metadatas=summary_df['meta'].tolist()    
)


qry_str = """Title contains Data Scientist"""


db_query_results=test_collection.query(query_texts=qry_str, n_results=2)

result_summaries=[x['summary'] for x in db_query_results['metadatas'][0]]

result_summaries  

output:

["An email with title: Urgent || Data Scientist/Engineer || Location - Las Vegas, NV was sent to job seeker Jerome Powell on Tuesday, August 22, 2023 at 06:54 AM PDT. It was for the position of Data Scientist/Engineer. It's location was Las Vegas, NV. The employment type was contract. It had the required skills: statistical programming languages, R, Python, sql, hive, pig, scala, java, C++, statistics, statistical tests, distributions, regression, maximum likelihood estimators, machine learning,k-Nearest Neighbors, Naive Bayes, SVM, Decision Forests, Data Wrangling, Data Visualization, matplotlib, ggplot, d3.js., Tableau, Communication Skills, Software Engineering, Problem-solving, analytical, degree.", "An email with title: Lead Data Scientist - O'Fallon, MO (Hybrid) was sent to job seeker Jerome Powell on Tuesday, August 22, 2023 at 07:16 AM PDT. It was for the position of Lead Data Scientist. It's location was O'Fallon, MO (Hybrid). The employment type was contract. It had the required skills: Masters or PhD in mathematics, statistics, computer science, or related fields, lead large data science projects, research, communication skills, predictive, batch, streaming, python, R, hadoop, spark, MySQL, anomaly detection, supervised learning, unsupervised learning, time-series, natural language processing, Numpy, SciPy, Pandas, Scikit-learn, Tensorflow, Keras, NLTK, Gensim, BERT, NetworkX, organized, self motivated, data visualization."]
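(For anyone puzzling over the [0] in the extraction step: query accepts multiple query texts at once, so each field in the result dict is a list of per-query lists. A sketch with a mocked result dict of the same nesting, with made-up values:)

```python
# mocked query result mirroring chromadb's nesting:
# one outer list entry per query text
mock_results = {
    "ids": [["42", "77"]],
    "distances": [[0.31, 0.45]],
    "metadatas": [[{"summary": "first email summary"},
                   {"summary": "second email summary"}]],
    "documents": [["first email summary", "second email summary"]],
}

# [0] selects the result list for the first (and only) query text
result_summaries = [x["summary"] for x in mock_results["metadatas"][0]]
```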
