Querying a ChromaDB vector database: poor results and default similarity metric


I have the Python 3 code below. It creates a persistent ChromaDB vector database, creates a collection, upserts some email summaries into it, and then queries the collection. The results for a fairly basic query are not very good: I ask for Data Scientist jobs, and the first result is a Data Analyst email, returned ahead of the Lead Data Scientist one (see the output below). Is there anything fundamentally wrong with my code? Does anyone have suggestions for improving the results? Also, what does ChromaDB use by default to match stored documents against the query text? For example, is it cosine similarity or some type of KNN?
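To make the last question concrete: as I understand it, the distance metric (cosine, L2, and so on) and k-nearest-neighbour search are separate pieces. The metric scores each stored vector against the query, and KNN returns the k best-scoring ones. The sketch below is a minimal, library-free illustration of that idea. The function names and the 3-dimensional vectors are made up for the example, and it is not Chroma's actual implementation (which, as far as I know, uses an approximate index rather than a brute-force scan).

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def squared_l2_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn(query, vectors, k, distance):
    """Brute-force KNN: score every stored vector, keep the k closest keys."""
    scored = sorted((distance(query, vec), key) for key, vec in vectors.items())
    return [key for _, key in scored[:k]]


# Made-up toy vectors (hypothetical, only for illustration)
toy_vectors = {
    "a": [1.0, 0.0, 0.0],
    "b": [10.0, 1.0, 0.0],
    "c": [0.0, 1.0, 0.0],
}
toy_query = [2.0, 0.0, 0.0]

nearest_by_cosine = knn(toy_query, toy_vectors, 2, cosine_distance)
nearest_by_l2 = knn(toy_query, toy_vectors, 2, squared_l2_distance)

print(nearest_by_cosine)  # ['a', 'b']
print(nearest_by_l2)      # ['a', 'c']
```

The same search procedure returns a different second neighbour depending on the metric, which is why I would like to know which metric the collection is using.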

code:

import json

import pandas as pd
import chromadb



# Create chromadb vector database
chroma_client = chromadb.PersistentClient(path="RAG-Example-chroma-db")

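# Create the collection, or fetch it if it already exists from an earlier run
# (create_collection raises an error when the name is already taken)
article_collection = chroma_client.get_or_create_collection(name="email-summaries")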



# read json summaries in

import json

with open(r'/home/scotsditch/stuff/scotsditch_storage/LLM/RAG/data/summaries.json') as f:
    summaries_data = json.load(f)
    
# convert to dataframe
summary_df = pd.DataFrame(summaries_data['summaries'])

# create metadata for vector db (one dict per row)
summary_df['meta'] = summary_df.apply(lambda x: {
    'id': x['id'],
    'summary': x['summary'],
}, axis=1)

# get already existing collection
article_collection = chroma_client.get_or_create_collection(name="email-summaries")


# inserting data

article_collection.upsert(
    ids=[str(x) for x in summary_df['id'].tolist()],
    documents=summary_df['summary'].tolist(),
    metadatas=summary_df['meta'].tolist(),
)


# query chroma db collection

qry_str = """What Data Scientist jobs were emailed to Danny Trejo."""

db_query_results = article_collection.query(query_texts=[qry_str], n_results=2)

result_summaries = [x['summary'] for x in db_query_results['metadatas'][0]]

result_summaries

output:

["An email with title: W2 Contract //Data Analyst // Remote (Only PST Candidate ) was sent to job seeker Danny Trejo on Tuesday, August 22, 2023 at 11:40 AM PDT. It was for the position of Data Analyst. It's location was Remote ( West Coast). The employment type was Contract. It had the required skills: SQL, Azure, Power BI, DataBricks, Elicit Requirements, Analytics, Reporting, healthcare, TSQL, Power BI, Data Visualization, Synapse, NLP, R, Python, AI.", "An email with title: Lead Data Scientist - O'Fallon, MO (Hybrid) was sent to job seeker Danny Trejo on Tuesday, August 22, 2023 at 07:16 AM PDT. It was for the position of Lead Data Scientist. It's location was O'Fallon, MO (Hybrid). The employment type was contract. It had the required skills: Masters or PhD in mathematics, statistics, computer science, or related fields, lead large data science projects, research, communication skills, predictive, batch, streaming, python, R, hadoop, spark, MySQL, anomaly detection, supervised learning, unsupervised learning, time-series, natural language processing, Numpy, SciPy, Pandas, Scikit-learn, Tensorflow, Keras, NLTK, Gensim, BERT, NetworkX, organized, self motivated, data visualization."]
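In case it helps anyone diagnosing this: as far as I can tell, the dict returned by `query` also carries `ids` and, by default, a `distances` entry, each parallel to `metadatas` (one inner list per query text). A minimal sketch of pairing them up is below. The helper name `pair_results` is hypothetical, and `sample_results` is a hand-written stand-in with placeholder ids, distances, and summaries, not real output from my collection. In the script above I would pass `db_query_results` instead.

```python
def pair_results(results, query_index=0):
    """Return (distance, id, summary) tuples for one query, in returned order."""
    ids = results["ids"][query_index]
    distances = results["distances"][query_index]
    metadatas = results["metadatas"][query_index]
    return [
        (dist, doc_id, meta["summary"])
        for doc_id, dist, meta in zip(ids, distances, metadatas)
    ]


# Placeholder data shaped like a query result (values are NOT real output)
sample_results = {
    "ids": [["id-1", "id-2"]],
    "distances": [[0.7, 0.9]],
    "metadatas": [[
        {"id": "id-1", "summary": "first placeholder summary"},
        {"id": "id-2", "summary": "second placeholder summary"},
    ]],
}

paired = pair_results(sample_results)
for dist, doc_id, summary in paired:
    print(f"{dist:.3f}  {doc_id}  {summary}")
```

Seeing how close the two distances are would show whether the Data Analyst email clearly beats the Lead Data Scientist one or whether they are nearly tied.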
