We have a set of JSON files in a folder that need to be indexed in ChromaDB using a Python script. The indexed data will later be used for similarity search, with the obtained details serving as context for ChatGPT. The target data for indexing is located at ".dataArr[].data" in our JSON files (see the sample JSON at the bottom).
The ChromaDB object is created with persist_directory to ensure the index is persisted for future use. We are open to using various embedding functions, such as SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2") or OpenAIEmbeddings().
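For reference, the `.dataArr[].data` selection described above can be sketched with only the standard library. This is a minimal illustration, assuming each file is a JSON object with a top-level `dataArr` list whose items carry a string `data` field; the helper names `extract_data_entries` and `load_folder` and the sample values are hypothetical, not part of our actual script.

```python
import json
from pathlib import Path


def extract_data_entries(payload):
    """Return every non-empty string found at .dataArr[].data in one parsed JSON object."""
    entries = []
    for item in payload.get("dataArr", []):
        text = item.get("data")
        # Skip items without a usable string so one bad entry does not stop indexing.
        if isinstance(text, str) and text.strip():
            entries.append(text)
    return entries


def load_folder(json_folder):
    """Yield (file path, entry index, text) for each entry in every JSON file."""
    for path in sorted(Path(json_folder).glob("**/*.json")):
        with open(path, "r", encoding="utf-8") as f:
            payload = json.load(f)
        for i, text in enumerate(extract_data_entries(payload)):
            yield str(path), i, text


# Hypothetical sample mirroring the assumed layout.
sample = {"dataArr": [{"data": "first entry"}, {"data": "second entry"}, {"other": 1}]}
entries = extract_data_entries(sample)
print(entries)
```

Keeping the file path and entry index next to each text makes it easier to trace a search hit back to the JSON item it came from.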
Methods Used:
Method 1: We utilized the langchain library for the following steps:
The following shows only the relevant snippet of the code:
from langchain_community.vectorstores import Chroma
from langchain.schema.document import Document
from langchain.indexes import VectorstoreIndexCreator
from langchain_community.document_loaders import DirectoryLoader, JSONLoader
# Load every JSON file under json_folder; jq_schema selects each .dataArr[].data value
loader = DirectoryLoader(json_folder, glob="**/*.json", show_progress=True, use_multithreading=False, loader_cls=JSONLoader, loader_kwargs={'jq_schema': '.dataArr[].data'}, silent_errors=False)
docs = loader.load()
if len(docs) > 0:
    index = VectorstoreIndexCreator(vectorstore_kwargs={"persist_directory": data_folder}).from_documents(docs)
In this method, no embedding function is specified, so OpenAIEmbeddings() is used by default.
Method 2: We employed the following steps using the standard Python json module:
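The following shows only the relevant snippet of the code:
import json
from langchain_community.vectorstores import Chroma
from langchain.schema.document import Document

with open(filepath, "r") as f:
    data = json.load(f)

docs = []
# Entries live under the top-level "dataArr" key (matching the jq path .dataArr[].data)
for item in data["dataArr"]:
    doc = Document(page_content=item["data"])
    docs.append(doc)

# embedding_function is created elsewhere, e.g. SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
chroma = Chroma(persist_directory=data_folder, embedding_function=embedding_function)
chroma.add_documents(docs)
chroma.persist()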
Issues Faced: When the file chroma.sqlite3 was created using Method 1, the last "dataArr[].data" entry (see the sample JSON at the end of this document) was split automatically and saved as multiple documents in ChromaDB. This causes problems when a similarity search is performed, because the response contains only part of the entry.
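One way to check whether length-based chunking could explain the split seen with Method 1 is to measure each entry against a chunk size. The sketch below is only a diagnostic under stated assumptions: it assumes a character-based splitter with a 1000-character limit (an assumed value; the limit actually applied by our langchain version should be confirmed), and the function name `find_oversized` and the sample entries are hypothetical.

```python
def find_oversized(entries, chunk_size=1000):
    """Return (index, length) for every entry longer than chunk_size characters."""
    return [(i, len(text)) for i, text in enumerate(entries) if len(text) > chunk_size]


# Hypothetical entries: one short, one long enough to exceed the assumed limit.
entries = ["short entry", "x" * 2500]
oversized = find_oversized(entries)
print(oversized)
```

If only the last entry in the sample JSON shows up in this list, that would be consistent with a splitter cutting it into several documents while leaving the shorter entries whole.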
With Method 2, the chroma.sqlite3 file is saved successfully to the specified persist_directory. However, similarity searches on the indexed data do not return accurate results. We tried several different embedding functions and also indexed each JSON item as an individual document (to avoid splitting an entry), but neither resolved the issue.
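To make clear what we mean by similarity search results, here is a toy sketch of ranking by cosine similarity. It is an illustration only: the vectors are made-up two-dimensional values rather than real embeddings, the names `cosine_similarity` and `rank` are hypothetical, and the distance metric our Chroma collection actually uses may differ. The point it illustrates is that the ranking depends entirely on the vectors, so the query would need to be embedded with the same function that was used at indexing time.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vec, doc_vecs):
    """Return document ids sorted from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Toy vectors standing in for embeddings (hypothetical values).
doc_vecs = {"doc_a": [1.0, 0.0], "doc_b": [0.7, 0.7], "doc_c": [0.0, 1.0]}
ranking = rank([0.9, 0.1], doc_vecs)
print(ranking)
```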
We also tried using "RecursiveJsonSplitter" to split the JSON into documents and then adding them to ChromaDB with chroma.add_documents(docs), but that did not solve the issue either.
Request for Assistance: We are exploring alternative methods and would like help identifying what is wrong with the approaches outlined above. Any suggestions would be greatly appreciated.
I am really blocked on this. Can someone please help?