I'm using LangChain to load a document, split it into chunks, embed those chunks, and then store the embedding vectors in a LangChain VectorStore. My use case requires me to run an algorithm on the embedding vectors, which I have been trying to find a way to fetch, but to no avail.
My idea is to be able to do something like this:
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import SomeVectorStore
from langchain_openai import OpenAIEmbeddings
loader = TextLoader("../document.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
db = SomeVectorStore.from_documents(docs, embeddings)
# get all the embeddings and their corresponding chunks from the db
embeddings_and_their_chunks = db.some_way_to_get_all_embeddings()
The exact method for retrieving embeddings from a VectorStore depends on the specific implementation you're using; however, most vector stores provide some way to iterate over or export the stored vectors. Assuming SomeVectorStore has a method items() that returns an iterator over (key, value) pairs, where the key is the chunk and the value is the corresponding embedding, you could simply iterate over it and collect the pairs.
If SomeVectorStore does not provide such a method, you would need to check its documentation or source code to find out how to retrieve the stored vectors. If there's no built-in way to retrieve all vectors, you might need to keep track of the keys (i.e., the chunks) you're storing in the VectorStore, and then use those keys, or embeddings you computed yourself, to get at the vectors later.
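A minimal sketch of the first approach. Note that items() is hypothetical, not part of the LangChain VectorStore interface, so the in-memory store below only stands in for a backend that happens to expose such a method:

```python
class InMemoryVectorStore:
    """Stand-in for a vector store that exposes items(); real
    LangChain stores generally do not have this exact method."""

    def __init__(self):
        self._data = {}  # chunk text -> embedding vector

    def add(self, chunk, embedding):
        self._data[chunk] = embedding

    def items(self):
        # Yield (chunk, embedding) pairs for every stored entry.
        return self._data.items()


def get_all_embeddings(db):
    # Walk the store once, collecting chunks and vectors in parallel lists.
    chunks, vectors = [], []
    for chunk, embedding in db.items():
        chunks.append(chunk)
        vectors.append(embedding)
    return chunks, vectors
```

With a real store you would swap InMemoryVectorStore for whatever iteration or export hook your backend actually provides.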
Again, the exact details would depend on the specific VectorStore you're using.
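One store-agnostic fallback is to compute the embeddings yourself with embed_documents() and keep your own chunk-to-vector mapping before handing anything to the store. A sketch, with a fake embedder standing in for OpenAIEmbeddings so it runs offline:

```python
from typing import Dict, List


class FakeEmbeddings:
    """Offline stand-in for OpenAIEmbeddings; embed_documents() returns
    one vector per input text, just like the real embedder."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Toy vectors derived from text length; a real embedder
        # returns dense semantic vectors instead.
        return [[float(len(t)), 0.0] for t in texts]


def embed_and_track(texts: List[str], embedder) -> Dict[str, List[float]]:
    # Compute the vectors up front and keep them keyed by chunk, so your
    # algorithm can read them back regardless of what the store exposes.
    vectors = embedder.embed_documents(texts)
    return dict(zip(texts, vectors))
```

You can still load the same chunks into the VectorStore afterwards; the mapping simply preserves your own copy of the vectors for the algorithm to consume.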