I have a FastAPI app where I use pgvector-python to store embeddings from an LLM.
Storing data works fine, but I'm struggling to figure out how to retrieve data based on cosine similarity in an efficient way.
Say I have a request with three new embeddings, data_new = [[1,2,3],[3,2,1],[1,2,1]], and I want to get the row from my database that has the highest similarity with each embedding in data_new.
I could loop over data_new and run some raw SQL, e.g.
from sqlalchemy import create_engine, text

data_new = [[1, 2, 3], [3, 2, 1], [1, 2, 1]]
TOP_DOCS = []
engine = create_engine(...)  # returns a SQLAlchemy Engine

with engine.connect() as conn:
    for data in data_new:
        # <=> is pgvector's cosine distance operator
        query = text("SELECT * FROM embeddings ORDER BY embedding <=> CAST(:vec AS vector) LIMIT 1")
        res = conn.execute(query, {"vec": str(data)})
        TOP_DOCS.append(res.fetchall())
but I doubt that is the most efficient way.
Isn't there a way to optimize this using pgvector-python, or do we really need to fetch the data like this?
If it matters, my model looks like:
from base import Base
from sqlalchemy import Integer
from sqlalchemy.orm import mapped_column
from pgvector.sqlalchemy import Vector

class OcrModel(Base):
    __tablename__ = "embeddings"
    user_id = mapped_column(Integer, primary_key=True, index=True)
    embedding = mapped_column(Vector(1536))
If you're asking about some type of batch query, I don't believe those are supported right now. Even if they were, all it would do is go from one process doing a ton of float operations to two processes, and so on. For now, the only ways to speed up an ANN search are adjusting the index configuration, the data you put into it, the search parameters, and the distance calculation. But for text embeddings you almost have to use cosine similarity unless the embedding size/dimension is small.
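To make the "search parameters" part concrete: once one of the indexes described below is in place, pgvector exposes per-session settings that trade recall for speed. A minimal sketch using SQLAlchemy; the values here are illustrative, not tuned:

from sqlalchemy import text

with engine.connect() as conn:
    # IVFFlat index: more probes = better recall, slower search
    conn.execute(text("SET ivfflat.probes = 10"))
    # HNSW index: larger ef_search = better recall, slower search
    conn.execute(text("SET hnsw.ef_search = 40"))
    # ... run the similarity queries on this same connection ...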
Use an Index.
They offer an IVFFlat and an HNSW index; both have their pros and cons, but either one is better than a brute-force sequential scan.
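As a rough sketch of how that looks with pgvector-python's SQLAlchemy integration, assuming the OcrModel mapping from the question (the index name and the m/ef_construction values are just illustrative):

from sqlalchemy import Index

# HNSW index on the embedding column, built with cosine ops so it can serve
# the <=> (cosine distance) queries above
index = Index(
    "embedding_hnsw_cosine_idx",
    OcrModel.embedding,
    postgresql_using="hnsw",
    postgresql_with={"m": 16, "ef_construction": 64},
    postgresql_ops={"embedding": "vector_cosine_ops"},
)
index.create(engine)

For IVFFlat you would instead pass postgresql_using="ivfflat" with postgresql_with={"lists": ...}, and build the index after the table already has data.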
Distance Calculation Trade-Offs
The options are L2 (Euclidean) distance, inner product, and cosine similarity. From the Python code for each you can see which ones are the cheapest to compute.
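For reference, pgvector-python exposes all three as comparators on the Vector column, so the lookups can also be written through the ORM instead of raw SQL. A sketch assuming the OcrModel mapping and engine from above; the toy vector would need 1536 dimensions to match the column in practice:

from sqlalchemy import select
from sqlalchemy.orm import Session

query_vec = [1, 2, 3]  # example embedding

with Session(engine) as session:
    # L2 / Euclidean distance (operator <->)
    nearest_l2 = session.scalars(
        select(OcrModel).order_by(OcrModel.embedding.l2_distance(query_vec)).limit(1)
    ).first()

    # negative inner product (operator <#>): ascending order gives the largest inner product
    nearest_ip = session.scalars(
        select(OcrModel).order_by(OcrModel.embedding.max_inner_product(query_vec)).limit(1)
    ).first()

    # cosine distance (operator <=>), what the raw SQL in the question uses
    nearest_cos = session.scalars(
        select(OcrModel).order_by(OcrModel.embedding.cosine_distance(query_vec)).limit(1)
    ).first()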
The pgvector repo goes over all of the syntax and index options.
For some Python examples, look at the pgvector-python examples.