Using pgvector-python in FastAPI to get most similar data


I have a FastAPI app where I use pgvector-python to store embeddings from an LLM. Storing data works fine, but I'm struggling to figure out how to retrieve data based on cosine similarity in an efficient way.

Say a request comes in with three new embeddings, data_new = [[1,2,3],[3,2,1],[1,2,1]], and for each embedding in data_new I want to get the row from my database with the highest similarity.

I could loop over data_new and run some raw SQL, e.g.

from sqlalchemy import create_engine, text

data_new = [[1, 2, 3], [3, 2, 1], [1, 2, 1]]
TOP_DOCS = []
engine = create_engine(...)  # returns a SQLAlchemy Engine (connection URL omitted)

with engine.connect() as conn:
    for data in data_new:
        # pass the vector as a bound text literal and cast it, instead of formatting it into the SQL string
        vec_literal = "[" + ",".join(str(v) for v in data) + "]"
        query = text("SELECT * FROM embeddings ORDER BY embedding <=> CAST(:vec AS vector) LIMIT 1")
        res = conn.execute(query, {"vec": vec_literal})
        TOP_DOCS.append(res.fetchall())

but I doubt that is the most efficient way.

Isn't there a way using pgvector-python to optimize it, or do we really need to fetch the data like this?

If it matters, the SQLAlchemy model in my FastAPI app looks like this:

from base import Base
from sqlalchemy import Integer
from sqlalchemy.orm import mapped_column
from pgvector.sqlalchemy import Vector

class OcrModel(Base):
    __tablename__ = "embedding"
    user_id = mapped_column(Integer, primary_key=True, index=True)
    embedding = mapped_column(Vector(1536))

1 Answer

Carter Susi

If you're asking about some type of batch query, I don't believe those are supported right now. All a batch query would really do is split one process doing a huge number of float operations into two processes, and so on. For now, the only ways to speed up an ANN search are adjusting the index configuration, the data you put into it, the search parameters, and the distance calculation. But for text embeddings you almost have to use cosine similarity, unless the embedding size/dimension is small.


  1. Use an index.
    pgvector offers an IVFFlat and an HNSW index; each has its pros and cons, but both are better than no index at all. A rough sketch of creating one from SQLAlchemy follows.
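
A sketch, assuming the OcrModel and engine from the question; the index name and HNSW parameters are illustrative:

from sqlalchemy import Index

# HNSW index on the cosine-distance operator class
# for IVFFlat instead: postgresql_using="ivfflat", postgresql_with={"lists": 100}
index = Index(
    "embedding_hnsw_idx",
    OcrModel.embedding,
    postgresql_using="hnsw",
    postgresql_with={"m": 16, "ef_construction": 64},
    postgresql_ops={"embedding": "vector_cosine_ops"},
)
index.create(engine)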

  2. Distance calculation trade-offs.
    The options are L2 norm (Euclidean distance), inner product, and cosine similarity. From the Python below you can see which ones are the cheapest to compute.

import math

def l2_norm(vec1, vec2):
    # Euclidean (L2) distance: square root of the sum of squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec1, vec2)))

def inner_product(vec1, vec2):
    # dot product: sum of element-wise products
    return sum(x * y for x, y in zip(vec1, vec2))

def cosine_similarity(vec1, vec2):
    # dot product divided by the product of the two vector norms
    return sum(x * y for x, y in zip(vec1, vec2)) / ((sum(x ** 2 for x in vec1) ** 0.5) * (sum(y ** 2 for y in vec2) ** 0.5))

  3. Increase the index list count.
-- for 100,000 vectors, the usual guideline is lists = n_vecs / 1000
-- recommended starting point
CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

-- faster search, lower recall
CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 1000);
  4. Analyze performance.
EXPLAIN ANALYZE SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
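
If the plan shows the index is being used but latency or recall is off, pgvector also has per-session search parameters (ivfflat.probes for IVFFlat, hnsw.ef_search for HNSW). A rough sketch of setting one from SQLAlchemy, assuming the engine from the question:

from sqlalchemy import text

with engine.connect() as conn:
    # more probes = better recall but a slower search; hnsw.ef_search is the HNSW equivalent
    conn.execute(text("SET ivfflat.probes = 10"))
    res = conn.execute(text("SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5"))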
  5. Think of ways to get results without running an ANN search at all. Store the embeddings of your top 100 queries and their results, etc. (a rough caching sketch follows).
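
A minimal sketch of that kind of caching, assuming hypothetical embed() and search_nearest() helpers (the names are made up; search_nearest would wrap the actual pgvector query):

from functools import lru_cache

@lru_cache(maxsize=100)
def cached_search(query_text: str):
    # repeat queries are served from the cache; new ones fall through to the ANN search
    vec = embed(query_text)      # hypothetical call to the embedding model
    return search_nearest(vec)   # hypothetical wrapper around the pgvector query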

The pgvector repo goes over all of the syntax and structures.
To see some Python examples, look at the pgvector-python examples.
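
For instance, the per-vector lookup from the question can be written with the comparators pgvector-python adds to Vector columns instead of raw SQL. A sketch, assuming a Session bound to the question's engine and the OcrModel class above (it still runs one query per vector; it just avoids hand-building SQL strings):

from sqlalchemy import select
from sqlalchemy.orm import Session

data_new = [[1, 2, 3], [3, 2, 1], [1, 2, 1]]

with Session(engine) as session:
    top_docs = [
        session.scalars(
            select(OcrModel)
            .order_by(OcrModel.embedding.cosine_distance(vec))
            .limit(1)
        ).first()
        for vec in data_new
    ]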