document similarity search - annoy & pysparNN

648 Views Asked by I-PING Ou At 17 May 2025 at 14:56

I am trying to find a way to retrieve the nearest, or approximate nearest, neighbors of documents.

Right now I am using tfidf as the vector representation of each document. My data is pretty big (N ~ 1 million). When I use annoy with tfidf, I run out of memory. I figure it's because of tfidf's high dimensionality (my vocabulary is about 2,000,000 Chinese words).
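A back-of-the-envelope estimate supports that guess. The sketch below assumes annoy stores every vector densely as 32-bit floats and ignores the extra space taken by the trees; the helper name `dense_index_bytes` and the 300-dimensional comparison are illustrative assumptions, not figures from the question.

```python
def dense_index_bytes(n_docs, n_dims, bytes_per_value=4):
    """Rough lower bound on the size of a dense float32 vector store.

    Assumption: one 4-byte float per dimension per document,
    with no allowance for tree or bookkeeping overhead.
    """
    return n_docs * n_dims * bytes_per_value


n_docs = 1_000_000       # N ~ 1 million documents (from the question)
vocab_size = 2_000_000   # tfidf dimensionality (from the question)
dense_dims = 300         # hypothetical size of a learned dense embedding

tfidf_tb = dense_index_bytes(n_docs, vocab_size) / 1e12
dense_gb = dense_index_bytes(n_docs, dense_dims) / 1e9

print(f"tfidf stored densely: {tfidf_tb} TB")   # prints 8.0 TB
print(f"{dense_dims}-dim embeddings: {dense_gb} GB")  # prints 1.2 GB
```

Under these assumptions the raw tfidf vectors alone would need about 8 TB once densified, while a few hundred dense dimensions would fit in roughly a gigabyte.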

I then tried pysparNN, which works great. However, my concern is that as my data grows, pysparNN will build a bigger index that might eventually not fit into RAM. This is a problem because pysparNN does not use a static file the way annoy does.

I am wondering what would be a good solution for finding nearest neighbors in text data. Right now I am looking into using gensim's annoy index, with doc2ve


There is 1 best solution below

shoegazerstella On 27 March 2019 at 14:12

I don't find tfidf to be a great solution when it comes to document embedding. You might try extracting more sophisticated text (document) embeddings with FastText, LASER, gensim, BERT, ELMo, or similar tools, and then use annoy or faiss to build an index and retrieve similar documents.
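A minimal sketch of that workflow with annoy might look like the following. It assumes the `annoy` package is installed and that you already have one fixed-length dense vector per document; `placeholder_embeddings`, `build_annoy_index`, and `query_annoy_index` are hypothetical helper names, and the random vectors only stand in for real embeddings from whichever model you choose.

```python
import random


def placeholder_embeddings(n_docs, dim, seed=0):
    """Stand-in for real document embeddings (random Gaussian vectors)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_docs)]


def build_annoy_index(vectors, index_path, n_trees=10):
    """Build an annoy index over dense vectors and save it to a file."""
    from annoy import AnnoyIndex  # third-party: pip install annoy

    dim = len(vectors[0])
    index = AnnoyIndex(dim, "angular")  # angular distance tracks cosine similarity
    for doc_id, vec in enumerate(vectors):
        index.add_item(doc_id, vec)
    index.build(n_trees)    # more trees: better recall, larger file
    index.save(index_path)  # static file that can be loaded again later
    return dim


def query_annoy_index(index_path, dim, query_vec, k=5):
    """Load a saved index and return (doc_ids, distances) of the k nearest docs."""
    from annoy import AnnoyIndex

    index = AnnoyIndex(dim, "angular")
    index.load(index_path)  # memory-maps the file rather than reading it all in
    return index.get_nns_by_vector(query_vec, k, include_distances=True)
```

For example, `dim = build_annoy_index(placeholder_embeddings(1000, 128), "docs.ann")` followed by `query_annoy_index("docs.ann", dim, some_vector)` returns the ids and angular distances of the five closest documents. Because the saved index is memory-mapped on load, this also addresses the concern about an index that has to live entirely in RAM.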
