document similarity search - annoy & pysparNN

I am trying to find a way to retrieve the nearest (or approximate nearest) neighbors of documents.

Right now I am using tfidf as the vector representation of each document. My data is pretty big (N is on the order of a million). When I use annoy with tfidf, I run out of memory. I figure it's because of tfidf's high dimensionality (my vocabulary is about 2,000,000 Chinese words).
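A rough back-of-the-envelope check of that guess, assuming annoy stores every item as a dense vector of 32-bit floats and taking N as exactly one million. The helper name `dense_index_bytes` is made up for illustration, and the figure covers raw vector storage only, not the trees:

```python
def dense_index_bytes(n_items: int, dim: int, bytes_per_float: int = 4) -> int:
    """Raw storage for n_items dense vectors of length dim (illustrative helper)."""
    return n_items * dim * bytes_per_float


n_docs = 1_000_000      # assumption: "N ~ million" taken as exactly 1e6
vocab_size = 2_000_000  # tfidf dimensionality from the question

tfidf_bytes = dense_index_bytes(n_docs, vocab_size)
print(f"dense tfidf vectors: {tfidf_bytes / 1e12:.0f} TB")  # prints "dense tfidf vectors: 8 TB"
```

A sparse matrix stores only the non-zero entries, which would explain why a sparse-aware library copes with the same data.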

I then tried pysparNN, which works great. However, my concern is that as my data grows, pysparNN builds a bigger index, which eventually might not fit into RAM. This is a problem because pysparNN does not use a static file like annoy does.

I am wondering what might be a good solution for finding nearest neighbor for text data. Right now I am looking into using gensim's annoy index, with doc2ve

There is 1 solution below

I don't find tfidf to be a great solution for document embedding. You might try extracting more sophisticated, dense text (document) embeddings with FastText, LASER, gensim, BERT, ELMo, or others, and then use annoy or faiss to build an index for retrieving similar documents.