How to select queries for a text search benchmark?


I have written an exact text search algorithm which I now want to evaluate. I looked at two research papers on the topic, but I couldn't find exactly how they chose the patterns to search for.

I found the Canterbury Corpus, which is used for benchmarking, and I would appreciate other suggestions. How should I select the patterns/queries to search for? Selecting them by hand seems tedious. I could instead generate random numbers, interpret each one as an index into the text, and take a substring starting at that index.
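A minimal sketch of the random-substring idea, assuming the corpus fits in memory as a single string. The function name `sample_patterns` and its parameters are hypothetical, chosen only for illustration; they do not come from any paper or from the Canterbury Corpus.

```python
import random


def sample_patterns(text, length, count, seed=0):
    """Return `count` (offset, pattern) pairs, each pattern a substring of `text`.

    Hypothetical helper: every pattern is cut out of the text itself,
    so each one is guaranteed to occur at least once.
    """
    if length <= 0 or length > len(text):
        raise ValueError("length must be between 1 and len(text)")
    rng = random.Random(seed)  # fixed seed so the same queries can be regenerated
    last_start = len(text) - length  # largest offset that still leaves room for the pattern
    patterns = []
    for _ in range(count):
        start = rng.randint(0, last_start)  # randint is inclusive on both ends
        patterns.append((start, text[start:start + length]))
    return patterns


if __name__ == "__main__":
    # Stand-in text; in practice this would be the contents of a corpus file.
    text = "the quick brown fox jumps over the lazy dog " * 100
    for length in (4, 16, 64):
        for start, pattern in sample_patterns(text, length, count=3, seed=42):
            print(length, start, repr(pattern))
```

Sampling at several pattern lengths matters because the running time of many exact-matching algorithms depends on pattern length. Since every pattern sampled this way is present in the text, queries that never match would have to be generated separately.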

Are there other/better ways to do this?

Iulia Feroli On

You could take a look at the BEIR project for benchmarking: https://github.com/beir-cellar/beir

It goes into more detail on the types of tests you can run. I believe one of the most popular datasets for evaluation is MS MARCO: https://microsoft.github.io/msmarco/. It has some clearly defined retrieval tasks you can adapt for your case.

There's a leaderboard of the top-ranking search engines, so it can give you some idea of where you stand.

Hope this helps!