Index the entire DB into a single document using Lucene


I am working on improving the performance of an existing ASP.NET application and reducing the database hits for each search-criteria click on a page. As part of this, I am trying to implement Lucene.Net.

The strange thing is that when I try to index using a "select *" statement on a table that has millions of records, it hangs at the database level itself.

How is it possible to get the entire "select *" result set into a single document in less time, without the application hanging, so that from there I can apply search filters on the document and show the results in the grid?

Thanks in advance


There is 1 best solution below


When indexing millions of records with Lucene.NET you need to break up the process. What you are trying to do is read all of the data up front, hold it in memory, and then have Lucene.NET build one massive index from all of that data. That simply falls apart with large data sets. You need to break the process up into a "buffered" architecture.

What I did in the past, and what you could do, for example:

  • break the select statement into a stored procedure that returns the millions of records in pieces. For example, with 100 million records it would return 25 million rows four times (a paged-read sketch follows this list)
  • I also used four different threads to read the data. Then you start an asynchronous queue: as soon as data is read from the database, it gets fed into the queue buffer. Read up on blocking queues (BlockingCollection<T>) in .NET
  • Then you have another set of threads reading the data from the queue and piping it into the Lucene index-building process (see the pipeline sketch below)
  • the last step is to build the indexes (from the previous step) in parallel and then use the Lucene.NET merge option to merge all of the data into one big index (see the merge sketch below)
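
The batched read from the first bullet could look roughly like the following C#/ADO.NET sketch. The table name (dbo.Products), key column (Id), batch size, and the use of keyset pagination in plain SQL rather than a stored procedure are illustrative assumptions, not the answer's exact setup:

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;

// Hypothetical row shape for the table being indexed.
public record ProductRow(int Id, string Name, string Description);

public static class BatchedReader
{
    // Reads one batch of rows with Id greater than lastId, using keyset
    // pagination so the database never has to materialize the whole table.
    public static List<ProductRow> ReadBatch(string connectionString, int lastId, int batchSize)
    {
        var rows = new List<ProductRow>(batchSize);
        using var conn = new SqlConnection(connectionString);
        conn.Open();

        using var cmd = new SqlCommand(
            @"SELECT TOP (@batchSize) Id, Name, Description
              FROM dbo.Products
              WHERE Id > @lastId
              ORDER BY Id", conn);
        cmd.Parameters.AddWithValue("@batchSize", batchSize);
        cmd.Parameters.AddWithValue("@lastId", lastId);

        using var reader = cmd.ExecuteReader();
        while (reader.Read())
        {
            rows.Add(new ProductRow(
                reader.GetInt32(0),
                reader.GetString(1),
                reader.GetString(2)));
        }
        return rows;
    }
}
```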
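
The read-queue-index pipeline from the second and third bullets might look like the sketch below, built on .NET's BlockingCollection<T> and Lucene.NET 4.8. ProductRow and BatchedReader come from the previous sketch; the key-range partitioning, field names, paths, and thread counts are assumptions for illustration:

```csharp
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class IndexPipeline
{
    public static void Run(string connectionString, string indexRoot,
                           int readerThreads = 4, int indexerThreads = 4)
    {
        // Bounded queue: readers block when indexers fall behind, so the
        // whole result set is never held in memory at once.
        using var queue = new BlockingCollection<ProductRow>(boundedCapacity: 50_000);

        // Producers: each reader walks its own key range in batches
        // (assumed partitioning scheme, e.g. 4 x 25 million ids).
        var producers = Enumerable.Range(0, readerThreads).Select(r => Task.Run(() =>
        {
            int lastId = r * 25_000_000;
            int upperBound = r == readerThreads - 1 ? int.MaxValue : (r + 1) * 25_000_000;
            bool done = false;
            while (!done)
            {
                var batch = BatchedReader.ReadBatch(connectionString, lastId, 10_000);
                if (batch.Count == 0) break;
                foreach (var row in batch)
                {
                    if (row.Id >= upperBound) { done = true; break; }
                    queue.Add(row);
                    lastId = row.Id;
                }
            }
        })).ToArray();

        // Consumers: each builds its own partial index so there is no
        // contention on a single IndexWriter.
        var consumers = Enumerable.Range(0, indexerThreads).Select(i => Task.Run(() =>
        {
            using var dir = FSDirectory.Open($"{indexRoot}/part-{i}");
            using var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
            using var writer = new IndexWriter(dir,
                new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer));

            foreach (var row in queue.GetConsumingEnumerable())
            {
                var doc = new Document
                {
                    new StringField("id", row.Id.ToString(), Field.Store.YES),
                    new TextField("name", row.Name, Field.Store.YES),
                    new TextField("description", row.Description, Field.Store.NO)
                };
                writer.AddDocument(doc);
            }
            writer.Commit();
        })).ToArray();

        Task.WaitAll(producers);
        queue.CompleteAdding();   // signals consumers to finish draining the queue
        Task.WaitAll(consumers);
    }
}
```

The bounded queue is what makes this "buffered": memory use stays flat regardless of table size, because readers simply wait whenever the indexing side falls behind.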
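
For the last bullet, a minimal merge sketch using IndexWriter.AddIndexes, with the partial-index paths carried over from the pipeline sketch above:

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class IndexMerger
{
    public static void Merge(string indexRoot, int partCount)
    {
        using var mergedDir = FSDirectory.Open($"{indexRoot}/merged");
        using var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
        using var writer = new IndexWriter(mergedDir,
            new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer));

        for (int i = 0; i < partCount; i++)
        {
            // AddIndexes copies the segments of each partial index into the
            // merged index without re-analyzing the documents.
            using var partDir = FSDirectory.Open($"{indexRoot}/part-{i}");
            writer.AddIndexes(partDir);
        }
        writer.Commit();
    }
}
```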

I have found the architecture above to be scalable, as you can run as many threads (read and build) as you have cores. It is also cloud scalable: you can use Azure Worker Roles and Queues to spread the work across many machines if you have a very large index.