I have to create a utility that searches through 40 to 60 GiB of text files as quickly as possible.
Each file holds around 50 MB of data consisting of log lines (about 630,000 lines per file).
A NoSQL document database is unfortunately not an option...
As of now I am using the Aho-Corasick algorithm for the search, which I took from Tomas Petricek's blog. It works very well.
I process the files in Tasks. Each file is loaded into memory by simply calling File.ReadAllLines(path). The lines are then fed into the Aho-Corasick search one by one, so each file causes around 600,000 calls to the algorithm (I need the line number in my results).
This takes a lot of time and requires a lot of memory and CPU.
I have very little expertise in this field as I usually work in image processing.
Can you guys recommend algorithms and approaches which could speed up the processing?
Below is a more detailed view of the Task creation and file loading, which is pretty standard. For more information on Aho-Corasick, please visit the linked blog post above.
private KeyValuePair<string, StringSearchResult[]> FindInternal(
    IStringSearchAlgorithm algo,
    string file)
{
    List<StringSearchResult> result = new List<StringSearchResult>();
    string[] lines = File.ReadAllLines(file);
    // Feed each line to the search algorithm and tag every hit with its line number.
    for (int i = 0; i < lines.Length; i++)
    {
        StringSearchResult[] results = algo.FindAll(lines[i]);
        for (int j = 0; j < results.Length; j++)
        {
            results[j].Row = i;
        }
        result.AddRange(results);
    }
    return new KeyValuePair<string, StringSearchResult[]>(
        file, result.ToArray());
}
public Dictionary<string, StringSearchResult[]> Find(
    params string[] search)
{
    IStringSearchAlgorithm algo = new StringSearch();
    algo.Keywords = search;
    Task<KeyValuePair<string, StringSearchResult[]>>[] findTasks
        = new Task<KeyValuePair<string, StringSearchResult[]>>[_files.Count];
    Parallel.For(0, _files.Count, i =>
    {
        findTasks[i] = Task.Factory.StartNew(
            () => FindInternal(algo, _files[i])
        );
    });
    Task.WaitAll(findTasks);
    return findTasks.Select(t => t.Result)
                    .ToDictionary(x => x.Key, x => x.Value);
}
EDIT
See the section Initial Answer below for the original answer.
I further optimized my code by doing the following:
- Paging, to prevent memory overflow / crashes due to the large amount of result data.
- Offloading the search results into local files as soon as they exceed a certain buffer size (64 kB in my case); see the sketch after this list.
- Converting the SearchData struct to binary and back, so the offloaded results can be written and read again.
- Splicing the file array and processing the chunks in Tasks, which greatly increased performance (from 35 sec to 9 sec when processing about 25 GiB of search data).
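As a rough illustration of the offloading step, the sketch below buffers hits and flushes them to a local file in binary form once they pass the 64 kB mark. The simplified SearchData layout, the ResultBuffer class, and all names in it are my assumptions, not the original implementation:

using System.Collections.Generic;
using System.IO;

// Hypothetical, simplified stand-in for the SearchData struct mentioned above;
// the real struct and its exact binary layout are not shown in this post.
public struct SearchData
{
    public int Row;        // line number of the hit
    public int Column;     // offset of the hit within the line
    public string Keyword; // keyword that matched
}

public sealed class ResultBuffer
{
    private const int BufferLimitBytes = 64 * 1024; // the 64 kB threshold mentioned above
    private readonly List<SearchData> _pending = new List<SearchData>();
    private readonly string _spillPath;
    private int _pendingBytes;

    public ResultBuffer(string spillPath)
    {
        _spillPath = spillPath;
    }

    public void Add(SearchData item)
    {
        _pending.Add(item);
        // Rough size estimate: two ints plus the keyword characters.
        _pendingBytes += sizeof(int) * 2 + (item.Keyword?.Length ?? 0) * sizeof(char);
        if (_pendingBytes >= BufferLimitBytes)
        {
            FlushToDisk();
        }
    }

    // Converts the buffered structs to binary and appends them to a local file,
    // then clears the in-memory buffer so it never grows past the limit.
    public void FlushToDisk()
    {
        using (var stream = new FileStream(_spillPath, FileMode.Append, FileAccess.Write))
        using (var writer = new BinaryWriter(stream))
        {
            foreach (SearchData item in _pending)
            {
                writer.Write(item.Row);
                writer.Write(item.Column);
                writer.Write(item.Keyword ?? string.Empty);
            }
        }
        _pending.Clear();
        _pendingBytes = 0;
    }
}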
Splicing / scaling the file array
The code below gives a value scaled/normalized between T_min and T_max.
This value can then be used to determine the size of each array holding n file paths, which is how the scaling and splicing are implemented.
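A minimal sketch of that scaling and splicing, assuming plain min-max normalization onto [T_min, T_max] and a simple chunking of the file list; the FileSplicer class and its signatures are illustrative assumptions, not the original code:

using System.Collections.Generic;
using System.Linq;

public static class FileSplicer
{
    // Maps a raw value from [rawMin, rawMax] onto the target range [tMin, tMax]
    // (classic min-max normalization).
    public static double Scale(double value, double rawMin, double rawMax, double tMin, double tMax)
    {
        if (rawMax == rawMin)
        {
            return tMin; // avoid division by zero when all inputs are equal
        }
        return (value - rawMin) / (rawMax - rawMin) * (tMax - tMin) + tMin;
    }

    // Splits the file paths into slices of roughly 'batchSize' entries,
    // one slice per Task.
    public static List<string[]> Splice(IReadOnlyList<string> files, int batchSize)
    {
        var slices = new List<string[]>();
        for (int i = 0; i < files.Count; i += batchSize)
        {
            slices.Add(files.Skip(i).Take(batchSize).ToArray());
        }
        return slices;
    }
}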
Initial Answer
I had the best results so far after switching to MemoryMappedFile and moving from Aho-Corasick back to Regex (a demand was made that pattern matching is a must-have).
There are still parts that can be optimized or changed, and I'm sure this is not the fastest or best solution, but for now it's alright.
The code returns the results in 30 seconds for 25 GiB worth of data.
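A minimal sketch of the MemoryMappedFile + Regex combination described above; the MappedRegexSearch name, the single-pattern signature, and the line-by-line read are assumptions, not the measured implementation behind the 30-second figure:

using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text.RegularExpressions;

public static class MappedRegexSearch
{
    public static List<KeyValuePair<int, string>> Find(string path, string pattern)
    {
        var regex = new Regex(pattern, RegexOptions.Compiled);
        var hits = new List<KeyValuePair<int, string>>();

        using (var mmf = MemoryMappedFile.CreateFromFile(
            path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (var stream = mmf.CreateViewStream(0, 0, MemoryMappedFileAccess.Read))
        using (var reader = new StreamReader(stream))
        {
            string line;
            int row = 0;
            // Note: the mapped view is rounded up to the system page size, so a
            // trailing block of '\0' bytes can appear after the real end of file;
            // it simply won't match a normal pattern.
            while ((line = reader.ReadLine()) != null)
            {
                if (regex.IsMatch(line))
                {
                    hits.Add(new KeyValuePair<int, string>(row, line));
                }
                row++;
            }
        }
        return hits;
    }
}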
I had a stupid idea yesterday where I thought it might work to convert the file data to a bitmap and look for the input within the pixels, since pixel checking is quite fast.
Just for the giggles... here is the non-optimized test code for that stupid idea: