I'm implementing a log file viewer with ObjectListView; to be precise, the class I'm using is VirtualObjectListView.
In the constructor I assign an implementation of the IVirtualListDataSource interface to the VirtualListDataSource property:
    public LogWindow(List<String> logFiles)
    {
        InitializeComponent();
        // LogSource implements IVirtualListDataSource
        OLV_Log.VirtualListDataSource = new LogSource(logFiles);
    }
The files I'm processing vary from a few lines to millions of lines, so I thought that using a virtual list was the way to go. My problem is that I don't know the number of lines until I have fully read the files, which takes a long time for big files.
Each line is taken from the log files using a yield statement:
    internal class LogSource : IVirtualListDataSource
    {
        // ...
        public class LogLine { /* whatever */ }
        // ...
        private IEnumerable<LogLine> Read()
        {
            foreach (var path in m_logFiles)
            {
                using var fileStream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, 0x1000, FileOptions.Asynchronous | FileOptions.SequentialScan);
                using var streamReader = new StreamReader(fileStream);
                for (string? line = String.Empty; line != null; line = streamReader.ReadLine())
                {
                    if (!String.IsNullOrEmpty(line))
                    {
                        // process text...
                        var logLine = new LogLine(/* whatever */);
                        // do things...
                        yield return logLine;
                    }
                }
            }
            yield break;
        }
        // ...
    }
The lines are added to a "cache" on demand:
    internal class LogSource : IVirtualListDataSource
    {
        // ...
        public class LogLine { /* whatever */ }
        private readonly List<LogLine> m_logLines = new();
        // ...
        public object GetNthObject(int index)
        {
            int offset = index - m_logLines.Count + 1;
            if (offset > 0)
                // Note: Read() starts a new enumeration from the first file on every call.
                m_logLines.AddRange(Read().Take(offset));
            return m_logLines[index];
        }
        // ...
        public void PrepareCache(int first, int last)
        {
            GetNthObject(last);
        }
        // ...
    }
Since I don't know beforehand how many lines exist, I don't know what to return from LogSource.GetObjectCount(). Here is what I've tried so far:
- Return an arbitrary number.
  - Returning a small number (say 500) works as long as the log file(s) contain at least that number of lines; any line count below it causes an (expected) exception at the `return m_logLines[index];` instruction, while any line count above it truncates the result.
  - Returning `int.MaxValue` behaves as if there were no lines at all (weird!).
- Return a guess based on the size of the files: let's say I assume an average of 75 characters per line, so 750 bytes of log files would equal roughly 10 lines.
  - Same problems as the previous point.
- Update the line count dynamically.
  - If I `return m_logLines.Count` from `GetObjectCount`, my `VirtualObjectListView` is not filled, since the object count is queried before any element is added to `m_logLines`, so it is `0` and there is no call to `GetNthObject` or `PrepareCache`.
So, how should I use a VirtualObjectListView so that it updates the line count dynamically? What should I return from GetObjectCount when I don't know the object count?
Also, any improvement to my code is more than welcome.
[Update]
I have created Gigantor which is a better and more general solution to the problem of counting lines in very large files. It also includes efficient regular expression searches for very large files. It works by partitioning the file into chunks which are processed in parallel by a pool of worker threads and ultimately consolidated into a single continuous result. On my test machine I got rates up to about 3.4 GBytes/s.
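To illustrate the general idea of chunked, parallel processing described above, here is a minimal sketch that counts newline bytes in a file by scanning fixed-size chunks in parallel. It is not Gigantor's code or API; `ChunkedLineCounter`, `CountNewlines`, and the default chunk size are hypothetical choices for this example, and it makes no claim about throughput.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of Gigantor.
public static class ChunkedLineCounter
{
    // Counts '\n' bytes by splitting the file into chunks of chunkSize bytes
    // and scanning the chunks in parallel. Every byte belongs to exactly one
    // chunk, so the per-chunk counts can simply be summed.
    public static long CountNewlines(string path, int chunkSize = 1 << 20)
    {
        long length = new FileInfo(path).Length;
        long chunkCount = (length + chunkSize - 1) / chunkSize;
        long total = 0;

        Parallel.For(0L, chunkCount, chunk =>
        {
            // Each worker opens its own stream so that seeks do not interfere.
            using var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);
            stream.Seek(chunk * chunkSize, SeekOrigin.Begin);

            var buffer = new byte[chunkSize];
            int filled = 0;
            int read;
            // Read may return fewer bytes than requested, so loop until the
            // chunk is full or the end of the file is reached.
            while (filled < chunkSize &&
                   (read = stream.Read(buffer, filled, chunkSize - filled)) > 0)
            {
                filled += read;
            }

            long local = 0;
            for (int i = 0; i < filled; i++)
            {
                if (buffer[i] == (byte)'\n') local++;
            }

            // Consolidate the per-chunk result into the shared total.
            Interlocked.Add(ref total, local);
        });

        return total;
    }
}
```

Note that this counts newline characters, so a final line without a trailing newline is not included in the result.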
[Original Answer]
I found this `ObjectListView` but couldn't easily find the definition for `IVirtualListDataSource` and was too lazy to search hard. So some of my answer is how I think that interface should work based on experience (i.e. hubris).
I'll get to your main question in a minute, but first, I think `PrepareCache` and `GetNthObject` are behaving badly. Calls to `GetNthObject` are reading log lines 0 - N, storing them all in memory as `m_logLines`, and then throwing away almost everything and selecting only the one that is needed each time the view cache is changed. This approach will be slow and run out of memory for large amounts of log data (which I assume you have).
I think you want `PrepareCache` to go grab the log lines specified by `first` and `last` from the log files and just store those lines in memory. Then calls to `GetNthObject` should return lines already cached in memory by a prior call to `PrepareCache`.
Here are some tweaks I made to your LogSource class to facilitate the rest of the discussion.
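As a rough sketch of the kind of state such a class could hold, the example below keeps an index that records, for each log file, which rows of the virtual list its lines occupy. This is an assumption about the shape of the tweaks, not the original listing; `FileIndex`, `Entry`, `AddFile`, and `TryLocate` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical index: stores row ranges per file, never the log text itself.
public sealed class FileIndex
{
    public sealed class Entry
    {
        public Entry(string path, int firstRow, int lineCount)
        {
            Path = path;
            FirstRow = firstRow;
            LineCount = lineCount;
        }

        public string Path { get; }
        public int FirstRow { get; }   // virtual row of the file's first line
        public int LineCount { get; }  // number of rows the file contributes
    }

    private readonly List<Entry> m_entries = new List<Entry>();
    private readonly object m_lock = new object();
    private int m_total;

    // Rows indexed so far; a data source could return this from GetObjectCount.
    public int Count
    {
        get { lock (m_lock) { return m_total; } }
    }

    // Called once per file, after its lines have been counted.
    public void AddFile(string path, int lineCount)
    {
        lock (m_lock)
        {
            m_entries.Add(new Entry(path, m_total, lineCount));
            m_total += lineCount;
        }
    }

    // Maps a virtual row to (file, line within file).
    // Returns false while that row has not been indexed yet.
    public bool TryLocate(int row, out string path, out int lineInFile)
    {
        lock (m_lock)
        {
            foreach (var entry in m_entries)
            {
                if (row >= entry.FirstRow && row < entry.FirstRow + entry.LineCount)
                {
                    path = entry.Path;
                    lineInFile = row - entry.FirstRow;
                    return true;
                }
            }
        }

        path = string.Empty;
        lineInFile = -1;
        return false;
    }
}
```

The linear scan in `TryLocate` is fine for a handful of files; with many files a binary search over `FirstRow` would be the obvious refinement.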
We need something that can gradually build up the knowledge about which file/line a virtual index references in the background. As this knowledge is built in the background, the user should be able to gradually access more and more log data. This `Initialize` function can do that when called in the background (see `InitializeInBackground` later in this post). The idea is to create an index of all the files that easily fits into memory. We do not try to store the log data itself because it won't fit. This index could be improved and optimized by tracking more positions in the file, but I chose to keep it pretty simple and just track the start and end of each file.
Now to your main question. Notice how the line at the end of the prior code block calls `VirtualObjectListView.UpdateVirtualListSize` each time the size is updated. This calls the `GetObjectCount` method of your virtual data source (shown below), which simply returns the current size, and this is why the `Initialize` method has a dependency on the `VirtualObjectListView`.
The function below is a helper function called by `PrepareCache` to map `index` to a log line. It will return the log line if `Initialize` has progressed far enough or null until it has.
Get objects from the cache.
Prepare the cache
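The two steps above (getting objects from the cache and preparing the cache) could be sketched roughly as follows. This is an illustration under assumptions, not the original code: `RowWindowCache` and its `loadRange` delegate are hypothetical, and the delegate stands in for whatever actually reads rows `first` to `last` from the log files. It may return fewer rows than requested if the index has not progressed that far.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical cache that keeps only the most recently requested rows in memory.
public sealed class RowWindowCache<T> where T : class
{
    // Loads rows first..last (inclusive); may return fewer rows than requested.
    private readonly Func<int, int, IReadOnlyList<T>> m_loadRange;
    private IReadOnlyList<T> m_rows = Array.Empty<T>();
    private int m_first;

    public RowWindowCache(Func<int, int, IReadOnlyList<T>> loadRange)
    {
        m_loadRange = loadRange;
    }

    // Mirrors the shape of IVirtualListDataSource.PrepareCache(first, last).
    public void PrepareCache(int first, int last)
    {
        // Skip the reload when the requested range is already covered.
        if (first >= m_first && last < m_first + m_rows.Count) return;

        m_rows = m_loadRange(first, last);
        m_first = first;
    }

    // Returns the cached row, or null if it cannot be loaded yet.
    public T? GetNthObject(int index)
    {
        int offset = index - m_first;
        if (offset < 0 || offset >= m_rows.Count)
        {
            // Cache miss: fall back to loading just this one row.
            PrepareCache(index, index);
            offset = 0;
        }
        return offset < m_rows.Count ? m_rows[offset] : null;
    }
}
```

Only the current window stays in memory, so memory use depends on how many rows the view asks for at once rather than on the size of the logs.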
Below is an example of how to run `Initialize` as a background thread to allow the application to remain responsive while the log files are being processed.
The code in this post has not been tested.
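As a hedged sketch of the background approach, the example below counts lines on a thread-pool thread and reports the running total after each file. The name `InitializeInBackground` comes from this post, but the signature, the `BackgroundIndexer` class, and the `onCountChanged` callback are assumptions made for illustration. It counts every line, including empty ones, whereas the `Read()` method in the question skips empty lines.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper; the signature is an assumption, not the original code.
public static class BackgroundIndexer
{
    // Counts the lines of each file on a thread-pool thread and invokes
    // onCountChanged with the running total after every file.
    public static Task InitializeInBackground(
        IEnumerable<string> paths,
        Action<int> onCountChanged,
        CancellationToken token = default)
    {
        return Task.Run(() =>
        {
            int total = 0;
            foreach (var path in paths)
            {
                token.ThrowIfCancellationRequested();

                // File.ReadLines streams the file instead of loading it whole.
                foreach (string line in File.ReadLines(path))
                {
                    total++;
                }

                // The callback runs on the background thread.
                onCountChanged(total);
            }
        }, token);
    }
}
```

In a WinForms application the callback would presumably store the new total for `GetObjectCount` and then marshal to the UI thread (for example with `Control.BeginInvoke`) before calling `UpdateVirtualListSize` on the list view, since controls should only be touched from the thread that created them.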