The main goal of my program is to find all occurrences of a certain substring in a file and to provide a way to get not only the "String" that contains the substring, but also the strings before and after it.
My main problem is that the file may be really big (a GB or more) and may contain only one line. Besides, I must use multithreading for this task.
So far my plan is to divide the file into overlapping parts of 2 MB or more using RandomAccessFile (the readFully method) and let the threads search each part for the substring using the Boyer-Moore algorithm. That way I could probably find the offset at which the substring begins.
- I want to use overlapping parts because the beginning of the substring may be in one part of the file and the rest of it in the next one
- 2 MB is just a minimum; the length of the parts should probably depend on the file length in bytes
However, I have no idea how to get a "String" that contains the substring, plus the strings before and after it, because:
- the "String" may be split across parts
- the strings before/after the "String" may be in another part, or there may be no strings before/after at all if the "String" is the first/last one in the file
Maybe I should somehow use RandomAccessFile and BufferedReader together?
You seem to be asking for notes and hints about how to solve this problem.
A plan of 'I will just chunk up the file in 2MB chunks' runs into the problem that a chunk can start in the middle of a character (charset encodings like UTF-8 can take more than 1 byte to store a character), which complicates matters. One solution is to take your 'needle' (the string you are searching for), turn it into bytes (e.g. needle.getBytes(StandardCharsets.UTF_8)), and search for those bytes.

Overlap
If 'needle' is small, you can overlap. Say you have a 9MB file, you want to chunk in 2MB blocks, and the needle is 50 bytes long. The overlap solution dictates that you extend the endpoint of every chunk by 49 bytes (1 less than the size of the needle). So the first chunk isn't precisely 0-2097151 (2MB worth); it's 49 more than that - it goes from 0 to 2097200. The second chunk still starts at 2097152, so there is a 49-byte overlap. This way it is still not possible for both chunks to report the same needle, but you are guaranteed that if there is a needle right at that boundary, one of them will find it. This becomes unwieldy if the needle is very large.
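If it helps to see the overlap arithmetic in code, here is a minimal single-threaded sketch under those assumptions (the class and helper names are made up, and the naive byte comparison stands in for Boyer-Moore; each loop iteration is an independent job you could hand to a thread):

```java
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ChunkScanner {

    /** Returns the absolute offsets of every occurrence of 'needleText' in the file. */
    static List<Long> scan(String path, String needleText, int chunkSize) throws Exception {
        byte[] needle = needleText.getBytes(StandardCharsets.UTF_8);
        List<Long> hits = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long fileLen = raf.length();
            for (long start = 0; start < fileLen; start += chunkSize) {
                // Extend each chunk by needle.length - 1 bytes so a needle that
                // straddles the boundary is fully contained in exactly one chunk.
                long end = Math.min(fileLen, start + chunkSize + needle.length - 1);
                byte[] buf = new byte[(int) (end - start)]; // chunkSize is assumed to fit in an int
                raf.seek(start);
                raf.readFully(buf);
                for (int i = 0; i + needle.length <= buf.length; i++) {
                    if (matchesAt(buf, i, needle)) hits.add(start + i);
                }
            }
        }
        return hits;
    }

    static boolean matchesAt(byte[] haystack, int pos, byte[] needle) {
        for (int j = 0; j < needle.length; j++) {
            if (haystack[pos + j] != needle[j]) return false;
        }
        return true;
    }
}
```

Because a match starting inside the 49-byte extension cannot fit within the buffer, each occurrence is reported by exactly one chunk and nothing is counted twice.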
Half-match passing
A solution that can deal with large needles is to pass boundary half matches to a collector.
Generally, in such multithreaded jobs you use the map-reduce concept (explained in more detail below).
This model means you can half-match. The thread that processes a single, say, 2 MB chunk of data turns that into a result. A result consists of any number of 'found it!' entries, and they come in 3 flavours:
- A complete needle was found inside the chunk, at some offset.
- The chunk starts with a suffix of the needle: e.g. the needle is 'Hello, World!' and this chunk starts with 'orld!'. That does not mean a needle is actually found, but it is part of what this chunk processor must report to the reducer.
- The chunk ends with a prefix of the needle (e.g. it ends with 'Hello, Wor').

The reducer can line up reports of needle-prefixes and needle-suffixes: e.g. for 'Hello, World!', which is 13 long, if chunk 14 reports 'needle-prefix = 8' and chunk 15 reports 'needle-suffix = 5', there's a needle in the file at chunk15start minus 8.

The nananananaba issue (which is what Boyer-Moore tries to solve) must also be handled here! If the needle is 'nananaba' and one chunk ends in 'nanana' and the next chunk starts with 'nanaba', then there is a needle there, but it doesn't start at the first chunk's 'nanana' - it starts 4 characters later.
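To make that concrete, here is a rough sketch of what a per-chunk result and the reducer's pairing step might look like (ChunkResult, its fields and Reducer are invented names for illustration; it assumes the needle is shorter than a chunk, so a straddling match crosses exactly one boundary):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shape of what one chunk processor reports.
record ChunkResult(
        long chunkStart,
        List<Long> fullMatches,          // absolute offsets of complete matches in this chunk
        List<Integer> prefixLensAtEnd,   // k: last k bytes of the chunk == first k bytes of the needle
        List<Integer> suffixLensAtStart  // k: first k bytes of the chunk == last k bytes of the needle
) {}

class Reducer {
    /** Merges per-chunk results (ordered by chunkStart) into absolute match offsets. */
    static List<Long> reduce(List<ChunkResult> results, int needleLen) {
        List<Long> matches = new ArrayList<>();
        for (ChunkResult r : results) matches.addAll(r.fullMatches());
        for (int i = 0; i + 1 < results.size(); i++) {
            ChunkResult left = results.get(i);
            ChunkResult right = results.get(i + 1);
            // Try every reported prefix/suffix pair; a pair summing to the needle
            // length is a genuine match straddling this boundary.
            for (int p : left.prefixLensAtEnd()) {
                for (int s : right.suffixLensAtStart()) {
                    if (p + s == needleLen) {
                        matches.add(right.chunkStart() - p);
                    }
                }
            }
        }
        return matches;
    }
}
```

Note that each chunk has to report every prefix/suffix length that matches, not just the longest one - that is exactly what makes the 'nananaba' boundary come out right (prefix 2 plus suffix 6, not prefix 6 plus suffix 6).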
This model is called map-reduce:
Each chunk processor maps the input data onto an intermediate result value. Generally results are vastly smaller than inputs.
A single reducer (sometimes running on its own dedicated thread, but it can simply be done by a worker thread in between doing a mapping operation) will take multiple mappings (these result values) and reduce them down to a single result value. Keep going until there's only one such result value left, from which it should be trivial to produce the required answer. The job of squashing 2+ mappings down to a single one is called reducing.
Hence, map-reduce. You can read up on it on e.g. Wikipedia for more details if you are intrigued.
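As a sketch of how the mapping and reducing could be wired together with a plain thread pool (this reuses the hypothetical ChunkResult and Reducer from above; mapChunk is deliberately left unimplemented):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class Driver {
    static List<Long> run(String path, byte[] needle, long fileLen, long chunkSize) throws Exception {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<ChunkResult>> futures = new ArrayList<>();
            for (long start = 0; start < fileLen; start += chunkSize) {
                long chunkStart = start;
                long chunkEnd = Math.min(fileLen, start + chunkSize);
                // map: each task turns one chunk into a small ChunkResult
                futures.add(pool.submit(() -> mapChunk(path, chunkStart, chunkEnd, needle)));
            }
            List<ChunkResult> results = new ArrayList<>();
            for (Future<ChunkResult> f : futures) results.add(f.get());
            // reduce: squash all the per-chunk results into the final answer
            return Reducer.reduce(results, needle.length);
        } finally {
            pool.shutdown();
        }
    }

    static ChunkResult mapChunk(String path, long start, long end, byte[] needle) {
        // would read bytes [start, end) with RandomAccessFile.readFully and scan them,
        // recording full matches plus the boundary prefix/suffix lengths
        throw new UnsupportedOperationException("left as an exercise");
    }
}
```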
It depends on lots of factors, but on many, many systems, using multiple threads to do extremely basic processing (and this is extremely basic, even Boyer-Moore is) on files read from disk is pointless - the disk is the bottleneck, not the CPU. For example, on spinning disks (which, admittedly, are kinda old news now), this will be significantly slower: to provide data to all of the, say, 20 threads asking for it at once, the disk has to keep seeking back and forth, giving each thread data from a separate slice of the disk. The disk is entirely the bottleneck, so the only job is to ensure the disk can provide the bytes as efficiently as possible. For a contiguous file (one where all bytes are laid out one after another), that is best done by having one thread read the bytes from start to finish in one go.
At the other extreme, some sort of fast NVMe multi-RAID setup means contiguous access has no bearing on performance, and the CPU is probably the bottleneck rather than the storage, so there multi-core would help.
I'd bet on multicore making things, if anything, slower. Processing 20 GB worth of data off a middling-speed disk takes as long as the disk needs to deliver those bytes - throwing threads at the problem isn't going to change that.
It mostly doesn't matter. But, fire up the right number of threads.
Generally you know how many threads you want (usually about 2x the number of CPU cores you have, but, really, 'as many threads as CPU cores' is all you need). Thus, you can figure out the chunk size by taking the total file size and dividing by that. However, it really doesn't matter. Assuming a disk that does not care whatsoever about how you pull bytes off of it (highly unlikely, see the previous point!) - if you have a 10TB file and go with a chunk size of, say, 512MB, that means 20480 chunks. It's rather unlikely your CPU has 20480 cores. It doesn't matter though: if it has 20 cores, have 20 threads running, and each thread will just process a 512MB chunk, then move on to another chunk to work on; it's not going to have a meaningful effect on how fast this process goes. As long as the chunk sizes aren't extreme (a chunk size of, say, 20 bytes is going to be a problem), and all the cores are generally kept busy (trivial case: 10 chunks total and 20 cores would be bad - you want at least 20 jobs, preferably more like 200, so things can balance out if some chunks end up getting processed much faster than others) - you'll be fine.
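In code, that sizing logic is only a couple of lines (the file name is just an example; the factor of 10 reflects the 'preferably more like 200 jobs for 20 cores' rule of thumb above):

```java
import java.nio.file.Files;
import java.nio.file.Path;

class ChunkSizing {
    public static void main(String[] args) throws Exception {
        long fileSize = Files.size(Path.of("huge.dat")); // example file name
        int threads = Runtime.getRuntime().availableProcessors();
        // Aim for clearly more chunks than threads so the load balances out,
        // but don't let chunks shrink below a couple of MB.
        long targetChunks = threads * 10L;
        long chunkSize = Math.max(2L * 1024 * 1024, fileSize / targetChunks);
        System.out.printf("threads=%d, chunkSize=%d bytes%n", threads, chunkSize);
    }
}
```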
The fork/join framework is designed for this stuff, you might want to look into it.
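If you go that route, a search over a byte range maps naturally onto a RecursiveTask that keeps splitting the range in half until it is small enough to scan directly. A sketch, with scanRange left unimplemented (it would read the range plus needle-length - 1 extra bytes, capped at the file length, and search it as in the overlap sketch):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.RecursiveTask;

class SearchTask extends RecursiveTask<List<Long>> {
    static final long THRESHOLD = 8L * 1024 * 1024; // scan ranges below this directly

    final String path;
    final byte[] needle;
    final long from, to;

    SearchTask(String path, byte[] needle, long from, long to) {
        this.path = path; this.needle = needle; this.from = from; this.to = to;
    }

    @Override
    protected List<Long> compute() {
        if (to - from <= THRESHOLD) {
            return scanRange(path, needle, from, to); // leaf: search this range directly
        }
        long mid = from + (to - from) / 2;
        SearchTask left = new SearchTask(path, needle, from, mid);
        SearchTask right = new SearchTask(path, needle, mid, to);
        left.fork();                            // run the left half asynchronously
        List<Long> rightHits = right.compute(); // do the right half on this thread
        List<Long> hits = new ArrayList<>(left.join());
        hits.addAll(rightHits);
        return hits;
    }

    static List<Long> scanRange(String path, byte[] needle, long from, long to) {
        // would read [from, to + needle.length - 1), capped at the file length, and search it
        throw new UnsupportedOperationException("left as an exercise");
    }
}
```

You would kick it off with something like new ForkJoinPool().invoke(new SearchTask(path, needleBytes, 0, fileLength)).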