How to find certain lines in a file using Java


The main goal of my program is to find all occurrences of a certain substring in a file and make it possible to get not only the "String" the substring occurs in, but also the strings before and after it.
My main problem is that the file may be really big (a GB or more) and may contain only one line. Besides, I must use multithreading for this task.

So far my plan is to divide the file into overlapping parts of 2 MB or more using RandomAccessFile (the readFully method) and let the threads search for the substring using the Boyer-Moore algorithm. That way I could probably find the offset at which the substring begins.

  • I want to use overlapping parts, because part of the substring may be in one part of the file and the rest in the next one
  • 2 MB is just a minimum; the length of the parts should probably depend on the file length in bytes

However, I have no idea how to get the "String" containing the substring and the strings before and after it, because:

  • the "String" itself may be divided between parts
  • the strings before/after the "String" may be in another part, or there may be no strings above/below at all if the "String" is the first/last in the file

Maybe I should somehow use RandomAccessFile and BufferedReader together?


3 Answers

rzwitserloot

You seem to be asking for notes and hints about how to solve this problem.

  1. Note that files are bytes, but text search implies charset encoding.

That means 'I will just chunk up the file into 2 MB chunks' could mean a chunk starts in the middle of a character (charset encodings like UTF-8 can take more than 1 byte to store a character), which complicates matters. One solution is to take your 'needle' (the string you are searching for), turn that into bytes (e.g. needle.getBytes(StandardCharsets.UTF_8)), and search for those bytes.
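A minimal sketch of that byte-level search (assuming UTF-8 and a naive scan for clarity; a real implementation could plug Boyer-Moore in here):

```java
import java.nio.charset.StandardCharsets;

public class ByteSearch {
    // Returns the index of the first occurrence of needle in haystack, or -1.
    // Naive scan for illustration; Boyer-Moore would slot in here instead.
    static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] needle = "Hello, World!".getBytes(StandardCharsets.UTF_8);
        byte[] chunk = "...Hello, World!...".getBytes(StandardCharsets.UTF_8);
        System.out.println(indexOf(chunk, needle)); // prints 3
    }
}
```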

  2. There are 2 ways to solve the 'what if the haystack contains the needle, but my chunking process chunks halfway through a needle?' issue.

Overlap

If 'needle' is small, you can overlap. Say you have a 9MB file, you want to chunk in 2MB blocks, and the needle is 50 bytes long. The overlap solution dictates that you extend the endpoint of every chunk by 49 bytes (1 less than the size of the needle). So, the first chunk isn't precisely 0-2097151 (2MB worth); it's 49 more than that - it goes from 0 to 2097200. The second chunk still starts at 2097152, so there are 49 bytes of overlap. This way it's still not possible for both chunks to report the same needle, but you are guaranteed that if there is a needle right at that boundary, one of them will report it. This becomes unwieldy if the needle is very large.
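As a sketch of that chunking arithmetic (the Chunk record and the helper name are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class Chunker {
    // A chunk is described by its starting byte offset and its length.
    record Chunk(long start, long length) {}

    // Splits a file of fileLength bytes into chunks of roughly chunkSize bytes,
    // extending each chunk by needleLength - 1 bytes so that a needle crossing
    // a boundary is still fully contained in exactly one chunk.
    static List<Chunk> overlappingChunks(long fileLength, long chunkSize, int needleLength) {
        List<Chunk> chunks = new ArrayList<>();
        for (long start = 0; start < fileLength; start += chunkSize) {
            long end = Math.min(start + chunkSize + needleLength - 1, fileLength);
            chunks.add(new Chunk(start, end - start));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // 9 MB file, 2 MB chunks, 50-byte needle: the numbers from the example above.
        // The first chunk covers bytes 0 to 2097200 inclusive.
        for (Chunk c : overlappingChunks(9L * 1024 * 1024, 2L * 1024 * 1024, 50)) {
            System.out.println(c);
        }
    }
}
```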

Half-match passing

A solution that can deal with large needles is to pass boundary half matches to a collector.

Generally, in such multithreaded jobs you use the map-reduce concept:

  1. you map some input into chunks.
  2. Each chunk is independent (to calculate the result for a chunk, you just need the data in it; you do not need to know about any data from beyond the chunk).
  3. A number of working threads pull 'chunk jobs' (a simple descriptor describing the chunk to work on) off the pipe and map them to a muuuch smaller, simpler result value.
  4. A final thread gathers all results and reduces them. Optimally, this is streaming (any 2+ chunks can be reduced to a single one), but it's generally fine if this process cannot reduce until all chunk results are in. It's fine if the reducer needs, for example, the results of contiguous chunks to do its job.

This model means you can half-match. The thread that processes a single, say, 2 MB chunk of data turns that into a result. A result consists of any number of 'found it!' entries, and they come in 3 flavours:

  • Found a needle at X. (In the chunk, the entire needle is present starting at bytepos X and ending at bytepos X+needlesize).
  • Found a needle-suffix of length Y (the chunk starts with a bunch of bytes that overlap with the end of the needle. For example, the needle is Hello, World! and this chunk starts with orld!. That does not mean a needle is actually found, but it is part of what this chunk processor must report to the reducer).
  • Same, but on the other side: found a needle-prefix of length Y (the chunk ends in Hello, Wor).

The reducer can line up reports of needle-prefixes and needle-suffixes (e.g. for Hello, World!, which is 13 bytes long, if chunk 14 reports 'needle-prefix = 8' and chunk 15 reports 'needle-suffix = 5', there's a needle in the file starting at chunk15start minus 8).

The nananananaba issue (which is what Boyer-Moore tries to address) must also be handled here! If the needle is 'nananaba', one chunk ends in 'nanana' and the next chunk starts with 'nanaba', then there is a needle there, but it doesn't start at the first chunk's 'nanana'. It starts 4 characters later.
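The prefix/suffix bookkeeping gets tricky exactly because of this. A simpler alternative the reducer could use - a sketch, assuming each chunk result also carries the raw bytes next to its boundaries (called tail and head here, both hypothetical names) - is to re-run the search on the joined boundary region:

```java
import java.util.ArrayList;
import java.util.List;

public class BoundaryCheck {
    // Re-scans the boundary between two adjacent chunks. 'tail' holds at most the
    // last needle.length - 1 bytes of the left chunk, 'head' at most the first
    // needle.length - 1 bytes of the right chunk, and 'leftChunkEnd' is the
    // absolute file offset just past the left chunk.
    static List<Long> boundaryMatches(byte[] tail, byte[] head, byte[] needle, long leftChunkEnd) {
        byte[] joined = new byte[tail.length + head.length];
        System.arraycopy(tail, 0, joined, 0, tail.length);
        System.arraycopy(head, 0, joined, tail.length, head.length);

        List<Long> offsets = new ArrayList<>();
        for (int i = 0; i <= joined.length - needle.length; i++) {
            boolean match = true;
            for (int j = 0; j < needle.length && match; j++) {
                match = joined[i + j] == needle[j];
            }
            // Only matches that actually cross the boundary count; matches fully
            // inside one chunk were already reported by that chunk's worker.
            if (match && i < tail.length && i + needle.length > tail.length) {
                offsets.add(leftChunkEnd - tail.length + i);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // The 'nananaba' example: left chunk ends in "nanana", right chunk starts with "nanaba",
        // and the boundary sits at file offset 1000.
        byte[] needle = "nananaba".getBytes();
        List<Long> hits = boundaryMatches("nanana".getBytes(), "nanaba".getBytes(), needle, 1000);
        System.out.println(hits); // [998] - the needle starts 2 bytes before the boundary
    }
}
```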

This model is called map-reduce:

Each chunk processor maps the input data onto an intermediate result value. Generally results are vastly smaller than inputs.

A single reducer (sometimes running on its own dedicated thread, but it can simply be done by a worker thread in between doing a mapping operation) will take multiple mappings (these result values) and reduce them down to a single result value. Keep going until there's only one such result value left, from which it should be trivial to produce the required answer. The job of squashing 2+ mappings down to a single one is called reducing.

Hence, map-reduce. You can read on e.g. wikipedia for more details if you are intrigued.

  3. This is probably pointless

It depends on lots of factors, but on many, many systems, using multiple threads to do extremely basic processing (and this is extremely basic, even Boyer-Moore is) on files read from disk is pointless - the disk is the bottleneck, not the CPU. For example, on spinning disks (which, admittedly, are kinda old news now), this will trivially be significantly slower, because to provide data to all the 20 or whatnot threads asking for data, the disk needs to spin and spin, serving each thread from a separate slice of the disk. The disk is entirely the bottleneck, so the only job is to ensure the disk can provide the bytes as efficiently as possible - which, for a contiguous file (one where all bytes are laid out one after another), is best done by having one thread read the bytes from start to finish in one go.

At the other extreme, some sort of fast NVMe multi-RAID setup means contiguous access has no bearing on performance, and the CPU is probably the bottleneck rather than the RAID setup, so there multi-core would help.

I'd bet on multicore making things, if anything, slower. Processing 20GB worth of data off a middling speed disk is going to take a few seconds. Throwing threads at the problem isn't going to change this.

  4. How to determine chunk size?

It mostly doesn't matter. But, fire up the right number of threads.

Generally you know how many threads you want (usually about 2x the number of CPU cores you have, but, really, 'as many threads as CPU cores' is all you need). Thus, you can figure out chunk size by taking the total file size and dividing by that. However, it really doesn't matter. Assuming a disk that does not care whatsoever about how you pull bytes off of it (highly unlikely, see the previous point!): if you have a 10TB file and go with a chunk size of, say, 512MB, that means 20480 chunks. It's rather unlikely your CPU has 20480 cores. It doesn't matter though: if it has 20 cores, have 20 threads running, and each thread will just process a 512MB chunk, then move on to another chunk to work on; it's not going to have a meaningful effect on how fast this process goes. As long as the chunk sizes aren't extreme (a chunk size of, say, 20 bytes would be a problem) and all the cores are generally kept busy (trivial case: 10 chunks total and 20 cores would be bad, you want at least 20 jobs, preferably more like 200 so things can balance out if some chunks end up getting processed much faster than others) - you'll be fine.

The fork/join framework is designed for this kind of thing; you might want to look into it.
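For illustration only, here is a sketch of the chunk/thread bookkeeping using a plain ExecutorService rather than fork/join (the chunk sizing, the naive byte scan, and reading the file path and needle from the command line are all assumptions; a ForkJoinPool with RecursiveTask would work just as well):

```java
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class ParallelSearch {
    public static void main(String[] args) throws Exception {
        String path = args[0];                                            // file to search
        byte[] needle = args[1].getBytes(StandardCharsets.UTF_8);         // needle as raw bytes

        int threads = Runtime.getRuntime().availableProcessors();
        long fileLength = new java.io.File(path).length();
        // Aim for roughly 10 chunks per thread, never smaller than 2 MB.
        long chunkSize = Math.max(2L * 1024 * 1024, fileLength / (threads * 10L));

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<List<Long>>> results = new ArrayList<>();

        for (long start = 0; start < fileLength; start += chunkSize) {
            long len = Math.min(chunkSize + needle.length - 1, fileLength - start); // overlap by needle.length - 1
            long chunkStart = start;
            results.add(pool.submit(() -> {
                byte[] buffer = new byte[(int) len];
                try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
                    raf.seek(chunkStart);
                    raf.readFully(buffer);
                }
                // Naive scan of this chunk; report absolute file offsets of matches.
                List<Long> offsets = new ArrayList<>();
                for (int i = 0; i <= buffer.length - needle.length; i++) {
                    boolean match = true;
                    for (int j = 0; j < needle.length && match; j++) {
                        match = buffer[i + j] == needle[j];
                    }
                    if (match) offsets.add(chunkStart + i);
                }
                return offsets;
            }));
        }

        for (Future<List<Long>> f : results) {
            System.out.println(f.get());   // per-chunk lists of absolute byte offsets
        }
        pool.shutdown();
    }
}
```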

user22852225
  1. It's true that some UTF-8 characters consist of 2 or more bytes, as the answer above mentions, but it's possible that for your particular task the texts will only contain ASCII characters (which are always 1 byte). So I would recommend asking about that specifically, because you might save yourself from dealing with multi-byte characters.
  2. Using RandomAccessFile is great. For example, after finding your line you can get the line before it by reading the file backwards until you hit a newline character (a rough sketch follows after this list).
  3. Regarding "divided strings": if the text at the end of your 2MB part matches the beginning of the string you are searching for, you can just continue reading the file beyond this 2MB part - it's a RandomAccessFile after all.
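A rough sketch of point 2 (assuming '\n' line separators; the method name is made up, and a real implementation would read backwards in buffered blocks rather than byte by byte):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class LineStart {
    // Walks backwards from 'offset' until a '\n' is found (or the start of the
    // file is reached) and returns the offset of the first byte of that line.
    static long startOfLine(RandomAccessFile raf, long offset) throws IOException {
        long pos = offset;
        while (pos > 0) {
            raf.seek(pos - 1);
            if (raf.read() == '\n') break;  // the previous line ends here
            pos--;
        }
        return pos;
    }
}
```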
Lajos Arpad

You are correct with the idea that you can divide the String into substrings and search for matches via threads. You are also correct that a match may start in one chunk and end in the next chunk. So, it seems that in your case the solution is to:

  • if it's not the first thread
    • take bigS.substring(0, s.length() - 1) of your current chunk, let's call it start
    • search for matches of s in end + start, where end is the last s.length() - 1 characters of the previous chunk's String
  • search for matches of s normally
  • if it's not the last thread
    • save bigS.substring(bigS.length() - s.length() + 1) as end, so the next chunk will know about it

Now, since you have to work with threads, the algorithm above cannot be implemented sequentially. So we modify the idea: divide the String into chunks as suggested, let each chunk be processed by a thread (careful with the number of threads, though), and have each thread save the matches of its chunk but also return its start and end as values (see Returning value from Thread).

This way, when you iterate over your threads and join() them sequentially, you will always be able to glue together previousEnd + start and then overwrite previousEnd with the end your current thread returned, in order to find between-chunk matches.
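A minimal sketch of that idea using Callable instead of raw Thread (the ChunkResult name and the in-memory example chunks are assumptions for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class ChunkedStringSearch {
    // What each worker returns: the match positions inside its chunk plus the
    // chunk's first and last s.length() - 1 characters for boundary checks.
    record ChunkResult(List<Integer> matches, String start, String end) {}

    static ChunkResult search(String chunk, String s) {
        List<Integer> matches = new ArrayList<>();
        for (int i = chunk.indexOf(s); i >= 0; i = chunk.indexOf(s, i + 1)) {
            matches.add(i);
        }
        int k = Math.min(s.length() - 1, chunk.length());
        return new ChunkResult(matches, chunk.substring(0, k), chunk.substring(chunk.length() - k));
    }

    public static void main(String[] args) throws Exception {
        List<String> chunks = List.of("xxnana", "naba yy", "more text"); // pretend these came from the file
        String s = "nanaba";

        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<ChunkResult>> futures = new ArrayList<>();
        for (String chunk : chunks) {
            futures.add(pool.submit(() -> search(chunk, s)));
        }

        String previousEnd = "";
        for (Future<ChunkResult> f : futures) {   // iterate in chunk order, like join()-ing sequentially
            ChunkResult r = f.get();
            System.out.println("in-chunk matches: " + r.matches());
            // Glue previousEnd + start to catch matches crossing the chunk boundary.
            if (!previousEnd.isEmpty() && (previousEnd + r.start()).contains(s)) {
                System.out.println("match across chunk boundary");
            }
            previousEnd = r.end();
        }
        pool.shutdown();
    }
}
```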