Is it better to do smaller, more frequent writes or fewer, bigger writes?


I am using Python to develop a JSON parser. The idea is to write a JSON file that holds a specific token ($$INSERT_VAR$$). Since this is only a single token whose value(s) I obtain through a command, I think this could be a great environment to learn multiprocessing in Python.

My idea is a parent process that reads from the input file to append variables to the JSON, launching a child process that runs the command which obtains the values to be written. When a child finishes (SIGCHLD), the parent handles the appropriate data collection from the (correct) child through a pipe and carries on with the main read loop.
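A minimal sketch of that child-launching step, assuming the value-producing command is a hypothetical ./get_value.sh and using subprocess rather than a raw fork/SIGCHLD handler (this simplified version just waits for the child, so no signal handling is needed):

import subprocess

# Hypothetical command that prints the value(s) for the token; replace it
# with the real one.
CMD = ['./get_value.sh']

def fetch_value():
    # Launch the child and read its output through a pipe (its stdout);
    # communicate() waits for the child and reaps it.
    proc = subprocess.Popen(CMD, stdout=subprocess.PIPE, text=True)
    out, _ = proc.communicate()
    return out.strip()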

So my first design issue is inserting data into the middle of a file. Since this is basically not possible, my thought is to write text to the resulting file while there isn't a child fetching values, storing the text that follows a token in a variable; that text would then be appended to the obtained result and written out to the file as a whole.
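As a rough sketch of that buffering idea (split_around_token is my own name; the real flow would write head immediately, keep tail in a variable, and only write the child's value plus tail once the child has reported back):

TOKEN = '$$INSERT_VAR$$'

def split_around_token(text):
    # Text before the token can be written to the result file right away;
    # the text after the token is what gets held back in a variable until
    # the child's value is available.
    head, found, tail = text.partition(TOKEN)
    return head, bool(found), tail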

My question would be: Which is better?

  • Doing more frequent, smaller reads to obtain the text character by character
  • Doing less frequent, larger reads (i.e. a line or N characters at a time), which the parent then processes

I personally am leaning towards the second one, but I would like to know if there are any advantages and/or drawbacks to my approach; a rough sketch of both options follows.
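Both options, sketched under the assumption that the parent is scanning for the token (the function names are mine, and the line-based version assumes at most one token per line):

TOKEN = '$$INSERT_VAR$$'

def scan_char_by_char(fh):
    # Option 1: one character per read() call, assembling the text manually.
    window = ''
    while True:
        ch = fh.read(1)
        if not ch:
            break
        window += ch
        if window.endswith(TOKEN):
            yield window[:-len(TOKEN)], True   # text before the token, token found
            window = ''
    yield window, False

def scan_by_line(fh):
    # Option 2: one line per read, splitting around the token afterwards.
    for line in fh:
        head, found, tail = line.partition(TOKEN)
        yield head, bool(found)
        if found:
            yield tail, False

Either generator yields (text, token_seen) pairs, so the parent's main loop does not need to care which read strategy sits underneath.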


There are 2 answers below.

Answer from jwal:

The right size is system-dependent; however, one character at a time will be the slowest everywhere.

The following is a bit rough but might be a useful starting point. It implements enough of the methods to behave like a file and acts as an interface between your file-reading code and the actual file. You can expand the line.replace with something a bit more general; one such generalization is sketched after the example output.

a_text_file.txt

Hello $$INSERT_VAR$$
Voila

script.py

class VarReplace:
    def __init__(self, filename):
        self.fn = filename
        self.fh = None
        self.buffer = ''

    def __enter__(self):
        if self.fn is not None:
            self.fh = open(self.fn, 'r')
        return self

    def __exit__(self, _type, _value, _tb):
        if self.fn is not None:
            self.fh.close()

    def read(self, size):
        eof = False
        while self.fn is not None and not eof and len(self.buffer) < size:
            # translation between a chunk-size view of the file and a view
            # of the file based on lines. Removes the possibility of splitting
            # your replace string.
            line = self.fh.readline()
            if line == '':
                eof = True
            else:
                line = line.replace('$$INSERT_VAR$$', 'Bob')
            self.buffer += line
        if len(self.buffer) > size:
            chunk = self.buffer[:size]
            self.buffer = self.buffer[size:]
        else:
            chunk = self.buffer
            self.buffer = ''
        return chunk

file = 'a_text_file.txt'
with VarReplace(file) as stream:
    print(stream.read(8192))

When you run this, the output is ...

Hello Bob
Voila
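As suggested above, the line.replace call can be expanded into something more general; one possible sketch is a regex substitution driven by a dictionary of token values (the values dict here is just an assumed example, filled in by whatever obtains the real values):

import re

# Example mapping; in the real program these values would come from the
# command / child processes.
values = {'$$INSERT_VAR$$': 'Bob'}

def substitute(line):
    # Replace every $$NAME$$ token that has a known value and leave any
    # unknown tokens untouched.
    return re.sub(r'\$\$\w+\$\$', lambda m: values.get(m.group(0), m.group(0)), line)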
Answer from Karim Baidar:

In general, doing less frequent, larger reads is usually more efficient than doing more frequent, smaller reads. This is because reading from a file involves some overhead, such as moving the file pointer and accessing the disk. By reading larger chunks of data at once, you reduce the number of times you need to access the disk and therefore improve the performance of your program.
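A quick, unscientific way to see the gap on your own machine (the file name is a placeholder for any reasonably large text file; absolute numbers will vary by system):

import timeit

PATH = 'some_large_file.txt'  # placeholder: any large text file

def char_at_a_time():
    with open(PATH) as fh:
        while fh.read(1):
            pass

def big_chunks():
    with open(PATH) as fh:
        while fh.read(64 * 1024):
            pass

print('1 char per read :', timeit.timeit(char_at_a_time, number=5))
print('64 KiB per read :', timeit.timeit(big_chunks, number=5))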

However, the best approach for your specific use case may depend on the structure of your input file and the way you need to process the data. If your input file is well-structured and you can easily determine where to split the data into chunks, then doing large reads may be a good approach. On the other hand, if your input file is not well-structured or if you need to do more complex processing on the data, then doing smaller reads may be easier to implement.