Why is subprocess with waitpid crashing?


I am trying to download URLs in parallel with the following:

    def parallel_download_files(self, urls, filenames):
        pids = []
        for (url, filename) in zip(urls, filenames):
            pid = os.fork()
            if pid == 0:
                open(filename, 'wb').write(requests.get(url).content)
            else:
                pids.append(pid)
        for pid in pids:
            os.waitpid(pid, os.WNOHANG)

But when I execute it with a list of URLs and filenames, memory usage on the machine keeps climbing until it crashes. From the documentation, I thought waitpid's options would be handled correctly by setting them to os.WNOHANG. This is my first time parallelizing with forks; I have previously done such tasks with concurrent.futures.ThreadPoolExecutor.

1 Answer

Answered by SIGHUP:

Using os.fork() is far from ideal here, especially because the two processes created by each fork (parent/child) are never handled. A child that finishes its download does not exit, so it falls through to the next loop iteration and forks children of its own; the process count grows exponentially, which is what exhausts memory. In addition, os.waitpid with os.WNOHANG returns immediately instead of waiting, so finished children are left as zombies. Multithreading is far superior for this use case.
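If you do want to stay with fork, here is a minimal corrected sketch. The actual download is replaced by a placeholder write (so the sketch runs without network access; in the real code you would call requests.get there). The two fixes are that each child calls os._exit() when done, and the parent waits with a blocking waitpid (option 0, not os.WNOHANG):

```python
import os


def download_one(url, filename):
    # Placeholder for the real download (requests.get(url).content);
    # it writes the URL text itself so the sketch needs no network.
    with open(filename, 'wb') as f:
        f.write(url.encode())


def parallel_download_files(urls, filenames):
    pids = []
    for url, filename in zip(urls, filenames):
        pid = os.fork()
        if pid == 0:
            # Child: do the work, then exit immediately so it never
            # falls back into the loop and forks children of its own.
            try:
                download_one(url, filename)
            finally:
                os._exit(0)
        pids.append(pid)
    # Parent: passing 0 (rather than os.WNOHANG) makes waitpid block
    # until each child has finished, so no zombies are left behind.
    for pid in pids:
        os.waitpid(pid, 0)
```

Even corrected, this spawns one process per URL, which is far heavier than a thread pool for I/O-bound work.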

For example:

from concurrent.futures import ThreadPoolExecutor as TPE
from requests import get as GET


def parallel_download_files(urls, filenames):
    def _process(t):
        url, filename = t
        try:
            # Fetch the URL and raise on any HTTP error status;
            # a timeout keeps a stalled server from hanging a thread forever
            (r := GET(url, timeout=30)).raise_for_status()
            with open(filename, 'wb') as output:
                output.write(r.content)
        except Exception as e:
            print('Failed:', url, filename, e)

    # map() consumes the zipped pairs and the context manager blocks
    # until every download has finished
    with TPE() as executor:
        executor.map(_process, zip(urls, filenames))

urls = ['https://www.bbc.co.uk', 'https://news.bbc.co.uk']
filenames = ['www.txt', 'news.txt']

parallel_download_files(urls, filenames)

Note:

If any names are duplicated in the filenames list, you'll need a more complex strategy to ensure that no two threads ever write to the same file.
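One simple strategy is to collapse the (url, filename) pairs into a dict keyed by filename before submitting them, so each file is written by at most one thread. The `dedup_downloads` helper below is a hypothetical illustration; it implements a "last URL wins" policy, which is only one possible choice (keeping the first, or raising on duplicates, are equally valid):

```python
def dedup_downloads(urls, filenames):
    # Dict keys are unique, so any filename appearing more than once
    # keeps only its last associated URL ("last wins" policy).
    unique = {}
    for url, filename in zip(urls, filenames):
        unique[filename] = url
    # Return (url, filename) pairs ready to feed to executor.map
    return [(url, filename) for filename, url in unique.items()]
```

The resulting pairs can then be passed to `executor.map(_process, ...)` in place of the raw zip.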