How should I handle and work with a large number of files in Node.js?


I have an array containing approximately 500,000 JSON file paths. The average size of the files is 20-100 kilobytes. Each JSON file contains a numeric value that needs to be read and added to an accumulated total.

If I read the files synchronously in an iteration, the processing quickly becomes slow. If I read the files asynchronously in an iteration, I receive a "too many files open" error.

My question is, what method, iteration, and file handling approach would be suitable for performing this task? I'm relatively new to Node.js and I'm struggling with the above problem.

const fs = require('fs');

function readFiles(filePaths) {
    filePaths.forEach(filePath => {
        fs.readFile(filePath, 'utf8', (err, data) => {
            // do something
        });
    });
}

readFiles(filePaths); // it's an array of strings
1 Answer

Always Learning

When you use filePaths.forEach it whips through the big array quickly, and on each iteration it starts an async fs.readFile call which returns immediately while the work of opening and reading the file happens in the background. That means you quickly end up with hundreds of thousands of files open at the same time, which is the EMFILE ("too many open files") error you're seeing.

To limit how many files are open at once you'll need to make sure you don't create too many async calls simultaneously. One way to do this is to use a PromisePool (see https://www.npmjs.com/package/@supercharge/promise-pool). This allows you to set the number of simultaneous promises to 50 or whatever is good for your situation and it will kick off new file reads as previous ones complete.

Using PromisePool (together with the promise-based readFile from node:fs/promises, so the pool can tell when each read has actually finished) your code might look like this:

import { PromisePool } from '@supercharge/promise-pool'
import { readFile } from 'node:fs/promises'

async function readFiles(filePaths) {
    const { results, errors } = await PromisePool
        .withConcurrency(50)
        .for(filePaths)
        .process(async (filePath) => {
            const data = await readFile(filePath, 'utf8')
            // do something with data
            return data
        })
    return results
}

readFiles(filePaths); // it's an array of strings

It's essentially the same flow you had, but by using PromisePool.for().process() it makes sure only a certain number of reads are active at any one time. This should stop you from overloading the system by having too many files open at once.