How to avoid losing data when pausing and resuming PapaParse streams


I have a PapaParse function for processing large (>5 GB) TSV files.

return new Promise((resolve, reject) => {
  Papa.parse<T>(this.file, {
    ...
    step: (result: ParseStepResult<T>, parser: Papa.Parser) => {
      this.batch.push(result.data);
      if (this.batchCondition(this.batch)) {
        this.file?.pause();
        parser.pause();
        this.saveData(this.batch).then(() => {
          this.batch = [];
          parser.resume();
          this.file?.resume();
        });
      }
    },
    complete: async () => {
      await this.saveData(this.batch);
      this.onCompletion();
      resolve();
    },
  });
});

My issue is that I have found I need to call pause() and resume() on both the file ReadableStream and the PapaParse Parser object. Otherwise the entire file somehow gets read and the process runs out of memory.

However, data is being lost somewhere between consecutive pause and resume calls: when I rerun on the same file a few times, each run finds a non-zero number of new entries to save.

Pausing the file ReadableStream first and resuming it last seems to lose less data than pausing and resuming the Parser in that order, but the loss is still non-zero.
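
One possible contributor to the loss, not confirmed against PapaParse's internals: if step fires again after pause() has been requested but before the stream has actually stopped, those rows are pushed onto this.batch and then thrown away when this.batch = [] runs after the save resolves. A minimal sketch of detaching the batch before saving follows. The Batcher class and its simulated save are hypothetical stand-ins, not part of PapaParse or of the code above.

```typescript
// Hypothetical stand-in for the class in the question. Rows are pushed into
// `batch`; a full batch is handed to an async save.
class Batcher<T> {
  private batch: T[] = [];
  private readonly batchSize: number;
  readonly saved: T[] = [];

  constructor(batchSize: number) {
    this.batchSize = batchSize;
  }

  // Simulated async save (assumption): yields once, then records the rows.
  private async save(rows: T[]): Promise<void> {
    await Promise.resolve();
    this.saved.push(...rows);
  }

  // Returns a promise only when a save was started.
  push(row: T): Promise<void> | undefined {
    this.batch.push(row);
    if (this.batch.length < this.batchSize) return undefined;
    // Detach the full batch BEFORE the save starts, so rows that arrive
    // while the save is in flight go into a fresh array and are not wiped.
    const toSave = this.batch;
    this.batch = [];
    return this.save(toSave);
  }

  // Save whatever is left, e.g. from a `complete` callback.
  flush(): Promise<void> {
    const toSave = this.batch;
    this.batch = [];
    return this.save(toSave);
  }
}
```

With this shape, rows pushed while a save is still pending land in the next batch, so a late step call would not by itself cause a loss. Whether late step calls actually happen here is an assumption that would need checking.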

Is there an established pattern for this?
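
One pattern that may fit, offered as a sketch rather than a confirmed fix, is to let the consumer pull rows with for await instead of pausing and resuming by hand: the loop does not ask for the next row until the pending save has resolved. The saveInBatches helper and the fakeRows source below are hypothetical names introduced for illustration.

```typescript
// Hypothetical helper: drain any async row source in fixed-size batches.
// Returns the number of rows handed to `save`.
async function saveInBatches<T>(
  rows: AsyncIterable<T>,
  batchSize: number,
  save: (batch: T[]) => Promise<void>,
): Promise<number> {
  let batch: T[] = [];
  let total = 0;
  for await (const row of rows) {
    batch.push(row);
    if (batch.length >= batchSize) {
      // The loop requests no further rows until this save resolves.
      await save(batch);
      total += batch.length;
      batch = [];
    }
  }
  // Save the final, possibly partial, batch.
  if (batch.length > 0) {
    await save(batch);
    total += batch.length;
  }
  return total;
}

// Stand-in row source for demonstration only.
async function* fakeRows(n: number): AsyncGenerator<number> {
  for (let i = 0; i < n; i++) yield i;
}
```

Node.js Readable streams are async iterable, so an object-mode stream of parsed rows could be passed as rows. Wiring that up with PapaParse, for example by piping the file stream into the duplex stream returned by Papa.parse(Papa.NODE_STREAM_INPUT, options), is an assumption to verify against the PapaParse documentation.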
