Is there a way to make PDFs accessible to more than just Chrome/Acrobat using Node.js?


I'm scraping PDF attachments from a website. The files open fine in Google Chrome and Adobe Acrobat, but if I try to open them with my Mac's "Preview" app, or if I pass a PDF to a downstream service, they fail to open.

The PDF isn't corrupted per se, as I can manually open it in Chrome, print to PDF, and then open the result as needed in Preview or wherever.

The problem is, I don't have any automated way to make these PDFs usable across my application. My current idea is to download the file locally, open it using Puppeteer or Playwright, print to PDF, and use that copy from there. I'm not a huge fan of this method, as it is slow and error-prone.

Does anyone know of any Node.js libraries I can use to "un-corrupt" these PDFs? I'm not exactly sure what Chrome/Acrobat do differently, but I'd love a way to strip out whatever is making these PDFs incompatible.
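
One library-based idea I've been looking at (not yet verified against these particular files) is to simply load and re-save them with pdf-lib, since saving re-serializes the document and regenerates the cross-reference data. A minimal sketch, with placeholder file names:

```js
// Sketch only: load the damaged PDF with pdf-lib and save it again, which
// re-serializes the whole file and regenerates the cross-reference data.
// Assumes pdf-lib's parser tolerates whatever is wrong with the file;
// file names are placeholders.
const fs = require('fs/promises');
const { PDFDocument } = require('pdf-lib');

async function resave(inputPath, outputPath) {
  const bytes = await fs.readFile(inputPath);
  // Be lenient with objects the parser cannot make sense of.
  const doc = await PDFDocument.load(bytes, { throwOnInvalidObject: false });
  const fixed = await doc.save({ useObjectStreams: false });
  await fs.writeFile(outputPath, fixed);
}

resave('damaged.pdf', 'fixed.pdf').catch(console.error);
```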

An example PDF can be found here.

Answer by mkl:

Analyzing your example PDF, one finds that it is linearized (aka "web optimized") and that the offset stored in the first (front) cross-reference section pointing to the second (end) cross-reference section is off by a few bytes. Also, the file length claimed in the linearization dictionary is less than the actual size (by 7 bytes). Thus, your document is corrupted.
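
For reference, you can see this symptom from Node.js by comparing the /L entry of the linearization dictionary with the actual file size. A minimal sketch, assuming the dictionary sits near the start of the file (as it does for linearized PDFs) and using a placeholder file name:

```js
// Compare the file length claimed in the linearization dictionary (/L)
// with the actual size on disk; a mismatch is the symptom described above.
const fs = require('fs');

const path = 'example.pdf';
const head = fs.readFileSync(path).subarray(0, 2048).toString('latin1');
const match = /\/Linearized[^>]*?\/L\s+(\d+)/.exec(head);
if (match) {
  const claimed = Number(match[1]);
  const actual = fs.statSync(path).size;
  console.log(`/L claims ${claimed} bytes, file is ${actual} bytes`,
              claimed === actual ? '(consistent)' : '(mismatch)');
} else {
  console.log('No linearization dictionary found near the start of the file.');
}
```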

To find out what actually went wrong, I also downloaded the original file. That file does not show the same error: it is shorter by 7 bytes, and its offsets and lengths are correct.

Comparing your file to the original shows that the RDF metadata in your file have been manipulated:

[Image: comparison of the RDF metadata in both files]

As you can see, the string "http" therein has been replaced by "https". This replacement is wrong in multiple ways:

First of all, if you change the RDF metadata inside an XMP packet, your changes must not move any bytes outside the xpacket begin ... xpacket end range (this is why XMP packets carry whitespace padding, so edits can be made in place). In your case each replacement added a byte, so the end of the packet and everything after it has been shifted by seven bytes, invalidating the offsets that refer to the material there.

Furthermore, the change itself is nonsense: changing "http" to "https" in namespace URIs doesn't make anything more secure (they are identifiers compared literally, not links that get fetched) but damages the XML, because the altered URIs no longer match the well-known namespaces. And changing the "https://" in the pdfx:_dlc_DocIdUrl value to "httpss://" is obvious nonsense.
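
If you want to detect such damaged downloads programmatically before passing them on, a crude heuristic is to search the raw bytes for the altered URIs. A minimal sketch, with a placeholder file name and an example list of namespaces:

```js
// Heuristic check for the blind "http" -> "https" rewrite: the well-known
// XMP/RDF namespace URIs are defined with plain http://, so finding their
// https:// variants (or the mangled "httpss://") suggests the bytes were
// altered after download.
const fs = require('fs');

const raw = fs.readFileSync('example.pdf').toString('latin1');
const suspicious = [
  'httpss://',                                    // doubly mangled URL
  'https://www.w3.org/1999/02/22-rdf-syntax-ns#', // should be http://
  'https://ns.adobe.com/xap/1.0/',                // should be http://
  'https://purl.org/dc/elements/1.1/',            // should be http://
];

for (const needle of suspicious) {
  if (raw.includes(needle)) {
    console.log(`Found altered URI: ${needle}`);
  }
}
```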

Thus, you should first of all find out what in your scraping process (or in some post-processing step) damages the metadata like that, and switch it off.
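
If the rewrite happens in your own download code (for example, the response being handled as text and run through a link-rewriting step), fetching and writing the raw bytes sidesteps it. A minimal sketch using Node's built-in fetch (Node 18+), with a placeholder URL:

```js
// Download the attachment as raw bytes and write them untouched, so no
// string-level processing (such as an http -> https rewrite) can alter
// the PDF on its way to disk.
const fs = require('fs/promises');

async function downloadPdf(url, outputPath) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Download failed: ${response.status}`);
  }
  const bytes = Buffer.from(await response.arrayBuffer());
  await fs.writeFile(outputPath, bytes);
}

downloadPdf('https://example.com/attachment.pdf', 'attachment.pdf')
  .catch(console.error);
```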


I have to revise my recommendation above: apparently (see my comments below) you only get valid PDFs for a short time after requesting and viewing (probably including JavaScript actions) the page with the download link.

Thus, it appears you cannot really improve your scraper; there is something weird going on on that web server.