Corrupted documents when copying PDF pages with images to new documents using PDFBox 3.0.1


I have a multi-page PDF document (with an image on each page) that I need to read and split into several PDF documents, each below a certain size. The code below works without issue in PDFBox 3.0.0 but produces corrupted PDF documents in PDFBox 3.0.1.

long currentSize = 0L;
try (PDDocument document = Loader.loadPDF(file)) {
    List<PDDocument> documentList = new ArrayList<>();
    for (int i = 0; i < document.getPages().getCount(); i++) {
        PDDocument sizeCheck = new PDDocument();
        sizeCheck.addPage(document.getPage(i));
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        sizeCheck.save(baos);
        long pageSize = baos.size();

        if (currentSize==0L || (currentSize+pageSize) > 7485760L) {
            PDDocument tempDoc = new PDDocument();
            tempDoc.addPage(document.getPage(i));
            documentList.add(tempDoc);
            currentSize=pageSize;
        } else {
            documentList.get(documentList.size() - 1).addPage(document.getPage(i));
            currentSize+=pageSize;
        }
    }

    for (PDDocument doc : documentList) {
        // Files.createTempFile expects a String prefix, so convert the UUID explicitly
        File temp = Files.createTempFile(UUID.randomUUID().toString(), ".pdf").toFile();
        doc.save(temp);
        doc.close();
    }
}
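
To make the intended grouping easier to reason about independently of PDFBox, the size-based logic in the loop above can be sketched on its own. This is only an illustration: the class name PageGrouper and the method group are hypothetical, the page sizes are made-up numbers rather than measured values, and the "first page" case is checked with groups.isEmpty() instead of currentSize==0L.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: mirrors the grouping rule of the loop above.
final class PageGrouper {

    /**
     * Greedily assigns page indices to groups so that a page is appended to the
     * current group only while the running total stays at or below the limit.
     * A page that is larger than the limit ends up in a group of its own.
     */
    static List<List<Integer>> group(long[] pageSizes, long limit) {
        List<List<Integer>> groups = new ArrayList<>();
        long currentSize = 0L;
        for (int i = 0; i < pageSizes.length; i++) {
            // Start a new group for the first page or when the limit would be exceeded
            if (groups.isEmpty() || currentSize + pageSizes[i] > limit) {
                groups.add(new ArrayList<>());
                currentSize = 0L;
            }
            groups.get(groups.size() - 1).add(i);
            currentSize += pageSizes[i];
        }
        return groups;
    }

    public static void main(String[] args) {
        long[] sizes = {4L, 3L, 5L, 9L, 1L}; // illustrative sizes only
        System.out.println(group(sizes, 8L)); // prints [[0, 1], [2], [3], [4]]
    }
}
```

With the grouping computed separately like this, the per-page size measurement and the construction of the output documents become two distinct steps, which may help narrow down which of them triggers the corruption.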

Each corrupted PDF displays the image on its first page, but on all subsequent pages the image is missing and the viewer shows the warning 'There was an error processing a page. There was a problem reading this document (18).'

Reverting to PDFBox 3.0.0 eliminates the error and the PDF documents display with the image as intended.

This only happens with PDF documents that contain images. If I run the same code on a text-only PDF, the PDFs produced have no issues.

The problem is specifically caused by saving the temporary sizeCheck document to the ByteArrayOutputStream, which I use to measure the page size. If I remove that save, the issue does not occur.
