I have a multi-page PDF document (each page contains an image) that I need to read and split into several smaller PDF documents, each below a certain size. The code below works without issue in PDFBox 3.0.0 but produces corrupted PDF documents in PDFBox 3.0.1.
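
    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import java.nio.file.Files;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.UUID;
    import org.apache.pdfbox.Loader;
    import org.apache.pdfbox.pdmodel.PDDocument;

    long currentSize = 0L;
    try (PDDocument document = Loader.loadPDF(file)) {
        List<PDDocument> documentList = new ArrayList<>();
        for (int i = 0; i < document.getPages().getCount(); i++) {
            // Measure the page by saving it alone into an in-memory document.
            PDDocument sizeCheck = new PDDocument();
            sizeCheck.addPage(document.getPage(i));
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            sizeCheck.save(baos);
            long pageSize = baos.size();
            // Start a new output document for the first page, or when adding
            // this page would exceed the size limit (in bytes).
            if (currentSize == 0L || (currentSize + pageSize) > 7485760L) {
                PDDocument tempDoc = new PDDocument();
                tempDoc.addPage(document.getPage(i));
                documentList.add(tempDoc);
                currentSize = pageSize;
            } else {
                // Otherwise append the page to the most recent output document.
                documentList.get(documentList.size() - 1).addPage(document.getPage(i));
                currentSize += pageSize;
            }
        }
        // Save each output document while the source document is still open.
        for (PDDocument doc : documentList) {
            // createTempFile expects a String prefix, hence toString().
            File temp = Files.createTempFile(UUID.randomUUID().toString(), ".pdf").toFile();
            doc.save(temp);
            doc.close();
        }
    }
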
In the corrupted PDFs, the image displays on the first page, but all subsequent pages are missing the image and show the warning message 'There was an error processing a page. There was a problem reading this document (18).'
Reverting to PDFBox 3.0.0 eliminates the error and the PDF documents display with the image as intended.
This only happens with a PDF document that contains images. If I run the same code with a PDF containing only text, the PDFs produced have no issues.
The problem is specifically caused by the save to the ByteArrayOutputStream that I use to measure the page size. If I remove that save, the issue does not occur.
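Since the measuring save is the trigger, it may help to look at the grouping rule on its own, with no PDFBox calls involved. The following is only a minimal sketch of that rule: the class name `PageGrouping`, the method `groupPages`, and the sample page sizes are made up for illustration and are not part of my real code. It takes already-measured page sizes (in bytes) and returns the page indices that would go into each output document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: isolates the size-based grouping from any PDF I/O.
class PageGrouping {
    // Same byte limit as in the code above.
    static final long LIMIT = 7485760L;

    // Greedily packs page indices into groups whose summed size stays at or
    // below the limit. A page larger than the limit still gets its own group.
    static List<List<Integer>> groupPages(long[] pageSizes, long limit) {
        List<List<Integer>> groups = new ArrayList<>();
        long currentSize = 0L;
        for (int i = 0; i < pageSizes.length; i++) {
            if (groups.isEmpty() || currentSize + pageSizes[i] > limit) {
                // Start a new group with this page.
                List<Integer> group = new ArrayList<>();
                group.add(i);
                groups.add(group);
                currentSize = pageSizes[i];
            } else {
                // Page still fits into the most recent group.
                groups.get(groups.size() - 1).add(i);
                currentSize += pageSizes[i];
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Invented sizes, only to show the grouping behaviour.
        long[] sizes = {3_000_000L, 3_000_000L, 3_000_000L, 8_000_000L, 1_000_000L};
        System.out.println(groupPages(sizes, LIMIT)); // prints [[0, 1], [2], [3], [4]]
    }
}
```

With the grouping separated like this, the page sizes could in principle come from any source, so the measuring step can be swapped out or tested independently of the documents that are actually written.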