PDFBox Preflight parser is not able to detect PDF/A-1b file


I am using the following code to detect whether a file is a PDF/A-1b file:

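import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.preflight.Format;
import org.apache.pdfbox.preflight.PreflightDocument;
import org.apache.pdfbox.preflight.ValidationResult;
import org.apache.pdfbox.preflight.parser.PreflightParser;

public boolean isPDF_A1BFile(File file) throws IOException {
    // Parse the file against the PDF/A-1b rule set, then run the validation.
    PreflightParser parser = new PreflightParser(file);
    parser.parse(Format.PDF_A1B);
    PreflightDocument preflightDocument = parser.getPreflightDocument();
    preflightDocument.validate();

    ValidationResult validationResult = preflightDocument.getResult();

    return validationResult.isValid(); // returns false in every case
}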

But it always returns false, whether or not the file is PDF/A-1b. I am testing with this PDF/A-1b file. I validated it with the Preflight tool in Acrobat, which reports that the file is PDF/A-1b compliant (screenshot attached). Can anyone tell me what is wrong in my code, or what I am missing?

Also, is there any way to check whether a file is PDF/A-2b compliant?

Answer by K J:

The file is tolerated by some PDF applications, since many of them silently repair such discrepancies, but PDFBox detects many oddities. I did not spend much time on it, but the reported errors look plausible, so the file is potentially non-conformant.

The file Doc1-withHelvetica-pdfa1b.pdf is not a valid PDF/A-1b file, error(s) :
1.2.2 : Body Syntax error, Expected 'EOL' before the endstream keyword at offset 32264 but found '101'
1.2.5 : Body Syntax error, Stream length is invalid [dic=COSDictionary{COSName{Length}:COSInt{8702};COSName{Subtype}:COSName{XML};COSName{Type}:COSName{Metadata};}; defined length=8702; actual length=8702, starting offset=23561
1.2.2 : Body Syntax error, Expected 'EOL' before the endstream keyword at offset 35134 but found '101'
1.2.5 : Body Syntax error, Stream length is invalid [dic=COSDictionary{COSName{Filter}:COSName{FlateDecode};COSName{Length}:COSInt{2574};COSName{N}:COSInt{3};COSName{Range}:COSArray{COSFloat{0.0};COSFloat{1.0};0;1065353216;0;1065353216;};}; defined length=2574; actual length=2574, starting offset=32559
1.2.2 : Body Syntax error, Expected 'EOL' before the endstream keyword at offset 1562 but found '101'
1.2.5 : Body Syntax error, Stream length is invalid [dic=COSDictionary{COSName{Filter}:COSName{FlateDecode};COSName{Length}:COSInt{202};}; defined length=202; actual length=202, starting offset=1359
1.2.2 : Body Syntax error, Expected 'EOL' before the endstream keyword at offset 4486 but found '101'
1.2.5 : Body Syntax error, Stream length is invalid [dic=COSDictionary{COSName{Alternate}:COSName{DeviceRGB};COSName{Filter}:COSName{FlateDecode};COSName{Length}:COSInt{2612};COSName{N}:COSInt{3};}; defined length=2612; actual length=2612, starting offset=1873
1.2.2 : Body Syntax error, Expected 'EOL' before the endstream keyword at offset 4640 but found '101'
1.2.5 : Body Syntax error, Stream length is invalid [dic=COSDictionary{COSName{Filter}:COSName{FlateDecode};COSName{Length}:COSInt{17};}; defined length=17; actual length=17, starting offset=4622
1.2.2 : Body Syntax error, Expected 'EOL' before the endstream keyword at offset 15067 but found '101'
1.2.5 : Body Syntax error, Stream length is invalid [dic=COSDictionary{COSName{Filter}:COSName{FlateDecode};COSName{Length}:COSInt{10342};COSName{Length1}:COSInt{27968};}; defined length=10342; actual length=10342, starting offset=4724
1.2.2 : Body Syntax error, Expected 'EOL' before the endstream keyword at offset 16081 but found '101'
1.2.5 : Body Syntax error, Stream length is invalid [dic=COSDictionary{COSName{Filter}:COSName{FlateDecode};COSName{Length}:COSInt{407};}; defined length=407; actual length=407, starting offset=15673
1.2.2 : Body Syntax error, Expected 'EOL' before the endstream keyword at offset 22792 but found '101'
1.2.5 : Body Syntax error, Stream length is invalid [dic=COSDictionary{COSName{Filter}:COSName{FlateDecode};COSName{Length}:COSInt{6627};COSName{Length1}:COSInt{15080};}; defined length=6627; actual length=6627, starting offset=16164
1.2.2 : Body Syntax error, Expected 'EOL' before the endstream keyword at offset 23435 but found '101'
1.2.5 : Body Syntax error, Stream length is invalid [dic=COSDictionary{COSName{Filter}:COSName{FlateDecode};COSName{Length}:COSInt{355};}; defined length=355; actual length=355, starting offset=23079
1.2.2 : Body Syntax error, Expected 'EOL' before the endstream keyword at offset 822 but found '101'
1.2.5 : Body Syntax error, Stream length is invalid [dic=COSDictionary{COSName{Filter}:COSName{FlateDecode};COSName{I}:COSInt{93};COSName{Length}:COSInt{85};COSName{S}:COSInt{39};}; defined length=85; actual length=85, starting offset=736
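
A side note on reading that report: the errors come in pairs, and 101 is the decimal ASCII code for 'e'. The sketch below is not part of PDFBox; `PreflightLogCheck` and `gaps` are made-up names for illustration. It assumes that "starting offset" means the first byte of the stream data, and it computes how far each reported 1.2.2 offset lies past the end of the stream described by the 1.2.5 line that follows it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PreflightLogCheck {

    // Matches the tail of a 1.2.2 line: "at offset 1562 but found '101'".
    private static final Pattern DELIMITER_ERROR =
            Pattern.compile("at offset (\\d+) but found '(\\d+)'");

    // Matches the tail of a 1.2.5 line:
    // "defined length=202; actual length=202, starting offset=1359".
    private static final Pattern LENGTH_ERROR =
            Pattern.compile("defined length=(\\d+); actual length=(\\d+), starting offset=(\\d+)");

    /** For each 1.2.2/1.2.5 pair: reported offset minus (starting offset + defined length). */
    public static List<Long> gaps(List<String> logLines) {
        List<Long> gaps = new ArrayList<>();
        long reportedOffset = -1;
        for (String line : logLines) {
            Matcher delimiter = DELIMITER_ERROR.matcher(line);
            Matcher length = LENGTH_ERROR.matcher(line);
            if (delimiter.find()) {
                // Remember the offset until the matching 1.2.5 line arrives.
                reportedOffset = Long.parseLong(delimiter.group(1));
            } else if (length.find() && reportedOffset >= 0) {
                long endOfData = Long.parseLong(length.group(3)) + Long.parseLong(length.group(1));
                gaps.add(reportedOffset - endOfData);
                reportedOffset = -1;
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Two pairs copied from the report above.
        List<String> sample = List.of(
            "1.2.2 : Body Syntax error, Expected 'EOL' before the endstream keyword at offset 1562 but found '101'",
            "1.2.5 : Body Syntax error, Stream length is invalid [dic=COSDictionary{COSName{Filter}:COSName{FlateDecode};COSName{Length}:COSInt{202};}; defined length=202; actual length=202, starting offset=1359",
            "1.2.2 : Body Syntax error, Expected 'EOL' before the endstream keyword at offset 4640 but found '101'",
            "1.2.5 : Body Syntax error, Stream length is invalid [dic=COSDictionary{COSName{Filter}:COSName{FlateDecode};COSName{Length}:COSInt{17};}; defined length=17; actual length=17, starting offset=4622");
        System.out.println(gaps(sample)); // [1, 1]
        System.out.println((char) 101);   // e
    }
}
```

Under that assumption, every pair in the report gives a gap of 1, so the validator appears to be looking one byte past the end of the stream data and finding an 'e' there. The log alone does not say what that single intervening byte is, so this only narrows down where to look in a hex editor.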

So, on the face of it, I simply rebuilt the file using "clean" in MuPDF and reran the validation in PDFBox.

C:\Apps\PDF\inspectors\Apache\preflight-app-3.0.0-alpha3.jar Doc1-withHelvetica-pdfa1ba.pdf

The file Doc1-withHelvetica-pdfa1ba.pdf is a valid PDF/A-1b file

HOWEVER, catch-22: the rebuilt file now fails other validators, which report:

The PDF structure was damaged but has been repaired. Depending on the extent of the damage, some data may theoretically have been lost (although typically this is unlikely).

So I went round again: I removed the PDF/A compatibility to see what was wrong, then regenerated the file as PDF/A. The report now says there is at least one bad font definition for Calibri (not surprising, as the file was originally a Word document printout). What is not obvious is that there is a rogue Calibri space character at the end of the line that contains Helvetica Bold. Removing it surfaced other problems, so it took another run through the editors. Finally, with all the dross removed, both validators agree there are no more problems.
