I frequently run into an issue with PDF Clown: when PDF files are non-English, their fonts are not recognized, and I get the exceptions below. Please find the PDF path and code path. The loadEncoding method is failing in both CompositeFont.java and SimpleFont.java. Is there a specific version of the jar I need to use to resolve this issue? Please provide your input on how to support such PDF files.
java.lang.NullPointerException
at org.pdfclown.documents.contents.fonts.CompositeFont.loadEncoding(CompositeFont.java:178)
at org.pdfclown.documents.contents.fonts.CompositeFont.onLoad(CompositeFont.java:202)
at org.pdfclown.documents.contents.fonts.Font.load(Font.java:878)
at org.pdfclown.documents.contents.fonts.Font.<init>(Font.java:368)
at org.pdfclown.documents.contents.fonts.CompositeFont.<init>(CompositeFont.java:114)
at org.pdfclown.documents.contents.fonts.Type0Font.<init>(Type0Font.java:62)
at org.pdfclown.documents.contents.fonts.Font.wrap(Font.java:268)
at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:72)
at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:1)
at org.pdfclown.documents.contents.ResourceItems.get(ResourceItems.java:119)
at org.pdfclown.documents.contents.objects.SetFont.getResource(SetFont.java:119)
at org.pdfclown.documents.contents.objects.SetFont.getFont(SetFont.java:83)
at org.pdfclown.documents.contents.objects.SetFont.scan(SetFont.java:97)
at org.pdfclown.documents.contents.ContentScanner.moveNext(ContentScanner.java:1360)
at org.pdfclown.documents.contents.ContentScanner$TextWrapper.extract(ContentScanner.java:819)
at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:771)
at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:764)
at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.get(ContentScanner.java:684)
at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.access$0(ContentScanner.java:676)
at org.pdfclown.documents.contents.ContentScanner.getCurrentWrapper(ContentScanner.java:1184)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:636)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:653)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:299)
at pdfclown2.highlight(pdfclown2.java:89)
at pdfclown2.main(pdfclown2.java:48)
*****************************other pdf issue*********************************************
java.lang.NullPointerException
at org.pdfclown.documents.contents.fonts.SimpleFont.loadEncoding(SimpleFont.java:150)
at org.pdfclown.documents.contents.fonts.SimpleFont.onLoad(SimpleFont.java:170)
at org.pdfclown.documents.contents.fonts.Font.load(Font.java:878)
at org.pdfclown.documents.contents.fonts.Font.<init>(Font.java:368)
at org.pdfclown.documents.contents.fonts.SimpleFont.<init>(SimpleFont.java:65)
at org.pdfclown.documents.contents.fonts.TrueTypeFont.<init>(TrueTypeFont.java:47)
at org.pdfclown.documents.contents.fonts.Font.wrap(Font.java:262)
at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:72)
at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:1)
at org.pdfclown.documents.contents.ResourceItems.get(ResourceItems.java:119)
at org.pdfclown.documents.contents.objects.SetFont.getResource(SetFont.java:119)
at org.pdfclown.documents.contents.objects.SetFont.getFont(SetFont.java:83)
at org.pdfclown.documents.contents.objects.SetFont.scan(SetFont.java:97)
at org.pdfclown.documents.contents.ContentScanner.moveNext(ContentScanner.java:1360)
at org.pdfclown.documents.contents.ContentScanner$TextWrapper.extract(ContentScanner.java:819)
at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:771)
at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:764)
at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.get(ContentScanner.java:684)
at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.access$0(ContentScanner.java:676)
at org.pdfclown.documents.contents.ContentScanner.getCurrentWrapper(ContentScanner.java:1184)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:636)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:653)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:653)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:653)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:645)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:653)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:653)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:653)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:653)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:653)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:653)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:653)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:299)
at pdfclown2.highlight(pdfclown2.java:89)
at pdfclown2.main(pdfclown2.java:48)
*********************************another issue**************************************
java.lang.RuntimeException: Odd number of characters.
at org.pdfclown.util.ConvertUtils.hexToByteArray(ConvertUtils.java:106)
at org.pdfclown.objects.PdfString.setValue(PdfString.java:287)
at org.pdfclown.objects.PdfString.<init>(PdfString.java:126)
at org.pdfclown.objects.PdfByteString.<init>(PdfByteString.java:58)
at org.pdfclown.documents.contents.tokens.ContentParser.parsePdfObject(ContentParser.java:182)
at org.pdfclown.documents.contents.tokens.ContentParser.parseOperation(ContentParser.java:164)
at org.pdfclown.documents.contents.tokens.ContentParser.parseContentObject(ContentParser.java:98)
at org.pdfclown.documents.contents.tokens.ContentParser.parseContentObjects(ContentParser.java:134)
at org.pdfclown.documents.contents.tokens.ContentParser.parseContentObject(ContentParser.java:112)
at org.pdfclown.documents.contents.tokens.ContentParser.parseContentObjects(ContentParser.java:134)
at org.pdfclown.documents.contents.tokens.ContentParser.parseContentObject(ContentParser.java:112)
at org.pdfclown.documents.contents.tokens.ContentParser.parseContentObjects(ContentParser.java:134)
at org.pdfclown.documents.contents.tokens.ContentParser.parseContentObject(ContentParser.java:112)
at org.pdfclown.documents.contents.tokens.ContentParser.parseContentObjects(ContentParser.java:134)
at org.pdfclown.documents.contents.Contents.load(Contents.java:598)
at org.pdfclown.documents.contents.Contents.<init>(Contents.java:372)
at org.pdfclown.documents.contents.Contents.wrap(Contents.java:351)
at org.pdfclown.documents.Page.getContents(Page.java:585)
at org.pdfclown.documents.contents.ContentScanner.<init>(ContentScanner.java:1056)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:300)
at pdfclown2.highlight(pdfclown2.java:3124)
at pdfclown2.main(pdfclown2.java:50)
NullPointerException in SimpleFont.loadEncoding
I can reproduce the NullPointerException in SimpleFont.loadEncoding using your example file "Sample_Report.pdf". This is caused by an error in the PDF: some font dictionaries in there are missing required entries.
I cannot reproduce the other two exceptions using "Sample_Report.pdf", though. Thus, I'll focus on the reproducible issue.
The cause
In your example PDF there are some simple fonts which lack the required FirstChar entry, e.g.:
According to the PDF specification ISO 32000-1 (and similarly ISO 32000-2, too), TrueType font dictionaries contain the same entries as Type1 font dictionaries (with certain differences irrelevant to the case at hand), and the section on Type1 fonts specifies:
The font above is not a standard 14 font. Thus, it is required to have a FirstChar entry. It does not. Thus, this font definition is broken.
PDF Clown, on the other hand, expects PDFs to follow the specification. So it simply retrieves the FirstChar value from the font and immediately uses it, which results in the NullPointerException.
A work-around
One can make PDF Clown a bit more lax by making it default to 0 in its SimpleFont FirstChar lookups. There are two such lookups.
In SimpleFont.loadEncoding(), replace the lookup of the FirstChar value by one that defaults to 0 if the entry is missing, and change the lookup in SimpleFont.onLoad() in the same way, as it already has been done here.
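The defaulting idea can be sketched independently of PDF Clown. The following is only an illustration under stated assumptions, not PDF Clown's actual code: the font dictionary is modelled as a plain `Map<String, Integer>`, and the names `FirstCharDefault` and `firstCharOrZero` are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class FirstCharDefault {
    // Hypothetical helper: returns the FirstChar entry of a (simplified) font
    // dictionary, or 0 if the entry is missing, instead of unboxing a null value.
    public static int firstCharOrZero(Map<String, Integer> fontDictionary) {
        Integer firstChar = fontDictionary.get("FirstChar");
        return firstChar != null ? firstChar : 0;
    }

    public static void main(String[] args) {
        // A broken font dictionary without FirstChar, like those in the example PDF.
        Map<String, Integer> brokenFont = new HashMap<>();
        brokenFont.put("LastChar", 255);
        System.out.println("broken font FirstChar: " + firstCharOrZero(brokenFont));

        // A well-formed font dictionary keeps its own value.
        Map<String, Integer> validFont = new HashMap<>();
        validFont.put("FirstChar", 32);
        System.out.println("valid font FirstChar: " + firstCharOrZero(validFont));
    }
}
```

Defaulting to 0 only hides the defect in the PDF. Whether the resulting glyph widths are right depends on how the broken font's Widths array was meant to be indexed.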
NullPointerException in CompositeFont.loadEncoding
I can reproduce the NullPointerException in CompositeFont.loadEncoding using your example file "UnicodeTest.pdf". These exceptions are caused by missing encoding CMaps in PDF Clown.
There are a number of encodings, primarily for CJK languages, which a conforming PDF processor is expected to support but which PDF libraries (in particular those developed in Europe or the Americas) often don't support out of the box.
PDF Clown expects such encoding CMaps as resources in /fonts/cmap/ in the pdfclown.jar; by default, though, only the generic CMaps Identity-H and Identity-V are there, and none of the specific Chinese/Japanese/Korean CMaps.
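To see which encoding CMaps your jar actually contains, a small classpath check may help. This is a minimal sketch that assumes pdfclown.jar is on the same classpath as the check and that the CMaps are stored under /fonts/cmap/ with the CMap name as the file name. The class `CMapResourceCheck` and its method are hypothetical names, not part of PDF Clown.

```java
import java.io.IOException;
import java.io.InputStream;

class CMapResourceCheck {
    // Returns true if a resource named /fonts/cmap/<cmapName> is visible on the classpath.
    public static boolean isCMapAvailable(String cmapName) {
        try (InputStream in =
                CMapResourceCheck.class.getResourceAsStream("/fonts/cmap/" + cmapName)) {
            return in != null;
        } catch (IOException e) {
            // Only closing the stream can fail here; treat that as "not usable".
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = {"Identity-H", "Identity-V", "UniGB-UTF16-H"};
        for (String name : names) {
            System.out.println(name + ": " + (isCMapAvailable(name) ? "found" : "missing"));
        }
    }
}
```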
You can add the required CMaps to the pdfclown.jar by adding them to the main\res\pkg\fonts\cmap\ folder of the PDF Clown project and building the jar file.
You can retrieve all CMaps from the adobe-type-tools/cmap-resources project on GitHub: simply traverse the folder structure of that project and collect the files from the CMap subfolders.
In the case of your example file, the CMaps UniCNS-UTF16-H, UniGB-UTF16-H, UniJIS-UTF16-H, and UniKS-UTF16-H sufficed, but for an application working with arbitrary PDF files you should probably add all encoding CMaps.
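Collecting the files by hand is tedious, so the traversal can be scripted. The following is a minimal sketch using only the Java standard library. It assumes a local checkout of the cmap-resources project in which the CMap files sit directly inside folders named CMap. The class name `CMapCollector` and both command line arguments are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CMapCollector {
    // Copies every regular file located directly inside a folder named "CMap"
    // below sourceRoot into targetDir and returns the number of files found.
    public static int collect(Path sourceRoot, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        List<Path> cmapFiles;
        try (Stream<Path> paths = Files.walk(sourceRoot)) {
            cmapFiles = paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getParent() != null
                        && p.getParent().getFileName() != null
                        && "CMap".equals(p.getParent().getFileName().toString()))
                .collect(Collectors.toList());
        }
        for (Path file : cmapFiles) {
            // Files with the same name overwrite each other in the flat target folder.
            Files.copy(file, targetDir.resolve(file.getFileName().toString()),
                    StandardCopyOption.REPLACE_EXISTING);
        }
        return cmapFiles.size();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: CMapCollector <cmap-resources checkout> <target cmap folder>");
            return;
        }
        int count = collect(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(count + " CMap files copied");
    }
}
```

Pointing the second argument at the cmap folder of the PDF Clown project and then rebuilding the jar should make the copied CMaps available as resources.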