We are running Tika from the command line with the following command:
java -Dlog4j2.formatMsgNoLookups=true -Xms512m -Xmx16384m -jar /tika/tika-app-2.9.1.jar --config=/tika/tika-ocr-config.xml -t test.pdf
test.pdf contains some embedded CFF fonts, which appears to be why Tika logs the following error:
ERROR [main] 17:05:36,671 org.apache.pdfbox.pdmodel.font.PDCIDFontType0 Can't read the embedded CFF font QLVBNN+HiddenHorzOCR java.io.EOFException: null
The tika-ocr-config.xml is very basic and contains only the following parser configuration:
```xml
<parser class="org.apache.tika.parser.pdf.PDFParser">
  <params>
    <param name="extractInlineImages" type="bool">false</param>
    <param name="ocrStrategy" type="string">no_ocr</param>
    <!-- whether or not to add processing to detect angles and extract text accordingly PDFBOX-4371 -->
    <param name="detectAngles" type="bool">true</param>
  </params>
</parser>
```
I know that Apache PDFBox, which Tika uses for PDF parsing, has a FontBox package for parsing CFF fonts:
https://pdfbox.apache.org/docs/2.0.11/javadocs/org/apache/fontbox/cff/package-summary.html
If anyone knows how to configure Tika to read this CFF font, or how to ignore CFF fonts so that the above error does not appear, it would be a great help to us.
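To make the "ignore" option concrete, here is a minimal, untested sketch of a Log4j 2 configuration that silences only the logger named in the error. It assumes that tika-app routes PDFBox's log output through Log4j 2 (the log format and the log4j2 flag in our command suggest this, but we have not confirmed it). The file name /tika/log4j2-quiet.xml, the appender name and the pattern are hypothetical. It would only hide the message; it would not make the font readable.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical file: /tika/log4j2-quiet.xml -->
<Configuration status="WARN">
  <Appenders>
    <!-- Send log output to stderr so it stays separate from the extracted text on stdout -->
    <Console name="stderr" target="SYSTEM_ERR">
      <PatternLayout pattern="%-5p [%t] %d{HH:mm:ss,SSS} %c %m%n"/>
    </Console>
  </Appenders>
  <Loggers>
    <!-- Turn off only the class that reports the unreadable CFF font -->
    <Logger name="org.apache.pdfbox.pdmodel.font.PDCIDFontType0" level="off"/>
    <!-- Everything else keeps logging at INFO and above -->
    <Root level="info">
      <AppenderRef ref="stderr"/>
    </Root>
  </Loggers>
</Configuration>
```

If that assumption holds, the file could be passed to the same command by adding -Dlog4j2.configurationFile=/tika/log4j2-quiet.xml before -jar.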
Regards
Rupam