I am implementing a script to convert PGS (i.e. image-based) subtitles to SRT (i.e. text-based) subtitles and I use tesseract for OCR (btw, what a great tool!).
After extracting the subtitle phrases as images and applying some pre-processing, I get decent results. For German subtitles, I have to specify the language (-l deu) to have umlauts properly detected. Interestingly, with -l deu I get some obviously wrong results that are detected correctly if I specify English as the language or no language at all:
For example, this image with a German phrase (eng: ... let's go.)
gives the following results:
- without language specified: ... lass uns gehen.
- with -l eng: ... lass uns gehen
- with -l deu: ...|48s uns gehen.
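For reference, here is a minimal sketch of how the three runs above can be reproduced side by side. It assumes the tesseract binary is on the PATH and the deu and eng traineddata files are installed. The helper names (build_command, run_ocr, compare_languages) and the optional psm parameter are only illustrative and not part of my actual script.

```python
import subprocess
from typing import Dict, Iterable, List, Optional


def build_command(image_path: str, lang: Optional[str] = None,
                  psm: Optional[int] = None) -> List[str]:
    """Build a tesseract CLI call that prints the recognized text to stdout."""
    # "stdout" as the output base makes tesseract print instead of writing a file.
    cmd = ["tesseract", image_path, "stdout"]
    if lang:
        # e.g. "deu", "eng", or a combination such as "deu+eng"
        cmd += ["-l", lang]
    if psm is not None:
        # Page segmentation mode; optional here, an assumption not used in the runs above.
        cmd += ["--psm", str(psm)]
    return cmd


def run_ocr(image_path: str, lang: Optional[str] = None,
            psm: Optional[int] = None) -> str:
    """Run tesseract once and return the recognized text without surrounding whitespace."""
    result = subprocess.run(build_command(image_path, lang, psm),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


def compare_languages(image_path: str,
                      langs: Iterable[Optional[str]] = (None, "eng", "deu")) -> Dict[str, str]:
    """Run the same image through several language settings and collect the results."""
    return {lang or "default": run_ocr(image_path, lang) for lang in langs}
```

Calling compare_languages("subtitle.png") (a hypothetical file name) returns one entry per language setting, which makes it easy to spot phrases where the settings disagree. A combined setting such as "deu+eng" could be added to the tuple in the same way, although I have not verified whether it helps with this particular phrase.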
Any ideas why this happens and how to improve the results? Could it be that the German model does not cope well with italic fonts?
I have played around with user word files, but could not observe any impact. Using tesseract version 4.1.1 on Ubuntu 20.04.
