I am using pytesseract to extract text from images. The JPGs are generated from a PDF.
On some images it reaches 98%+ accuracy, with perhaps a dot or a digit missing here and there. But at other times it generates extra letters and symbols that are not present in the picture, just unnecessary noise. Here are some examples:
Actual:
Oct 05 Pre-Authorized Payment, JOHNSON/UNIFUND
INS/ASS 578.03
Generated:
Oct 05 Pre-Authorized Payment, JOHNSON/UNIFUND 57803. ~SOC*~C~C~C~S~S
INS/ASS
Actual:
Oct 07 Direct Deposit, MEDAVIE BLUE CR MSP/OV. 215.18
Generated:
Oct 07 Direct Deposit, MEDAVIE BLUECR MSP/OV. SS SCSCSCS<;7<;73 ;7CTCSS 215.18
In the first example it generated ~SOC*~C~C~C~S~S, and in the second SS SCSCSCS<;7<;73 ;7CTCSS. Neither string appears anywhere in the image; both are plain noise.
These are only samples. Some pages are worse.
Any idea why this is happening? I know OCR will probably never be 100% accurate, so I don't want to discard the code. But this extra noise takes time to remove from the text files, and overall that time adds up.
The images I am using are high DPI (300), with black text on a white background. They may not be the best quality, but they are very clear. Extra noise is the problem here, not missing text.
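One possible stopgap, sketched here under assumptions rather than verified on these pages: pytesseract's image_to_data returns a per-word confidence alongside the text, so words below a threshold can be dropped before the lines are rebuilt. This only helps if the noise actually gets low confidence values, which is an assumption. The function names, the min_conf value of 60, and the image path are hypothetical.

```python
from collections import defaultdict


def filter_by_confidence(data, min_conf=60.0):
    """Rebuild text lines from image_to_data output, dropping low-confidence words.

    `data` is the dict returned by
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    `min_conf` is an assumed threshold and would need tuning per document.
    """
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        # conf may be a str, int, or float depending on the pytesseract version;
        # non-word entries (blocks, lines) carry a confidence of -1.
        conf = float(data["conf"][i])
        if conf < min_conf or not word.strip():
            continue
        key = (
            data["page_num"][i],
            data["block_num"][i],
            data["par_num"][i],
            data["line_num"][i],
        )
        lines[key].append(word)
    # Sort by (page, block, paragraph, line) to keep reading order.
    return "\n".join(" ".join(words) for _, words in sorted(lines.items()))


def ocr_image(path, min_conf=60.0):
    """Run Tesseract on one image and return confidence-filtered text."""
    # Imported here so the filter above can be used without Tesseract installed.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(path), output_type=pytesseract.Output.DICT
    )
    return filter_by_confidence(data, min_conf)
```

Calling ocr_image("page.jpg") (a hypothetical file name) would return the filtered text. Printing the dropped words together with their confidence values first is a cheap way to check whether the threshold removes noise without also removing real amounts.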
Here is the actual image:
