Tesseract generating extra symbols, letters and noises. How to stop it?

194 Views Asked by user18004387 At 07 February 2022 at 01:40

I am using pytesseract to extract text from images. The JPGs are created from PDFs.

On some images it reaches 98%+ accuracy, with maybe a dot or a digit missing here and there. But at other times it generates extra letters and symbols that are not present in the picture, just unnecessary noise. Here are some examples:

Actual:
Oct 05 Pre-Authorized Payment, JOHNSON/UNIFUND
INS/ASS 578.03


Generated:
Oct 05 Pre-Authorized Payment, JOHNSON/UNIFUND 57803. ~SOC*~C~C~C~S~S
INS/ASS


Actual:
Oct 07 Direct Deposit, MEDAVIE BLUE CR MSP/OV. 215.18


Generated:
Oct 07 Direct Deposit, MEDAVIE BLUECR MSP/OV. SS SCSCSCS<;7<;73 ;7CTCSS 215.18

In the first example it generated ~SOC*~C~C~C~S~S, and in the second SS SCSCSCS<;7<;73 ;7CTCSS. Neither string appears anywhere in the image; they are just plain noise.

These are just samples. There are pages that are worse.

Any idea why this is happening? I know OCR will probably never be 100% accurate, so I don't want to discard the code. But this extra noise takes time to remove from the text files, and overall that time adds up.

The images I am using are high DPI (300), with black text on a white background. They may not be the best quality, but they are very clear. Extra noise is the problem here, not missing text.
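For reference, one way the cleanup step could be automated is to ask Tesseract for per-word confidence values and drop the words it is unsure about. This is only a rough sketch, not a confirmed fix: it assumes the noise tokens come back with low confidence, which may not hold for every page. The helper names (filter_by_confidence, ocr_image) and the threshold of 60 are made up for illustration; pytesseract.image_to_data and pytesseract.Output.DICT are the actual pytesseract calls.

```python
def filter_by_confidence(data, min_conf=60.0):
    """Rebuild text lines from an image_to_data-style dict,
    skipping empty entries and words below min_conf.

    min_conf=60.0 is an arbitrary starting point, not a recommended value.
    """
    lines = {}
    for i, word in enumerate(data["text"]):
        # conf may arrive as str, int, or float depending on the version;
        # non-word rows (blocks, paragraphs, lines) carry a conf of -1.
        conf = float(data["conf"][i])
        if not word.strip() or conf < min_conf:
            continue
        key = (
            data["page_num"][i],
            data["block_num"][i],
            data["par_num"][i],
            data["line_num"][i],
        )
        lines.setdefault(key, []).append(word)
    # Keys are unique, so sorting orders lines by page/block/paragraph/line.
    return [" ".join(words) for _, words in sorted(lines.items())]


def ocr_image(path, min_conf=60.0):
    """Hypothetical wrapper: OCR one image and keep only confident words."""
    # Imported here so filter_by_confidence can be tried without Tesseract installed.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(path), output_type=pytesseract.Output.DICT
    )
    return filter_by_confidence(data, min_conf)
```

Calling ocr_image("page.jpg") (the file name is a placeholder) would return a list of text lines. If the junk strings turn out to have high confidence, this filter will not remove them, and a threshold set too high will also drop real words, so the value needs checking against a few known pages.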

Here is the actual image -


