Pytesseract/OpenCV remove lines


First time here, using PyTesseract and OpenCV.

I need to OCR data from pictures that look like this:

enter image description here

I manually added the red circles to highlight the data elements I'm interested in. They are not part of the image.

My first step is to remove the lines with OpenCV, because in the tests I ran with the lines left in, the OCR results were terrible.

import pytesseract
from pytesseract import Output
import cv2 as cv
import numpy as np

image = cv.imread(r'C:\Downloads\testdcs3.jpg')

# Convert to grayscale, then binarize with Otsu (inverted, so ink becomes white)
gray = cv.cvtColor(image, cv.COLOR_BGR2GRAY)
thresh = cv.threshold(gray, 0, 255, cv.THRESH_BINARY_INV + cv.THRESH_OTSU)[1]

# Remove horizontal lines
horizontal_kernel = cv.getStructuringElement(cv.MORPH_RECT, (40,1))
remove_horizontal = cv.morphologyEx(thresh, cv.MORPH_OPEN, horizontal_kernel, iterations=2)
cnts = cv.findContours(remove_horizontal, cv.RETR_EXTERNAL, cv.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv.drawContours(thresh, [c], -1, (0,0,0), 5)

# Remove vertical lines
vertical_kernel = cv.getStructuringElement(cv.MORPH_RECT, (1,30))
remove_vertical = cv.morphologyEx(thresh, cv.MORPH_OPEN, vertical_kernel, iterations=2)
cnts = cv.findContours(remove_vertical, cv.RETR_EXTERNAL, cv.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv.drawContours(thresh, [c], -1, (0,0,0), 5)

# Invert back to dark text on a white background before passing it to Tesseract
invert = 255 - thresh

This is the result I get; I seem to have lost a lot of definition.

First question: is there a better way to remove the lines while keeping the original image quality?

enter image description here

Second question: so far I have used OEM 3 and PSM 4, and the OCR results contain a lot of errors. What should I use as custom_config for this type of image?

# setting parameters for tesseract
custom_config = r'--oem 3 --psm 4'
data = pytesseract.image_to_data(invert, output_type='data.frame', config=custom_config)

For example, in every test run the OCR gives me a euro sign instead of the initial "C" in the Statement of Mailing (see the red circles). And "LC" plus "0202" are not captured at all, which I don't understand!

€908143440

Last question: I was thinking of using the "left" and "width" positions that image_to_data returns for static keywords to locate the data elements I'm interested in. For example, filtering image_to_data for numeric values located right of "Courrier" and left of the end of "quartier" (left + width) would give the only possible numeric value, which is the "Total No of items": 215.

Does that seem like a good way to locate my data elements in image_to_data?
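To make the idea concrete, here is a minimal sketch of the filter I have in mind. It assumes the dict-of-lists shape that image_to_data returns with output_type=Output.DICT (keys "text", "left", "width", "conf"). The function name `numbers_between`, the `min_conf` parameter, and the sample coordinates are made up for illustration. I read "right of Courrier" as "starting at or after the left edge of Courrier", which is an assumption.

```python
def numbers_between(data, left_word, right_word, min_conf=0):
    """Return numeric tokens lying horizontally between two anchor words.

    Hypothetical helper. `data` is assumed to look like the result of
    pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    words = []
    for text, left, width, conf in zip(
        data["text"], data["left"], data["width"], data["conf"]
    ):
        text = str(text).strip()
        if text:  # skip the empty block/line rows Tesseract emits
            words.append(
                {"text": text, "left": left, "right": left + width, "conf": float(conf)}
            )

    def find(word):
        # First case-insensitive match of an anchor keyword, or None.
        return next((w for w in words if w["text"].lower() == word.lower()), None)

    start, end = find(left_word), find(right_word)
    if start is None or end is None:
        return []

    # Keep digit-only tokens whose box sits inside the horizontal window
    # [left edge of the first anchor, right edge of the second anchor].
    return [
        w["text"]
        for w in words
        if w["text"].isdigit()
        and w["conf"] >= min_conf
        and w["left"] >= start["left"]
        and w["right"] <= end["right"]
    ]


# Made-up sample in the image_to_data dict layout (not real OCR output).
sample = {
    "text": ["", "Courrier", "de", "quartier", "215", "42", "abc"],
    "left": [0, 100, 180, 210, 150, 400, 160],
    "width": [0, 70, 20, 80, 30, 20, 30],
    "conf": [-1, 95, 90, 93, 88, 91, 80],
}
print(numbers_between(sample, "Courrier", "quartier"))
```

This only checks horizontal position, so it would also pick up numbers in the same column anywhere else on the page; a check on "top" would probably be needed as well.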

Thanks!!

enter image description here
