Pytesseract/OpenCV remove lines

81 Views Asked by Maquisard At 13 February 2024 at 15:56

First time here using PyTesseract and OpenCV.

I need to OCR data from pictures that look like this:

I've manually added the red circles to highlight the data elements I'm interested in. They are not part of the image.

The first step I'm doing is removing the lines with OpenCV, because in the tests I ran with the lines left in, the OCR results were terrible.

import pytesseract
from pytesseract import Output
import cv2 as cv
import numpy as np

image = cv.imread(r'C:\Downloads\testdcs3.jpg')

# Binarize with Otsu; THRESH_BINARY_INV makes text and lines white on black
gray = cv.cvtColor(image, cv.COLOR_BGR2GRAY)
thresh = cv.threshold(gray, 0, 255, cv.THRESH_BINARY_INV + cv.THRESH_OTSU)[1]

# Remove horizontal lines: opening with a wide kernel keeps only long horizontal runs
horizontal_kernel = cv.getStructuringElement(cv.MORPH_RECT, (40,1))
remove_horizontal = cv.morphologyEx(thresh, cv.MORPH_OPEN, horizontal_kernel, iterations=2)
cnts = cv.findContours(remove_horizontal, cv.RETR_EXTERNAL, cv.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]  # return shape differs between OpenCV versions
for c in cnts:
    # Paint each detected line black with a 5 px stroke
    cv.drawContours(thresh, [c], -1, (0,0,0), 5)

# Remove vertical lines: same idea with a tall kernel
vertical_kernel = cv.getStructuringElement(cv.MORPH_RECT, (1,30))
remove_vertical = cv.morphologyEx(thresh, cv.MORPH_OPEN, vertical_kernel, iterations=2)
cnts = cv.findContours(remove_vertical, cv.RETR_EXTERNAL, cv.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv.drawContours(thresh, [c], -1, (0,0,0), 5)

# Invert back to black text on a white background for Tesseract
invert = 255 - thresh

I'm getting this result, where I seem to have lost a lot of definition.

First question: is there a better way to remove the lines and keep the original image quality?

Second question: so far I use OEM 3 and PSM 4, and the OCR results contain a lot of errors. What should I use as custom_config for this type of image?

# setting parameters for tesseract
custom_config = r'--oem 3 --psm 4'
data = pytesseract.image_to_data(invert, output_type='data.frame', config=custom_config)

For example, in every test run the OCR gives me a euro sign instead of the initial "C" in the Statement of Mailing (see the red circles). And "LC" plus "0202" are not captured at all, which I don't understand!

€908143440

Last question: I was thinking of using the image_to_data "left" and "width" positions of static keywords to locate the data elements I'm interested in. For example, filtering image_to_data to numeric values located right of "Courrier" and left of the end of "quartier" (left + width) would give the only possible numeric value, which is the "Total No of items": 215.

Does that seem like a good way to locate my data elements in image_to_data?
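The filtering idea above could be sketched roughly as follows. This assumes the data is requested as a dict of lists (pytesseract's `Output.DICT`) rather than a data frame, that "right of" means at or beyond the left edge of the first keyword, and that keywords match whole words exactly. The function name `numeric_words_between` is hypothetical.

```python
def numeric_words_between(data, left_keyword, right_keyword):
    """Return numeric words lying horizontally between two keyword boxes."""
    # Collect the non-empty words together with their horizontal extent.
    words = [
        {"text": str(t).strip(), "left": l, "width": w}
        for t, l, w in zip(data["text"], data["left"], data["width"])
        if str(t).strip()
    ]

    def find(keyword):
        # First word whose text equals the keyword, ignoring case.
        for word in words:
            if word["text"].lower() == keyword.lower():
                return word
        return None

    left_box = find(left_keyword)
    right_box = find(right_keyword)
    if left_box is None or right_box is None:
        return []  # a keyword was not recognized, so there is no window

    # Window: from the left edge of the first keyword (an assumption)
    # to the right edge (left + width) of the second keyword.
    x_min = left_box["left"]
    x_max = right_box["left"] + right_box["width"]

    return [
        word["text"]
        for word in words
        if word["text"].isdigit()
        and word["left"] >= x_min
        and word["left"] + word["width"] <= x_max
    ]
```

With the setup from the question, `data` would come from `pytesseract.image_to_data(invert, output_type=Output.DICT, config=custom_config)`, and the call would be `numeric_words_between(data, "Courrier", "quartier")`. The sketch only checks horizontal position, so it depends on both keywords being recognized correctly and on no other number falling inside the same horizontal window.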

Thanks!!
