Tesseract reads passport MRZ text from an image incorrectly, identifying <<<<<<<< as kkkk or cccc

I am trying to read the passport MRZ string from an image, using Tesseract for OCR and OpenCV for image processing. I have tried three different approaches, and none of them worked.

Attempt 1

I have the original MRZ image. When I run OCR on it, Tesseract reads it as:

IDAUT10000999<6<<<<<<<<<<<<<<<
7109094F1112315AUT<<<<<<xcc<<6
MUSTERFRAU<<ISOLDE<<<<<<<<cc<<

This is incorrect: Tesseract reads some of the < characters as x, c, or k. When I pass the string to the mrz-java library to extract the details, it gives the following error:

[error] Error parsing MRZ string: Failed to parse MRZ MRTD_TD1 IDAUT10000999<6<<<<<<<<<<<<<<<
[error] 7109094F1112315AUT<<<<<<xcc<<6
[error] MUSTERFRAU<<ISOLDE<<<<<<<<cc<<
[error]  at 24-25,1: Invalid character in MRZ record: x
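The MRZ alphabet contains only A-Z, 0-9 and <, so every lowercase letter in this output is necessarily a misread. A minimal sketch of the kind of regex clean-up that follows from that observation is shown here; `MrzCleanup` and `lowercaseToFiller` are hypothetical names for illustration, and mapping every lowercase letter to the filler is an assumption that happens to fit this sample, not a general rule.

```scala
object MrzCleanup {
  // Assumption: a lowercase letter can never be valid in an MRZ row,
  // so it is treated as a misread filler character '<'.
  // Spaces are dropped as well, as in the OCR post-processing of the question.
  def lowercaseToFiller(ocrLine: String): String =
    ocrLine.replace(" ", "").replaceAll("[a-z]", "<")
}
```

This cannot repair misreads that land on valid MRZ characters, such as an uppercase K in place of <, because those are indistinguishable from real data at this level.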

Attempt 2

Then I converted the image to grayscale and binarized it using OpenCV. Here is the code:

    import java.io.File
    import net.sourceforge.tess4j.Tesseract
    import org.opencv.core.Mat
    import org.opencv.imgcodecs.Imgcodecs
    import org.opencv.imgproc.Imgproc

    val roiImagePath = "src/main/resources/ocr/passport/two-page-passport-mrz-detected.jpeg"

    // Convert the detected MRZ region to grayscale and save it
    val roiImage = Imgcodecs.imread(roiImagePath)
    val grayScaleROI = new Mat()
    Imgproc.cvtColor(roiImage, grayScaleROI, Imgproc.COLOR_BGR2GRAY)
    val roiGrayImagePath = "src/main/resources/ocr/passport/two-page-passport-mrz-detected-gray.jpeg"
    Imgcodecs.imwrite(roiGrayImagePath, grayScaleROI)

    // Binarize with an adaptive mean threshold (block size 15, constant 25) and save it
    val binary = new Mat()
    Imgproc.adaptiveThreshold(grayScaleROI, binary, 255, Imgproc.ADAPTIVE_THRESH_MEAN_C, Imgproc.THRESH_BINARY, 15, 25)
    val roiBinaryImagePath = "src/main/resources/ocr/passport/two-page-passport-mrz-detected-binary.jpeg"
    Imgcodecs.imwrite(roiBinaryImagePath, binary)

    // Run Tesseract on the binary image and strip spaces from the result
    val tesseract = new Tesseract()
    tesseract.setDatapath("/usr/share/tesseract-ocr/4.00/tessdata")
    tesseract.setVariable("user_defined_dpi", "600")
    val result = tesseract.doOCR(new File(roiBinaryImagePath))
    val mrzStr = result.replace(" ", "")
    println(s"two page passport mrz string is: $mrzStr")

It created the following binary image.

Tesseract reads the MRZ string from the binary image as:

IDAUT1DODD999<E<KK<KKKKEKEKEK
7AD9D9GF1TEZSISAUTKKKKKKKKKEKG
MUSTERFRAUSKISOLDEKKKKKKKKKKK

and mrz-java fails on this string with the following error:

[error] Error parsing MRZ string: Failed to parse MRZ null IDAUT1DODD999<E<KK<KKKKEKEKEK
[error] 7AD9D9GF1TEZSISAUTKKKKKKKKKEKG
[error] MUSTERFRAUSKISOLDEKKKKKKKKKKK
[error]  at 0-0,0: Different row lengths: 0: 29 and 1: 30
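The "Different row lengths" error is a purely structural check, so it can be reproduced before the string ever reaches the parser. The sketch below assumes the TD1 layout (three rows of 30 characters, each drawn from A-Z, 0-9 and <); `MrzShape` and `td1Problems` are hypothetical names and are not part of mrz-java.

```scala
object MrzShape {
  private val Td1Rows = 3
  private val Td1Cols = 30
  private val Allowed: Set[Char] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789<".toSet

  // Returns a list of structural problems; an empty list means the text has a TD1 shape.
  // Rows and columns are reported zero-based.
  def td1Problems(ocrText: String): List[String] = {
    // Split on any line break, drop spaces and blank lines
    val rows = ocrText.split("\\R").map(_.replace(" ", "")).filter(_.nonEmpty).toList

    val rowCount =
      if (rows.length != Td1Rows) List(s"expected $Td1Rows rows, got ${rows.length}") else Nil

    val lengths = rows.zipWithIndex.collect {
      case (row, i) if row.length != Td1Cols =>
        s"row $i has ${row.length} characters, expected $Td1Cols"
    }

    val chars = for {
      (row, i) <- rows.zipWithIndex
      (ch, j) <- row.zipWithIndex
      if !Allowed(ch)
    } yield s"row $i, column $j: invalid character '$ch'"

    rowCount ++ lengths ++ chars
  }
}
```

A check like this only confirms the shape of the OCR output. It says nothing about whether the characters are the right ones, so a row of uppercase K misreads passes it as long as the length is 30.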

Attempt 3

Then I resized the binary image:

    import org.opencv.core.Size

    val width = 1000 // Target width in pixels (adjust based on your needs)
    val height = (width * binary.rows()) / binary.cols() // Maintain aspect ratio

    // Resize the binary image with nearest-neighbour interpolation and save it
    val resizedRoiImage = new Mat()
    Imgproc.resize(binary, resizedRoiImage, new Size(width, height), 0.0, 0.0, Imgproc.INTER_NEAREST)

    val resizedImageROIPath = "src/main/resources/ocr/passport/two-page-passport-mrz-detected-binary-resized_image.jpg"
    Imgcodecs.imwrite(resizedImageROIPath, resizedRoiImage)

Tesseract reads the MRZ string from the resized image as:

TOAUTIOOOOIISKhcceccccddddddce
FIOPOSAFIFESSISAUTReececeececs
MUSTERFRAUCCKISOLDECKccccdcddd

and the error is:

[info] 15:54:04.200 633 [main] MrzParser INFO - Check digit verification failed for document number: expected 0 but got h
[error] Error parsing MRZ string: Failed to parse MRZ MRTD_TD1 TOAUTIOOOOIISKhcceccccddddddce
[error] FIOPOSAFIFESSISAUTReececeececs
[error] MUSTERFRAUCCKISOLDECKccccdcddd
[error]  at 15-16,0: Invalid character in MRZ record: c
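The check digit message in this log can be reproduced by hand. MRZ check digits use the repeating weights 7, 3, 1 over the field, with digits taken at face value, A-Z as 10-35 and < as 0, and the weighted sum taken modulo 10. The sketch below is an independent illustration of that rule, not the mrz-java implementation; `MrzCheckDigit` and `compute` are hypothetical names.

```scala
object MrzCheckDigit {
  private val Weights = Array(7, 3, 1)

  // Numeric value of one MRZ character: digits as-is, A-Z as 10-35, '<' as 0
  private def value(ch: Char): Int = ch match {
    case d if d >= '0' && d <= '9' => d - '0'
    case l if l >= 'A' && l <= 'Z' => l - 'A' + 10
    case '<' => 0
    case other => throw new IllegalArgumentException(s"Invalid MRZ character: $other")
  }

  // Weighted sum of the field with repeating weights 7, 3, 1, modulo 10
  def compute(field: String): Int =
    field.zipWithIndex.map { case (ch, i) => value(ch) * Weights(i % 3) }.sum % 10
}
```

For the document number field of Attempt 1, `MrzCheckDigit.compute("10000999<")` gives 6, which agrees with the digit printed in the MRZ. For the garbled field of this attempt, `MrzCheckDigit.compute("IOOOOIISK")` gives 0, which is the "expected 0" in the log above; the parser then finds `h` in the check digit position.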

Can anyone please help me read the text properly? I have also tried a regex to convert c or k back to <, but that did not work either. If anyone can suggest a workaround or an improvement to the code, please help. Thanks.
