The goal: scan papers on a document scanner into a single PDF file and then process them with OpenCV.
The expected result is a program that works like this:
- Extract one image from the PDF file as binary data.
- Convert it into an OpenCV Mat.
- Process the Mat: image processing, line detection, and so on.
- Repeat for the next image.
The program uses the PyMuPDF package to extract images from the PDF file.
`with fitz.Document(file) as doc` opens the PDF document. `image_dict = doc.extract_image(xref)` extracts the image as a dictionary; an alternative way is to create a `Pixmap`. In `np.dtype(f'u{image_dict["bpc"] // 8}')`, the expression `f'u{image_dict["bpc"] // 8}'` builds an array-protocol type string such as `'u1'` or `'u2'`, which are the NumPy data types for one-byte and two-byte unsigned integers. See notes.

Notes:
The first loop, `for page in doc`, iterates over the pages in the PDF document; the second, `for xref in page.get_images(False)`, iterates over the XREFs of the images located on the page. The condition `if xref[1] == 0` is intended to cut off “pseudo-images” (“stencil masks”), whose special purpose is to define the transparency of some other image. Yes, the alpha channel (transparency) will be lost. Scanned papers very likely contain no “stencil masks”, so this check may be overkill.
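To make the mask check concrete: with `get_images(False)`, each entry is a tuple whose first two fields are the image's xref and `smask`, the xref of its soft mask (0 when the image has none). The sketch below applies the same `== 0` test to plain tuples, so it runs without a PDF. The helper name `xrefs_without_smask` and the sample values are made up for illustration:

```python
from typing import Iterable, Iterator, Sequence


def xrefs_without_smask(entries: Iterable[Sequence]) -> Iterator[int]:
    """Yield the xref of every entry whose smask field (index 1) is 0."""
    for entry in entries:
        if entry[1] == 0:
            yield entry[0]


# Made-up entries in the (xref, smask, width, height, bpc, colorspace, ...) layout.
sample = [
    (12, 0, 2480, 3508, 8, "DeviceGray"),
    (15, 16, 640, 480, 8, "DeviceRGB"),  # this image refers to soft mask 16
]
kept = list(xrefs_without_smask(sample))
```

With these sample entries, only the first image passes the check; the second is skipped because its `smask` field is non-zero.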