How can I modify this Python script so that it converts a .pdf file to a .txt file?


I copied some code from the web into a Python file, put that file in a folder together with the PDF, and ran it from the command prompt to convert the PDF to .txt. However, the code only prints the extracted text to the console, and the cmd window cannot hold that much output; a PDF can be 300 pages long. I need the text saved to a .txt file instead.

Here is the code:

import pdf2image
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract


def pdf_to_img(pdf_file):
    return pdf2image.convert_from_path(pdf_file)


def ocr_core(file):
    text = pytesseract.image_to_string(file, lang='eng')
    return text


def print_pages(pdf_file):
    images = pdf_to_img(pdf_file)
    for pg, img in enumerate(images):
        print(ocr_core(img))


print_pages('1.pdf')

I renamed the PDF so that its name matches the one in the script (1.pdf).

I looked for YouTube tutorials but did not find anything useful. I was expecting a video titled "How to convert pdf to txt with tesseract" or something similar.
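For reference, one possible way to get a .txt file is to write each page's OCR result to a file instead of printing it. The following is a minimal sketch, not a tested solution for this particular PDF: the helper names `write_pages` and `pdf_to_txt` and the default output name (the PDF name with a .txt suffix) are assumptions made for illustration, while `pdf2image.convert_from_path` and `pytesseract.image_to_string` are the same calls used in the script above.

```python
from pathlib import Path


def write_pages(pages, ocr, txt_file):
    """Run ocr() on each page, write the results to txt_file, return the page count.

    Hypothetical helper: `ocr` is any callable that maps a page to a string.
    """
    count = 0
    # An explicit UTF-8 encoding avoids depending on the platform default codec.
    with open(txt_file, "w", encoding="utf-8") as out:
        for count, page in enumerate(pages, start=1):
            out.write(ocr(page))
            out.write("\n")  # keep a line break between consecutive pages
    return count


def pdf_to_txt(pdf_file, txt_file=None, lang="eng"):
    """OCR every page of pdf_file into a text file (hypothetical helper)."""
    # Imported here so that write_pages() can be used without the OCR packages.
    import pdf2image
    import pytesseract

    if txt_file is None:
        # Assumed naming convention: 1.pdf -> 1.txt in the same folder.
        txt_file = Path(pdf_file).with_suffix(".txt")

    images = pdf2image.convert_from_path(pdf_file)
    return write_pages(
        images,
        lambda img: pytesseract.image_to_string(img, lang=lang),
        txt_file,
    )
```

With this sketch, calling `pdf_to_txt('1.pdf')` in place of `print_pages('1.pdf')` should produce a file named 1.txt next to the PDF. Alternatively, the original script can be left unchanged and its printed output redirected from the command prompt, for example `python script.py > 1.txt`, where `script.py` stands for whatever the Python file is actually called.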
