This is the image
from which I am trying to extract data (the text is a mix of Hindi and English) using pytesseract. I then wish to reconstruct it into an Excel file with the same formatting.
Firstly, the problem is that I can pass only one language (Hindi in this case) to pytesseract, so I get a result like this:
My code looks like this:
import cv2
import pytesseract

path = "./1.jpg"
text = []
# use pytesseract to read the text from the image
img = cv2.imread(path)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
text.append(pytesseract.image_to_string(img, lang="hin"))
# print(text)
# export the recognized text
with open("temp.txt", "w", encoding="utf-8") as f:
    f.writelines(text)
I have two questions:
- How do I read both the English and Hindi text from this image, i.e. how do I pass both languages as parameters to the image_to_string() method?
- Once I perform this OCR, how do I reconstruct the table and export that dataframe to an Excel file?
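For context on what I have tried so far: Tesseract's documentation says multiple languages can be joined with `+` (e.g. `lang="hin+eng"`, provided both traineddata files are installed), and `pytesseract.image_to_data` returns one entry per word with its bounding box, which could be grouped into table rows by vertical position. A minimal sketch of that row-grouping idea (the sample word boxes and the `row_tol` threshold are made up for illustration, not real OCR output):

```python
import pandas as pd

# In the real pipeline this would come from:
#   data = pytesseract.image_to_data(img, lang="hin+eng",
#                                    output_type=pytesseract.Output.DICT)
# Here we simulate a few word boxes (text + top-left coordinates).
ocr_words = [
    {"text": "नाम", "left": 10, "top": 12},
    {"text": "Name", "left": 120, "top": 14},
    {"text": "राम", "left": 10, "top": 52},
    {"text": "Ram", "left": 120, "top": 55},
]

def group_into_rows(words, row_tol=10):
    """Group word boxes into table rows: words whose 'top' values are
    within row_tol pixels of a row's first word belong to that row."""
    rows = []
    for w in sorted(words, key=lambda w: w["top"]):
        if rows and abs(rows[-1][0]["top"] - w["top"]) <= row_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    # sort each row left-to-right and keep only the text
    return [[w["text"] for w in sorted(r, key=lambda w: w["left"])]
            for r in rows]

table = group_into_rows(ocr_words)
# first grouped row becomes the header, the rest become data rows
df = pd.DataFrame(table[1:], columns=table[0])
# df.to_excel("reconstructed.xlsx", index=False)  # needs openpyxl installed
```

This obviously will not preserve merged cells or styling; it only recovers the row/column grid, after which `df.to_excel` writes the workbook.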