Is it possible for tabula-py to extract the numeric text 007 as 007 instead of 7?

I use tabula-py to extract table content from a PDF. Numeric text such as 010019 or 0007 is always converted to a float. Is there any way to get the original value back (0007 instead of 7.0)?

Ray Ronnaret On

I just found a workaround: instead of extracting to a DataFrame, extract to JSON, which keeps the raw text of every cell.

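from tabula import read_pdf

input_file = r'31020_FAIL.pdf'
# output_format='json' returns the parsed JSON (a list of tables) instead of DataFrames
js = read_pdf(input_file, pages='all', lattice=True, output_format='json')
print(js[1]['data'][1][2])  # second table, second row, third cell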

For my file, the original value is preserved in the 'text' field of the output:

{'top': 116.994576,
 'left': 72.75,
 'width': 42.54998016357422,
 'height': 20.292152404785156,
 'text': '0007'}
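
To get whole rows back rather than a single cell, the 'text' fields can be collected into plain string lists. The following is only a minimal sketch, assuming every table in the JSON has a 'data' list of rows and every cell is a dict like the one above. The helper name table_to_rows and the sample table are made up for illustration (only the '0007' cell comes from my file), so the sketch runs without a PDF.

```python
def table_to_rows(table):
    """Return the cell text of one tabula-py JSON table as rows of strings."""
    # Reading cell["text"] directly avoids any numeric conversion,
    # so leading zeros are kept exactly as they appear in the PDF.
    return [[cell.get("text", "") for cell in row] for row in table["data"]]


# Hypothetical stand-in for one element of the list returned by
# read_pdf(..., output_format='json'); the header and '010019' are invented.
sample_table = {
    "data": [
        [{"text": "Code"}, {"text": "Ref"}],
        [
            {
                "top": 116.994576,
                "left": 72.75,
                "width": 42.54998016357422,
                "height": 20.292152404785156,
                "text": "0007",
            },
            {"text": "010019"},
        ],
    ]
}

rows = table_to_rows(sample_table)
print(rows)
```

With real output, the same helper would be called as table_to_rows(js[1]). The resulting rows are ordinary Python strings, so they can be handed to whatever comes next without '0007' turning into 7.0.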