I have used pdfplumber with perfect results before, but I am stumped by this one. I want to extract the data from a table on a PDF page. I crop the page to the area containing the table and pass explicit vertical lines, but extract_table fails to find the rightmost column. For comparison, extract_text on the same cropped area finds all the expected data. In the output below, the third VAT column (the "inc" values) is missing from every table line: vlines holds 11 boundaries, which should give 10 columns, yet each table line has only 9 cells.
import os, sys, pdfplumber

print('pdfplumber v.' + pdfplumber.__version__,
      'Python v.' + str(sys.version_info.major) + '.' +
      str(sys.version_info.minor), '\n')

# x positions of the column boundaries (11 lines, so 10 columns)
vlines = [44.6, 95.65, 124.0, 186.35, 277.05, 322.4, 367.75, 421.6,
          464.15, 506.7, 544.7539999999999]
# (x0, top, x1, bottom) of the area containing the table
boundingbox = (45, 285, 550, 350)

pdf = pdfplumber.open('test.pdf')
for pageno, page in enumerate(pdf.pages, 1):
    if pageno == 4:
        pagecropped = page.crop(boundingbox, strict=True)

        # Plain text extraction: finds all three VAT columns
        textlines = pagecropped.extract_text()
        textlines = textlines.split('\n')
        for textline in textlines:
            print('text line:', textline)
        print()

        # Table extraction: the rightmost column is missing
        tablelines = pagecropped.extract_table(table_settings={
            "vertical_strategy": "explicit",
            "explicit_vertical_lines": vlines,
            "horizontal_strategy": "text",
            "snap_tolerance": 5})
        for tableline in tablelines:
            print('table line:', tableline)
pdfplumber v.0.11.0 Python v.3.12
text line: UK calls
text line: Date Time Phone number Destination Duration Charged Included? VAT VAT VAT
text line: hh:mm:ss hh:mm:ss ex rate inc
text line: Mon 20 Jul 13:44 01435883510 Landline 00:00:48 00:01:00 Yes £0.000 20% £0.000
text line: Mon 20 Jul 13:45 07818648038 Vodafone mobile 00:00:04 00:01:00 Yes £0.000 20% £0.000
text line: Wed 29 Jul 08:57 121 Voicemail 00:00:57 00:01:00 Yes £0.000 20% £0.000
text line: Fri 31 Jul 18:16 01509633033 Landline 00:00:57 00:01:00 Yes £0.000 20% £0.000
table line: ['UK calls', '', '', '', '', '', '', '', '']
table line: ['Date', 'Time', 'Phone number', 'Destination', 'Duration', 'Charged', 'Included?', 'VAT', 'VAT']
table line: ['', '', '', '', 'hh:mm:ss', 'hh:mm:ss', '', 'ex', 'rate']
table line: ['Mon 20 Jul', '13:44', '01435883510', 'Landline', '00:00:48', '00:01:00', 'Yes', '£0.000', '20%']
table line: ['Mon 20 Jul', '13:45', '07818648038', 'Vodafone mobile', '00:00:04', '00:01:00', 'Yes', '£0.000', '20%']
table line: ['Wed 29 Jul', '08:57', '121', 'Voicemail', '00:00:57', '00:01:00', 'Yes', '£0.000', '20%']
table line: ['Fri 31 Jul', '18:16', '01509633033', 'Landline', '00:00:57', '00:01:00', 'Yes', '£0.000', '20%']
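
As a cross-check that does not depend on extract_table, the words can be binned into the explicit columns by hand. This is only a rough sketch, not pdfplumber's own table logic: the helper name bin_words_into_columns and the row_tolerance parameter are hypothetical, and it assumes word dicts shaped like the ones extract_words() returns (keys 'text', 'x0', 'x1', 'top'). The sample coordinates are made up for illustration and are not taken from the PDF.

```python
from bisect import bisect_right


def bin_words_into_columns(words, vlines, row_tolerance=3.0):
    """Group word dicts into rows, then into the columns bounded by vlines.

    words: dicts with 'text', 'x0', 'x1', 'top' (assumed extract_words() shape).
    vlines: sorted x positions; N lines define N - 1 columns.
    """
    n_cols = len(vlines) - 1

    # Step 1: cluster words into rows by their 'top' coordinate.
    rows = []
    for word in sorted(words, key=lambda w: w["top"]):
        if rows and abs(word["top"] - rows[-1][0]["top"]) <= row_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])

    # Step 2: within each row, place every word by its horizontal midpoint.
    table = []
    for row in rows:
        cells = [""] * n_cols
        for word in sorted(row, key=lambda w: w["x0"]):
            mid = (word["x0"] + word["x1"]) / 2
            col = bisect_right(vlines, mid) - 1
            if 0 <= col < n_cols:  # skip words outside the outer lines
                cells[col] = f"{cells[col]} {word['text']}".strip()
        table.append(cells)
    return table


# Same column boundaries as in the question.
vlines = [44.6, 95.65, 124.0, 186.35, 277.05, 322.4, 367.75, 421.6,
          464.15, 506.7, 544.7539999999999]

# Made-up word positions for one data row (illustration only).
sample_words = [
    {"text": "Mon", "x0": 46.0, "x1": 60.0, "top": 320.0},
    {"text": "20", "x0": 62.0, "x1": 70.0, "top": 320.2},
    {"text": "Jul", "x0": 72.0, "x1": 85.0, "top": 320.1},
    {"text": "13:44", "x0": 97.0, "x1": 118.0, "top": 320.0},
    {"text": "£0.000", "x0": 425.0, "x1": 450.0, "top": 320.0},
    {"text": "20%", "x0": 470.0, "x1": 485.0, "top": 320.0},
    {"text": "£0.000", "x0": 510.0, "x1": 535.0, "top": 320.0},
]

table = bin_words_into_columns(sample_words, vlines)
for line in table:
    print("binned line:", line)
```

Against the real page this would be called as bin_words_into_columns(pagecropped.extract_words(), vlines). If the tenth cell is filled here but not by extract_table, the words are inside the last pair of vertical lines and the problem is more likely in how the table cells are formed than in the vlines values.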