I want to extract semi-structured tables from PDF files. I am open to modules other than pdfplumber if they work better. I need more than the table itself: sometimes text above the table is still part of it (for example, column names are sometimes printed above the table), and sometimes the table continues on the next page.
I tried extract_text_lines() and it works fine. I want to go through the PDF line by line and, when a line belongs to a table, start collecting its data.
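import pdfplumber

def extract_table_from_page(pdf_path, page_number):
    # page_number is a zero-based index into pdf.pages
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        # Each line is a dict with its text, bounding box and characters
        lines = page.extract_text_lines()
        for line in lines:
            if "chars" in line:
                print(line)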
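A rough sketch of the line-by-line idea, not a tested solution: it assumes a line is "table-like" when its characters fall into at least two groups separated by a wide horizontal gap. The function names `split_into_cells` and `collect_table_blocks` and the thresholds `cell_gap`, `space_gap` and `min_cells` are made up for illustration; the input is assumed to be a list of dicts shaped like the output of `extract_text_lines()`, where each entry in `chars` has `text`, `x0` and `x1`. Rotated (vertical) text is not handled.

```python
def split_into_cells(line, cell_gap=10.0, space_gap=1.5):
    """Split one text line into cells wherever the horizontal gap is wide.

    cell_gap and space_gap are hypothetical thresholds in PDF points and
    will need tuning per document.
    """
    chars = sorted(line["chars"], key=lambda c: c["x0"])
    cells, current, prev_x1 = [], "", None
    for ch in chars:
        gap = 0.0 if prev_x1 is None else ch["x0"] - prev_x1
        if gap > cell_gap:
            # Wide gap: close the current cell and start a new one
            cells.append(current)
            current = ""
        elif gap > space_gap and not current.endswith(" "):
            # Small gap: treat it as a word break inside the same cell
            current += " "
        current += ch["text"]
        prev_x1 = ch["x1"]
    cells.append(current)
    # Normalize whitespace and drop empty cells
    return [" ".join(c.split()) for c in cells if c.strip()]


def collect_table_blocks(lines, min_cells=2):
    """Yield runs of consecutive table-like lines as lists of cell lists."""
    block = []
    for line in lines:
        cells = split_into_cells(line)
        if len(cells) >= min_cells:
            block.append(cells)
        elif block:
            # A non-table line ends the current block
            yield block
            block = []
    if block:
        yield block
```

Passing the lines of several pages as one list (for example by concatenating `page.extract_text_lines()` for each page in `pdf.pages`) lets a block run across a page break, provided no page header or footer line sits in between.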
Here is a PyMuPDF example of a table with external column headers whose text is rotated at several different angles, including multi-line column headers.
Some of the column names are vertical.
Here is a PyMuPDF script which finds and extracts the table, identifies the column names and prints the table contents in markdown format (GitHub-compatible):
BTW: Other formats are available too, like a Python list of lists or output to pandas DataFrame.
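The list-of-lists form is convenient for the "table continues on the next page" case from the question. A minimal sketch, assuming each table is the list of rows returned by `Table.extract()`, that a continuation has the same number of columns as the previous table, and that it may repeat the header row; the name `merge_continued` is hypothetical and the matching rule is a heuristic, not something PyMuPDF does for you.

```python
def merge_continued(page_tables):
    """Merge consecutive tables that look like one table split across pages.

    page_tables: tables in page order, each a list of rows (lists of cells),
    with the header as the first row.
    """
    merged = []
    for rows in page_tables:
        if not rows:
            continue
        if merged and len(merged[-1][0]) == len(rows[0]):
            header = merged[-1][0]
            # Skip the first row if the continuation repeats the header
            body = rows[1:] if rows[0] == header else rows
            merged[-1].extend(list(r) for r in body)
        else:
            # Different column count: treat it as a new table
            merged.append([list(r) for r in rows])
    return merged
```

Two unrelated tables that happen to have the same column count would also be merged by this rule, so a stricter check (for example comparing column positions) may be needed in practice.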
Note: I am a maintainer and the original creator of PyMuPDF.