How to read a table that spans multiple pages in a PDF without knowing how many pages it covers?


I am trying to read a table from a PDF and turn it into a pandas DataFrame. However, some of the tables span multiple pages, and I don't know in advance which ones. Is there a way to read such a table as one table without telling tabula which pages it is on?

I have tried simply reading the table, but the result is cut off at the end of the page. I have also tried "lattice=True", but that only combines the lines within a cell.

There is 1 solution below

TFR On

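I have found a solution. I had to write a function, but it seems to work well. It reads one page at a time, starting from the page where the table begins, and stops as soon as a page has no table or the first column name no longer contains a keyword that appears in every header of the table.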

import tabula
import pandas as pd  

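def read_table(start_page: int, file_name: str) -> pd.DataFrame:
    """Read a table that starts on start_page and may continue on later pages."""
    page = start_page
    frames = []

    while True:
        try:
            # lattice=True uses the ruling lines in the PDF to detect cells
            tables = tabula.read_pdf(file_name, pages=page, lattice=True)
            temp_df = tables[0]
        except IndexError:
            # No table found on this page, so the table has ended
            break

        # Replace the placeholder with a keyword that appears in the header
        # on every page of the table; a different header means a new table
        if '{keyword in every header}' not in str(temp_df.columns[0]):
            break

        frames.append(temp_df)
        page += 1

    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, axis=0, join='outer', ignore_index=True)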
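
If the first page of the table is not known either, one possible variation is to read every page in a single call with pages="all" (a documented tabula-py option) and then stitch together consecutive fragments that share the same header. The sketch below shows only the stitching step, using small stand-in DataFrames instead of a real PDF. The helper name merge_continued_tables is hypothetical, and it assumes, like the function above, that the header row is repeated on every page of a table.

```python
import pandas as pd


def merge_continued_tables(tables):
    """Merge consecutive DataFrames that have identical column labels.

    Assumption: a table that continues onto the next page repeats its
    header row, so all of its fragments share the same columns.
    """
    groups = []
    for df in tables:
        if groups and list(groups[-1][-1].columns) == list(df.columns):
            groups[-1].append(df)   # same header: continuation of the table
        else:
            groups.append([df])     # different header: a new table starts
    return [pd.concat(group, ignore_index=True) for group in groups]


# Stand-in fragments; in practice the list could come from
# tabula.read_pdf(file_name, pages="all", lattice=True)
page1 = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
page2 = pd.DataFrame({"id": [3], "name": ["c"]})
other = pd.DataFrame({"code": ["x"]})

merged = merge_continued_tables([page1, page2, other])
```

Here merged holds two DataFrames: the first combines page1 and page2, and the second is the unrelated table. Note that two different tables that happen to have identical headers and follow each other directly would also be merged, so the keyword check from the function above may still be the safer choice for some documents.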