Extraction of complex tables from a PDF using Python


I need to extract data from a single-page PDF file that has the following structure:

[image: tables]

The number of subcolumns may vary from column to column, as may the number of rows. Some of the columns may also have missing (empty) data.

*For clarity, the sub-subcolumns are not shown in the structure above (each subcolumn always has 3 sub-subcolumns).
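To make the layout concrete, here is a minimal pandas sketch of that header structure as a three-level MultiIndex. The column names (A, B, A1, x/y/z) and the values are made up for illustration; only the shape matters: a varying number of subcolumns, each with 3 sub-subcolumns, and some empty cells.

```python
import numpy as np
import pandas as pd

# Hypothetical layout: column A has 2 subcolumns, column B has 1.
subcolumns = {"A": ["A1", "A2"], "B": ["B1"]}
leaves = ["x", "y", "z"]  # every subcolumn has exactly 3 sub-subcolumns

# Build one (column, subcolumn, sub-subcolumn) tuple per leaf column.
header = pd.MultiIndex.from_tuples(
    [
        (col, sub, leaf)
        for col, subs in subcolumns.items()
        for sub in subs
        for leaf in leaves
    ],
    names=["column", "subcolumn", "sub_subcolumn"],
)

# Two made-up rows; NaN marks the missing (empty) cells.
rows = [
    [1, 2, 3, 4, np.nan, 6, 7, 8, 9],
    [10, np.nan, 12, 13, 14, 15, np.nan, 17, 18],
]
expected = pd.DataFrame(rows, columns=header)
```

This is roughly the shape I would like to end up with after extraction, so that something like expected["A"]["A1"] returns the 3 sub-subcolumns of one subcolumn.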

The code I use is this:

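import pandas as pd
import tabula.io as tb


def toPDFPag2(pathPDF, nPag, pathxlsx):
    # Read every table tabula detects on page nPag (returns a list of DataFrames).
    table = tb.read_pdf(pathPDF, pages=nPag, multiple_tables=True)
    # Stack the detected tables into a single DataFrame.
    df = pd.concat(table, ignore_index=True)

    #df.to_excel(pathxlsx, sheet_name='Sheet 1')
    return df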

It usually works and needs only a bit of manual formatting afterwards.

However, sometimes it fails to extract the data: it returns only one row, or it misses some rows. What can I do to fix this?
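One thing that might be worth trying is running tabula in both of its detection modes (lattice, for tables with ruling lines, and stream, for whitespace-separated tables) and keeping whichever result has more rows. The sketch below is not a confirmed fix. The helper names (total_rows, pick_best, extract_both_modes) are hypothetical, and "most rows" is only a rough heuristic for "nothing was dropped".

```python
import pandas as pd


def total_rows(tables):
    """Total number of rows across a list of DataFrames."""
    return sum(len(t) for t in tables)


def pick_best(candidates):
    """Return (mode, tables) for the candidate with the most rows.

    `candidates` maps a mode name to the list of DataFrames it produced.
    """
    mode = max(candidates, key=lambda m: total_rows(candidates[m]))
    return mode, candidates[mode]


def extract_both_modes(path_pdf, page=1):
    """Run tabula in lattice and stream mode and keep the fuller result."""
    import tabula.io as tb  # imported lazily; needs tabula-py and Java

    candidates = {
        "lattice": tb.read_pdf(
            path_pdf, pages=page, lattice=True, multiple_tables=True
        ),
        "stream": tb.read_pdf(
            path_pdf, pages=page, stream=True, multiple_tables=True
        ),
    }
    return pick_best(candidates)
```

Calling extract_both_modes("file.pdf") (the path is a placeholder) would return the mode name together with its list of DataFrames, which can then be concatenated as in the function above.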
