Extract specific information from multiple page pdf (and multiple pdf files from a folder) to CSV

383 Views Asked by Paulzzz At 21 July 2023 at 14:17

I want to extract the following information from the attached sample pdf file.

Exporter/supplier Name
Customs Reference No
Country of Origin
Item Description
Invoice No
Assessment Date
Import/Registration Date
Taxable Value
Total VAT

to be done on multiple pdf files and they have the same per page format.

In this drive folder are my reference files - https://drive.google.com/drive/folders/1QjSkC3oiTGSwrOdCEurI9aCN9awOa1C-?usp=sharing'

Currently, what I'm doing is using this code to save only # 1 to 7. I used pdfquery and pandas.

    import pdfquery
    import pandas as pd

    file = 'PDFproject/65140.pdf'

    pdf = pdfquery.PDFQuery(file)
    pdf.load()
    pdf.tree.write('pdfXML.txt', pretty_print = True)

    def pdfscrape(pdf):
    # Extract each relevant information individually
        IERD_ENTRY_NUMBER = pdf.pq('LTTextLineHorizontal:overlaps_bbox("492.0, 763.85, 530.35, 773.85")').text()
        REGISTERED_NAME = pdf.pq('LTTextLineHorizontal:overlaps_bbox("61.0, 773.884, 198.8, 781.884")').text()
        COUNTRY_OF_ORIGIN = pdf.pq('LTTextLineHorizontal:overlaps_bbox("299.0, 609.85, 330.11, 619.85")').text()
        Item_Description = pdf.pq('LTTextLineHorizontal:overlaps_bbox("69.0, 429.918, 245.382, 435.918")').text()
        Commercial_InvoiceNo = pdf.pq('LTTextLineHorizontal:overlaps_bbox("112.0, 293.85, 179.83, 303.85")').text()
        RELEASE_DATE = pdf.pq('LTTextLineHorizontal:overlaps_bbox("495.0, 227.884, 535.032, 235.884")').text()
        IMPORTATION_DATE = pdf.pq('LTTextLineHorizontal:overlaps_bbox("424.0, 752.85, 515.74, 762.85")').text()
    # Combined all relevant information into single observation
        page = pd.DataFrame({
                            'IERD_ENTRY_NUMBER': IERD_ENTRY_NUMBER,
                            'REGISTERED_NAME': REGISTERED_NAME,
                            'COUNTRY_OF_ORIGIN': COUNTRY_OF_ORIGIN,
                            'Item_Description': Item_Description,
                            'Commercial_InvoiceNo': Commercial_InvoiceNo,
                            'RELEASE_DATE': RELEASE_DATE,
                            'IMPORTATION_DATE': IMPORTATION_DATE
                        }, index=[0])
        return(page)

    # Loop to all pages and output to CSV   
    pagecount = pdf.doc.catalog['Pages'].resolve()['Count']
    master = pd.DataFrame()
    for p in range(pagecount):
        pdf.load(p)
        page = pdfscrape(pdf)
        master = master.append(page, ignore_index=True)
    master.to_csv('output.csv', index = False)

But this only works for pdf file with the same format on all its pages. I workaround with my multiple pdf files with pdfsam by extracting the 1st page of all my pdfs and merging all the extracted 1st page so it will be just 1 pdf file with the same format in all its pages. I cannot get data of #8-9 with this because it is from the other page of each individual pdf.

My goal is to parse the #1-9 (will be the columns of my csv) data from multiple pdf files (will be the rows of the csv) on a folder to a csv file (see excel file).

I'm asking here because I am totally new to coding and I'm stuck, I recently stared reading and watching yt tutorials to learn python application for accounting.

Thank you!

Original Q&A

Extract specific information from multiple page pdf (and multiple pdf files from a folder) to CSV

There are 0 best solutions below

Related Questions in PYTHON

Related Questions in PANDAS

Related Questions in CSV

Related Questions in PDF

Related Questions in ACCOUNTING

Trending Questions

Popular # Hahtags

Popular Questions