Extract specific information from multiple page pdf (and multiple pdf files from a folder) to CSV

383 Views Asked by At

I want to extract the following information from the attached sample pdf file.

  1. Exporter/supplier Name
  2. Customs Reference No
  3. Country of Origin
  4. Item Description
  5. Invoice No
  6. Assessment Date
  7. Import/Registration Date
  8. Taxable Value
  9. Total VAT

to be done on multiple pdf files and they have the same per page format.

In this drive folder are my reference files - https://drive.google.com/drive/folders/1QjSkC3oiTGSwrOdCEurI9aCN9awOa1C-?usp=sharing'

Currently, what I'm doing is using this code to save only # 1 to 7. I used pdfquery and pandas.

    import pdfquery
    import pandas as pd

    file = 'PDFproject/65140.pdf'

    pdf = pdfquery.PDFQuery(file)
    pdf.load()
    pdf.tree.write('pdfXML.txt', pretty_print = True)

    def pdfscrape(pdf):
    # Extract each relevant information individually
        IERD_ENTRY_NUMBER = pdf.pq('LTTextLineHorizontal:overlaps_bbox("492.0, 763.85, 530.35, 773.85")').text()
        REGISTERED_NAME = pdf.pq('LTTextLineHorizontal:overlaps_bbox("61.0, 773.884, 198.8, 781.884")').text()
        COUNTRY_OF_ORIGIN = pdf.pq('LTTextLineHorizontal:overlaps_bbox("299.0, 609.85, 330.11, 619.85")').text()
        Item_Description = pdf.pq('LTTextLineHorizontal:overlaps_bbox("69.0, 429.918, 245.382, 435.918")').text()
        Commercial_InvoiceNo = pdf.pq('LTTextLineHorizontal:overlaps_bbox("112.0, 293.85, 179.83, 303.85")').text()
        RELEASE_DATE = pdf.pq('LTTextLineHorizontal:overlaps_bbox("495.0, 227.884, 535.032, 235.884")').text()
        IMPORTATION_DATE = pdf.pq('LTTextLineHorizontal:overlaps_bbox("424.0, 752.85, 515.74, 762.85")').text()
    # Combined all relevant information into single observation
        page = pd.DataFrame({
                            'IERD_ENTRY_NUMBER': IERD_ENTRY_NUMBER,
                            'REGISTERED_NAME': REGISTERED_NAME,
                            'COUNTRY_OF_ORIGIN': COUNTRY_OF_ORIGIN,
                            'Item_Description': Item_Description,
                            'Commercial_InvoiceNo': Commercial_InvoiceNo,
                            'RELEASE_DATE': RELEASE_DATE,
                            'IMPORTATION_DATE': IMPORTATION_DATE
                        }, index=[0])
        return(page)

    # Loop to all pages and output to CSV   
    pagecount = pdf.doc.catalog['Pages'].resolve()['Count']
    master = pd.DataFrame()
    for p in range(pagecount):
        pdf.load(p)
        page = pdfscrape(pdf)
        master = master.append(page, ignore_index=True)
    master.to_csv('output.csv', index = False)

    

But this only works for pdf file with the same format on all its pages. I workaround with my multiple pdf files with pdfsam by extracting the 1st page of all my pdfs and merging all the extracted 1st page so it will be just 1 pdf file with the same format in all its pages. I cannot get data of #8-9 with this because it is from the other page of each individual pdf.

My goal is to parse the #1-9 (will be the columns of my csv) data from multiple pdf files (will be the rows of the csv) on a folder to a csv file (see excel file).

I'm asking here because I am totally new to coding and I'm stuck, I recently stared reading and watching yt tutorials to learn python application for accounting.

Thank you!

0

There are 0 best solutions below