I want to extract the following information from the attached sample pdf file.
- Exporter/supplier Name
- Customs Reference No
- Country of Origin
- Item Description
- Invoice No
- Assessment Date
- Import/Registration Date
- Taxable Value
- Total VAT
to be done on multiple pdf files and they have the same per page format.
In this drive folder are my reference files - https://drive.google.com/drive/folders/1QjSkC3oiTGSwrOdCEurI9aCN9awOa1C-?usp=sharing'
Currently, what I'm doing is using this code to save only # 1 to 7. I used pdfquery and pandas.
import pdfquery
import pandas as pd
file = 'PDFproject/65140.pdf'
pdf = pdfquery.PDFQuery(file)
pdf.load()
pdf.tree.write('pdfXML.txt', pretty_print = True)
def pdfscrape(pdf):
# Extract each relevant information individually
IERD_ENTRY_NUMBER = pdf.pq('LTTextLineHorizontal:overlaps_bbox("492.0, 763.85, 530.35, 773.85")').text()
REGISTERED_NAME = pdf.pq('LTTextLineHorizontal:overlaps_bbox("61.0, 773.884, 198.8, 781.884")').text()
COUNTRY_OF_ORIGIN = pdf.pq('LTTextLineHorizontal:overlaps_bbox("299.0, 609.85, 330.11, 619.85")').text()
Item_Description = pdf.pq('LTTextLineHorizontal:overlaps_bbox("69.0, 429.918, 245.382, 435.918")').text()
Commercial_InvoiceNo = pdf.pq('LTTextLineHorizontal:overlaps_bbox("112.0, 293.85, 179.83, 303.85")').text()
RELEASE_DATE = pdf.pq('LTTextLineHorizontal:overlaps_bbox("495.0, 227.884, 535.032, 235.884")').text()
IMPORTATION_DATE = pdf.pq('LTTextLineHorizontal:overlaps_bbox("424.0, 752.85, 515.74, 762.85")').text()
# Combined all relevant information into single observation
page = pd.DataFrame({
'IERD_ENTRY_NUMBER': IERD_ENTRY_NUMBER,
'REGISTERED_NAME': REGISTERED_NAME,
'COUNTRY_OF_ORIGIN': COUNTRY_OF_ORIGIN,
'Item_Description': Item_Description,
'Commercial_InvoiceNo': Commercial_InvoiceNo,
'RELEASE_DATE': RELEASE_DATE,
'IMPORTATION_DATE': IMPORTATION_DATE
}, index=[0])
return(page)
# Loop to all pages and output to CSV
pagecount = pdf.doc.catalog['Pages'].resolve()['Count']
master = pd.DataFrame()
for p in range(pagecount):
pdf.load(p)
page = pdfscrape(pdf)
master = master.append(page, ignore_index=True)
master.to_csv('output.csv', index = False)
But this only works for pdf file with the same format on all its pages. I workaround with my multiple pdf files with pdfsam by extracting the 1st page of all my pdfs and merging all the extracted 1st page so it will be just 1 pdf file with the same format in all its pages. I cannot get data of #8-9 with this because it is from the other page of each individual pdf.
My goal is to parse the #1-9 (will be the columns of my csv) data from multiple pdf files (will be the rows of the csv) on a folder to a csv file (see excel file).
I'm asking here because I am totally new to coding and I'm stuck, I recently stared reading and watching yt tutorials to learn python application for accounting.
Thank you!