I am working on a project which requires parsing a PDF hosted online for relevant data. The pdf can be found here. I started by using tabula-py to parse the table.
However, because it is in a flat-file structure (repeating redundant data is instead left as a blank cell for readability) difficulties have presented themselves, such as data leaking to adjacent columns when being parsed.
I decided to try to post-process the PDF after downloading it, instead of just accessing it through the internet so that I could add vertical lines to the PDF. The vertical lines would be placed such that it created a lattice structure for the table, making it easier to parse and read.
I have tried to use the pypdf library to do this, as you can see below:
from pypdf import PdfWriter, PdfReader
from pypdf.annotations import Line
def addVerticalLines(inputPath, outputPath, xCoordinates):
reader = PdfReader(inputPath)
writer = PdfWriter()
# Iterate through each page in the input PDF
for i, page in enumerate(reader.pages):
writer.add_page(page)
for x in xCoordinates:
#Create the line
annotation = Line(
rect=(x, 0, x + 1, page.mediabox[3]),
p1=(x, 0),
p2=(x, page.mediabox[3])
)
writer.add_annotation(page_number=i, annotation=annotation)
# Write the modified PDF to the output file
with open(outputPath, 'wb') as outputFile:
writer.write(outputFile)
The result was an outputted pdf which looked the exact same as the downloaded PDF. On top of that, I got warnings in the console from the pypdf library saying:
Incorrect first char in NameObject:(None)
The number of times that the warnings appear is dependent on the number of pages in the pdf and the number of annotations. The warnings appear when the new PDF is saved to disk.
Is this a known issue? Is there a different library I should use instead? I would appreciate any assistance in drawing vertical lines on a local PDF. Thank you!
Here is a PyMuPDF solution.
Draws a vertical line. Not all possible parameters shown, like dashing pattern, putting in background / foreground or Optional Content visibility options.
Note: I am a maintainer and the original creator of PyMuPDF.