I'm new to PDF parsing.
I want to detect whether a PDF file contains a graph, so that I can skip it and not extract the text that belongs to it. All I know about the PDF is that it was generated from Word (not scanned).
Input: a PDF with a graph such as this one. Output: true or false.
pdfplumber recognizes tables but doesn't seem to recognize graphs. I tried detecting curves and rectangles, but the results are not consistent.
Is there another way?
Thank you!
Option 1:
(Thanks to @KJ's comment.) I ended up using some rough estimates to decide whether a page contains a graph.
If a page has more than MIN_RECTS rectangles, I assume it contains a graph (the columns of a bar chart are perceived as rectangles). Likewise, if a page has more than MIN_CURVES curves, I assume it contains a graph (for me MIN_CURVES was 0, but that depends on whether you have non-trivial shapes in the header or footer). It's not perfect, but it works most of the time.
Using both checks and then calling extract_text() afterwards gives pretty good results for me.
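A minimal sketch of that heuristic with pdfplumber could look like the following. The function names and the MIN_RECTS value are assumptions made for illustration (only MIN_CURVES = 0 comes from my own files), so tune both thresholds on your documents. It relies on pdfplumber's page.rects and page.curves lists and on page.extract_text().

```python
# Thresholds are assumptions; tune them on your own documents.
MIN_RECTS = 10   # hypothetical value: more rects than this suggests a bar chart
MIN_CURVES = 0   # any curve at all counts as a graph (the value that worked for me)


def looks_like_graph(n_rects, n_curves):
    """Return True if the shape counts suggest the page contains a graph."""
    return n_rects > MIN_RECTS or n_curves > MIN_CURVES


def page_has_graph(page):
    """Apply the heuristic to a pdfplumber page via page.rects and page.curves."""
    return looks_like_graph(len(page.rects), len(page.curves))


def extract_text_skipping_graphs(pdf_path):
    """Return the text of every page that does not look like it holds a graph."""
    import pdfplumber  # imported lazily so the helpers above work without it

    texts = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            if page_has_graph(page):
                continue  # skip the whole page when a graph is suspected
            # extract_text() may return None for an empty page
            texts.append(page.extract_text() or "")
    return texts
```

Note that this skips the entire page, not just the graph region. If you only need the true/false answer, any(page_has_graph(p) for p in pdf.pages) is enough.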
Option 2:
Following @G5W's comment, it is also possible to use pywin32 to open the PDF in MS Word and save it as a Word file, then extract only the text with python-docx, for example.
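A rough sketch of this route, assuming Windows with MS Word installed plus the pywin32 and python-docx packages. The helper names are hypothetical, I have not tuned this on many files, and Word may still show a conversion notice for PDFs on some setups. The idea is that python-docx's document.paragraphs only yields body paragraphs, so content Word turned into pictures or drawings is left out.

```python
import os


def pdf_to_docx(pdf_path, docx_path):
    """Open a PDF in Word through COM and save it as .docx (Windows + Word only)."""
    import win32com.client  # provided by pywin32

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        # Word expects absolute paths; ConfirmConversions=False skips the format prompt.
        doc = word.Documents.Open(os.path.abspath(pdf_path), ConfirmConversions=False)
        # 16 is wdFormatDocumentDefault, i.e. a regular .docx file
        doc.SaveAs2(os.path.abspath(docx_path), FileFormat=16)
        doc.Close()
    finally:
        word.Quit()


def paragraphs_to_text(paragraphs):
    """Join the non-empty paragraph texts, one per line."""
    return "\n".join(p.text for p in paragraphs if p.text.strip())


def docx_text(docx_path):
    """Return the body text of a .docx file using python-docx."""
    from docx import Document  # provided by python-docx

    return paragraphs_to_text(Document(docx_path).paragraphs)
```

Usage would be pdf_to_docx("in.pdf", "out.docx") followed by docx_text("out.docx"). Text inside tables is not part of document.paragraphs, so it would need to be read separately if you want it.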