Match up links with text using PyMuPDF?


I would like to extract text and links from PDF files using PyMuPDF. I have extracted the links using page.get_links() but what is the best method for matching the links with the text from page.get_text()?


Timeless

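IIUC, each item returned by page.get_links() is a dict whose "from" key holds the link's Rect. For the links that carry a "uri", you can pass that Rect to page.get_textbox() to get the text underneath it: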

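import fitz  # pymupdf

links = {}  # {page number (1-based): {link text: uri}}
with fitz.open("input.pdf") as doc:
    for page in doc:
        links[page.number + 1] = {
            # text inside the link rectangle, without surrounding dots
            page.get_textbox(d["from"]).strip("."): d["uri"]
            for d in page.get_links()
            if "uri" in d  # internal (go-to-page) links have no "uri"
        }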

Output:

{
    1: {
        "StackOverflow": "https://stackoverflow.com/",
        "Meta": "https://meta.stackoverflow.com/",
        "GIS Exchange": "https://gis.stackexchange.com/",
    }
}

Input used (input.pdf): shown as a screenshot in the original answer.
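If get_textbox returns clipped or extra text because a link rectangle is slightly larger or smaller than the words it covers, a word-level variant may be worth trying. The following is only a sketch, not part of the answer above: it assumes words shaped like the tuples from page.get_text("words") and links shaped like the dicts from page.get_links(), and the helper names match_links_to_words and links_with_text are made up for illustration. It returns a list of (text, uri) pairs rather than a dict, so two links with the same anchor text are both kept.

```python
def match_links_to_words(words, links):
    """Pair each URI link with the words whose centre lies inside its rectangle.

    words: tuples shaped like page.get_text("words") items:
           (x0, y0, x1, y1, text, block_no, line_no, word_no)
    links: dicts shaped like page.get_links() items ("from" is the link Rect).
    Hypothetical helper, not a PyMuPDF function.
    """
    pairs = []
    for link in links:
        if "uri" not in link:  # skip internal (go-to-page) links
            continue
        lx0, ly0, lx1, ly1 = link["from"]  # Rect unpacks like a 4-tuple
        inside = []
        for x0, y0, x1, y1, text, *_ in words:
            # Use the word's centre so small rectangle mismatches do not matter
            cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
            if lx0 <= cx <= lx1 and ly0 <= cy <= ly1:
                inside.append(text)
        pairs.append((" ".join(inside), link["uri"]))
    return pairs


def links_with_text(path):
    """Apply the matcher to every page of a PDF (hypothetical wrapper)."""
    import fitz  # PyMuPDF, imported here so the matcher itself has no dependency

    result = {}
    with fitz.open(path) as doc:
        for page in doc:
            result[page.number + 1] = match_links_to_words(
                page.get_text("words"), page.get_links()
            )
    return result
```

Usage would be something like links_with_text("input.pdf"), giving one list of (text, uri) pairs per 1-based page number.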