Converting PDF to Markdown in Python with structure preservation


I need to convert a PDF text document to Markdown while maintaining its structure (i.e. numbered headers and subheaders should get the corresponding number of hash characters # in Markdown, so the structure tree stays the same). So far I have only explored pdfminer.six, but I am basically just extracting text, and I don't see any functionality for mapping the structure tree to Markdown. Or am I wrong?

What matters to me is converting the document to text while retaining the hierarchy of the structure tree. Whether that takes one step or two makes no difference to me.

Are there any Python libraries or best practices that have proven effective in similar scenarios? I am looking for a solution that can scale to hundreds of documents, so ideally nothing hardcoded, even though the documents will share most of their structure and numbering.


There is 1 best solution below

M.Mark

Maybe try llama_parse with result_type="markdown". This worked for me.

code:

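from llama_parse import LlamaParse  # pip install llama-parse

parser = LlamaParse(
    api_key="...",  # you will need an API key, get it from https://cloud.llamaindex.ai/
    result_type="markdown",  # "markdown" and "text" are available
)

# load_data returns a list of Document objects; the Markdown is in each one's .text
documents = parser.load_data("./my_file.pdf")
markdown = "\n\n".join(doc.text for doc in documents)

# Save the result next to the source PDF
with open("./my_file.md", "w", encoding="utf-8") as f:
    f.write(markdown)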
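
If a hosted service is not an option, the mapping the question describes (numbering depth to number of # characters) can also be sketched locally on text that has already been extracted. The following is only a rough sketch, not a feature of any library: the function name, the regular expression, and the rule that a heading is a line starting with a dotted number followed by a capitalised title are all assumptions made for illustration, and they would need tuning against the real documents.

```python
import re

# Assumed heading shape (hypothetical): a dotted number such as "2" or "2.1.3",
# an optional trailing dot, whitespace, then a title starting with an uppercase letter.
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+([A-Z].*)$")


def numbered_headings_to_markdown(text: str, max_depth: int = 6) -> str:
    """Prefix numbered heading lines with '#' characters according to their depth."""
    out = []
    for line in text.splitlines():
        match = HEADING_RE.match(line.strip())
        if match:
            number, title = match.groups()
            # "1" -> depth 1, "1.2" -> depth 2, and so on; Markdown stops at 6 levels
            depth = min(number.count(".") + 1, max_depth)
            out.append(f"{'#' * depth} {number} {title}")
        else:
            # Anything that does not look like a heading is passed through unchanged
            out.append(line)
    return "\n".join(out)


sample = "1 Introduction\nSome text.\n1.1 Scope\n2.5 kg of material"
print(numbered_headings_to_markdown(sample))
```

The input text could come from any extractor, for example pdfminer.six. Because the rule is purely textual, a body line that happens to start with a number and a capitalised word (say, "3 Apples were used") would be misread as a heading, so checking font size or boldness from the PDF layout is probably a more robust signal when it is available.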