C++: PDF parsing --> extract text --> podofo-0.10.3


I have already compiled PoDoFo 0.10.3 successfully in Visual Studio 2022. Now I want to use the library to extract text from a PDF document, but I am struggling with the API, and I can't find any example of how to do it.

void parseOneFile(const string_view& filename)
{
    PdfMemDocument document;

    document.Load(filename);
    
    // iterate over all pages of the whole pdf document
    for (int pn = 0; pn < document.GetPageCount(); ++pn) 
    {
        PoDoFo::PdfPage* page = document.GetPage(pn);
        // todo: extract the text from the page

    }
}

Unfortunately, the code above does not compile: class PoDoFo::PdfMemDocument has no member GetPageCount.

Does anyone have an idea how to do this? I just want to extract the text and save it in a container like std::vector<std::string> for further processing.

Thank you!


There is 1 best solution below

ThomasAlvaEdison On

After reading the API, I was able to write the following lines of code:

PdfMemDocument document;

document.Load(filename);
PoDoFo::PdfPageCollection& pagetree = document.GetPages();

for (int pn = 0; pn < pagetree.GetCount(); ++pn)
{
    PdfPage& curPdfPage = pagetree.GetPageAt(pn);
    
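    // GetContents() returns a pointer, so guard against a page without contents
    PdfContents* pdfContent = curPdfPage.GetContents();
    if (pdfContent == nullptr)
        continue;

    // Bind by reference instead of copying the object
    PdfObject& oneObject = pdfContent->GetObject();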
    if (oneObject.IsArray())
    {
        PdfArray& array = oneObject.GetArray();
        for (auto& element : array)
        {
            std::cout << element.ToString() << std::endl;
        }
    }
    else if (oneObject.HasStream())
    {
        // The stream holds the page's raw drawing operators, not plain text
        PdfObjectStream* stream = oneObject.GetStream();
    }
    else if (oneObject.IsDictionary())
    {
        PdfDictionary& dict = oneObject.GetDictionary();
    }
}

But I'm not sure if I'm on the right track... I still don't have the data / the text (of type std::string).
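
Walking the contents object by hand only yields the raw content stream, which would still have to be decoded operator by operator. As far as I can tell, the PoDoFo 0.10.x series also offers a higher-level call, PdfPage::ExtractTextTo, which fills a std::vector<PoDoFo::PdfTextEntry> whose entries carry a Text member. Below is a minimal sketch under that assumption; the helper name extractText is made up for illustration and is not part of PoDoFo, and I have not checked how the library orders or splits the fragments.

```cpp
#include <podofo/podofo.h>

#include <exception>
#include <iostream>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not part of PoDoFo): collects the text fragments of
// all pages, in the order the library reports them.
std::vector<std::string> extractText(const std::string_view& filename)
{
    PoDoFo::PdfMemDocument document;
    document.Load(filename); // throws on a missing or malformed file

    std::vector<std::string> result;
    auto& pages = document.GetPages();
    for (unsigned pn = 0; pn < pages.GetCount(); ++pn)
    {
        auto& page = pages.GetPageAt(pn);

        // Assumed API: each PdfTextEntry holds one text fragment plus its position
        std::vector<PoDoFo::PdfTextEntry> entries;
        page.ExtractTextTo(entries);

        for (const auto& entry : entries)
            result.push_back(entry.Text);
    }
    return result;
}

int main(int argc, char* argv[])
{
    if (argc < 2)
    {
        std::cerr << "usage: " << argv[0] << " file.pdf\n";
        return 1;
    }

    try
    {
        for (const auto& fragment : extractText(argv[1]))
            std::cout << fragment << '\n';
    }
    catch (const std::exception& e)
    {
        // PoDoFo reports load and parse failures via exceptions
        std::cerr << "failed to read PDF: " << e.what() << '\n';
        return 1;
    }
    return 0;
}
```

Each entry is a fragment rather than a guaranteed full line or paragraph, so some regrouping may still be needed before further processing.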