c# - PdfDocument.GetTextWithFormatting() does not take all pages


I'm trying to extract the text of a large PDF file with this code:

using BitMiracle.Docotic.Pdf;

PdfDocument pdf = new PdfDocument("document.pdf");
string document = pdf.GetTextWithFormatting();

The string document contains only the first 87 pages (of 174). Why does it take only the first half of the document?

EDIT: This turns out to be an evaluation mode restriction of the library. Are there any alternatives?


There are 2 best solutions below

Bobrovsky On BEST ANSWER

The behavior you observe is because of evaluation mode restrictions. When used in trial mode, the library imposes the following restrictions:

  • Documents generated with the library contain an evaluation notice that is printed across each page.
  • For all existing documents only half of the pages get read by the library.

To evaluate the library without the evaluation mode restrictions you can get a free time-limited license on our site.
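Once you have a license key, it has to be registered before the document is opened. The following is only a sketch: it assumes that `LicenseManager.AddLicenseData` in the `BitMiracle.Docotic` namespace is the licensing entry point in your version of the library, and the key string is a placeholder.

```csharp
using System;
using BitMiracle.Docotic;
using BitMiracle.Docotic.Pdf;

// Assumption: AddLicenseData is the licensing call in your library version.
// The key below is a placeholder, not a real license.
LicenseManager.AddLicenseData("PUT-YOUR-LICENSE-KEY-HERE");

using (PdfDocument pdf = new PdfDocument("document.pdf"))
{
    // With a valid license, text from all pages should be returned.
    string document = pdf.GetTextWithFormatting();
    Console.WriteLine("Extracted {0} characters", document.Length);
}
```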

Alexander Higgins On

You can try reading the text from each page:

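using System;
using System.Text;
using BitMiracle.Docotic.Pdf;

StringBuilder sb = new StringBuilder();
// Extract plain text and skip text that is not visible on the page.
var options = new PdfTextExtractionOptions
{
    WithFormatting = false,
    SkipInvisibleText = true
};
using (PdfDocument pdf = new PdfDocument("document.pdf"))
{
    int pageIndex = 1;
    foreach (var page in pdf.Pages)
    {
        // One console line per page shows how many pages were actually read.
        Console.WriteLine("Page {0}", pageIndex++);
        sb.AppendLine(page.GetText(options));
    }
}
string allText = sb.ToString();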

After doing this, you should see a line in your console for every page in the PDF.

It could be that the pages after 87 don't have text on them. For example, they could be images of scanned pages.

You can test this by trying to select, copy, and paste text from the PDF after page 87. If you can, then odds are it is a bug in the BitMiracle DLL.
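The same check can be done from code by recording which pages yield no text. This is a rough sketch that reuses only the calls from the snippet above (`pdf.Pages`, `page.GetText(options)`); the file name and the `emptyPages` list are placeholders introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using BitMiracle.Docotic.Pdf;

var options = new PdfTextExtractionOptions
{
    WithFormatting = false,
    SkipInvisibleText = true
};

// Hypothetical helper list: 1-based numbers of pages with no extractable text.
var emptyPages = new List<int>();
int pageCount = 0;

using (PdfDocument pdf = new PdfDocument("document.pdf"))
{
    foreach (var page in pdf.Pages)
    {
        pageCount++;
        string text = page.GetText(options);
        if (string.IsNullOrWhiteSpace(text))
            emptyPages.Add(pageCount);
    }
}

Console.WriteLine("Pages read: {0}", pageCount);
Console.WriteLine("Pages without text: {0}",
    emptyPages.Count == 0 ? "none" : string.Join(", ", emptyPages));
```

If fewer pages are read than your PDF viewer shows, the library is not seeing the later pages at all. If they are read but listed as having no text, they are probably scanned images.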