Smalot PDF Parser not working with preg_match_all

146 Views Asked by DSB At 07 January 2024 at 06:30

I'm using https://github.com/smalot/pdfparser and its getText() method to parse a page of a PDF file that contains real text (not an image of text), but when I apply a regex to the extracted text, it doesn't match.

I've echoed the variable containing the extracted text and everything looks fine. I then copied the text shown in the browser into a second variable, but when I compare that variable to the one holding the original text, they're not identical. Any leads?

use Smalot\PdfParser\Parser;

$parser = new Parser();
$pdf = $parser->parseFile($pdfFile);
$pages = $pdf->getPages();

foreach ($pages as $page) {
  $fullPageText = $page->getText();
  echo $fullPageText;
  echo gettype($fullPageText); // Prints string

  // Text copied by hand from the browser output of the echo above
  $copiedTextFromFullPageTextEcho = "...";
  echo $fullPageText === $copiedTextFromFullPageTextEcho ? "Yes" : "No"; // Prints No
  preg_match_all("/CANT\.\s+\S+\s+(.+?)\/.+\/(.+)-(.+)\s+(\d+)(\s+TOT AIS)?/", $fullPageText, $matches, PREG_SET_ORDER);
  print_r($matches); // Prints Array ()
}

There is 1 best solution below

Johan On 19 January 2024 at 14:24

I faced the same problem today. I'm not sure which encoding PDFParser uses for the page text it returns, but the string it returns does differ from what you see and copy from your browser or a text file.
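To see where the two strings diverge, it can help to compare them byte by byte instead of by eye. A browser collapses runs of whitespace when it renders HTML, so a tab or newline in the extracted text can look like a single space once copied. The sketch below is only an illustration: firstDifferenceOffset() is a hypothetical helper, and the two sample strings are made-up stand-ins for the real getText() output and the copied text.

```php
<?php
// Hypothetical helper: returns the byte offset of the first difference
// between two strings, or null when they are identical.
function firstDifferenceOffset(string $a, string $b): ?int
{
    $len = min(strlen($a), strlen($b));
    for ($i = 0; $i < $len; $i++) {
        if ($a[$i] !== $b[$i]) {
            return $i;
        }
    }
    // One string may simply be a prefix of the other.
    return strlen($a) === strlen($b) ? null : $len;
}

// Stand-in values (assumption): a tab in the parser output versus
// a plain space in the text copied from the browser.
$extracted = "CANT.\t12";
$copied    = "CANT. 12";

$offset = firstDifferenceOffset($extracted, $copied);
if ($offset !== null) {
    echo "First difference at byte $offset\n";
    // Hex dumps expose characters that look the same on screen.
    echo 'extracted: ' . bin2hex(substr($extracted, $offset, 4)) . "\n";
    echo 'copied:    ' . bin2hex(substr($copied, $offset, 4)) . "\n";
}
```

Running this against the real $fullPageText and the copied string should show which bytes the parser emits where the browser shows a plain space.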

What helped me was collapsing runs of whitespace in the PDFParser output: $page = preg_replace('/\s+/', ' ', $page); and then passing this $page variable to preg_match.
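The whitespace replacement described above could be sketched as follows, applied before the regex from the question. Two things here are assumptions and not something the answer confirms: that the extracted text may also contain Unicode separators such as the non-breaking space (hence the extra \p{Z} and the u modifier), and the sample line itself, which is invented to stand in for $page->getText(). normalizeWhitespace() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: collapses every run of whitespace, including
// Unicode separators such as U+00A0, into a single ASCII space.
function normalizeWhitespace(string $text): string
{
    // With the u modifier, preg_replace returns null on invalid UTF-8,
    // so fall back to the byte-oriented pattern from the answer.
    $normalized = preg_replace('/[\s\p{Z}]+/u', ' ', $text)
        ?? preg_replace('/\s+/', ' ', $text);

    return trim($normalized ?? $text);
}

// Invented sample standing in for $page->getText() (assumption).
$fullPageText = "CANT.\u{00A0}\tX1\n ABC/def/GH-IJ  42";

$clean = normalizeWhitespace($fullPageText);

// Same pattern as in the question, now run on the normalized text.
preg_match_all(
    "/CANT\.\s+\S+\s+(.+?)\/.+\/(.+)-(.+)\s+(\d+)(\s+TOT AIS)?/",
    $clean,
    $matches,
    PREG_SET_ORDER
);

print_r($matches);
```

If the pattern still fails after normalization, the remaining difference is probably not whitespace, and a byte-level comparison of the two strings is the next thing to try.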
