I'm using https://github.com/smalot/pdfparser and its getText() method to parse a page of a PDF file that contains text (not an image of text), but when I try to apply a regex to this text it doesn't match.
I've echoed the variable containing the extracted text and everything looks fine. I then copied the text shown in the browser and stored it in a second variable, but when I compare that variable to the one containing the original text they are not identical. Any leads?
use Smalot\PdfParser\Parser;

$parser = new Parser();
$pdf = $parser->parseFile($pdfFile);
$pages = $pdf->getPages();
foreach ($pages as $page) {
    $fullPageText = $page->getText();
    echo $fullPageText;
    echo gettype($fullPageText); // Prints "string"
    // Text copied by hand from the browser output of the echo above
    $copiedTextFromFullPageTextEcho = "...";
    echo $fullPageText === $copiedTextFromFullPageTextEcho ? "Yes" : "No"; // Prints "No"
    preg_match_all("/CANT\.\s+\S+\s+(.+?)\/.+\/(.+)-(.+)\s+(\d+)(\s+TOT AIS)?/", $fullPageText, $matches, PREG_SET_ORDER);
    print_r($matches); // Prints Array ()
}
I faced the same problem today. I'm not sure which encoding PDFParser uses for the page text, but the string it returns is indeed different from what you see and copy from your browser or a text file.
What helped me was collapsing runs of whitespace in the PDFParser output: $page = preg_replace('/\s+/', ' ', $page); and then passing this $page variable to preg_match.
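The whitespace cleanup can be wrapped in a small helper. This is only a sketch: normalizeWhitespace() is a hypothetical name, not part of smalot/pdfparser, and it assumes the mismatch comes from tabs, line breaks, repeated spaces, or non-breaking spaces (U+00A0) in the extracted text. Browsers collapse such whitespace when rendering HTML, which would explain why the copied text compares as different.

```php
<?php
// Hypothetical helper (not part of smalot/pdfparser): collapse every run of
// whitespace, including non-breaking spaces, into a single ASCII space.
function normalizeWhitespace(string $text): string
{
    // The /u flag treats the input as UTF-8 so \x{00A0} can be matched.
    $clean = preg_replace('/[\s\x{00A0}]+/u', ' ', $text);

    // preg_replace() returns null when the input is not valid UTF-8;
    // in that case fall back to a byte-wise replacement.
    if ($clean === null) {
        $clean = preg_replace('/\s+/', ' ', $text);
    }

    return trim((string) $clean);
}

// Sample input with a tab, a newline, and two non-breaking spaces.
$raw = "CANT.\t \nABC\u{00A0}\u{00A0}DEF";
echo normalizeWhitespace($raw), PHP_EOL;
```

After normalizing, the pattern only has to deal with single spaces. If the regex still fails, printing bin2hex($fullPageText) shows the exact bytes in the string, which makes invisible characters easy to spot.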