PHP: Unicode nodeValue?


I am trying to extract all the link text and hrefs from an HTML string, but the source string is Unicode, and nodeValue doesn't seem to cope with this.

$links = array();
$titles = array();

$dom = new DOMDocument();
$dom->loadHTML( $str );
$hrefs = $dom->getElementsByTagName("a");
foreach ($hrefs as $href) {
    $links[] = $href->getAttribute("href");
    $titles[] = $href->nodeValue;
}

My source string looks like this:

<p><a href='uploads/root/tr_62.pdf'>Türkiye</a></p> 

But my output for $titles[0] looks like this:

Türkiye

How can I make nodeValue respect the Unicode characters?

Thanks for looking!


There are 2 best solutions below

1
Huynh Son On

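You must use mb_convert_encoding. DOMDocument::loadHTML() treats the input as ISO-8859-1 unless the markup declares its own encoding, so the two UTF-8 bytes of "ü" are read as two separate characters. Converting the non-ASCII characters to HTML entities before loading avoids this: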

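$links = array();
$titles = array();

$dom = new DOMDocument();
// Convert non-ASCII characters to HTML entities so loadHTML() cannot misread the UTF-8 bytes.
$html_data = mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8');
$dom->loadHTML($html_data);
$hrefs = $dom->getElementsByTagName("a");
foreach ($hrefs as $href) {
    $links[] = $href->getAttribute("href");
    // nodeValue is returned as UTF-8, so "Türkiye" comes back intact.
    $titles[] = $href->nodeValue;
}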
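
Note that the 'HTML-ENTITIES' pseudo-encoding in mbstring is deprecated as of PHP 8.2. A minimal sketch of the same idea using mb_encode_numericentity() might look like the following. The helper name extractLinks and the returned array shape are hypothetical, chosen only for illustration, and the source file is assumed to be saved as UTF-8:

```php
<?php
// Hypothetical helper (not from the original post): collects hrefs and link text.
function extractLinks(string $html): array
{
    // Encode every code point from U+0080 upward as a numeric entity,
    // so loadHTML() only ever sees ASCII and cannot guess the encoding wrong.
    $encoded = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $dom = new DOMDocument();
    $dom->loadHTML($encoded);

    $links = [];
    $titles = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $links[] = $a->getAttribute('href');
        $titles[] = $a->nodeValue; // returned as UTF-8
    }

    return ['links' => $links, 'titles' => $titles];
}

// Sample input taken from the question.
$result = extractLinks("<p><a href='uploads/root/tr_62.pdf'>Türkiye</a></p>");
echo $result['titles'][0], "\n";
```

This keeps the approach of the answer (entities in, UTF-8 out) and only swaps the conversion call.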
3
Neil Hillman On

Thanks, user Veve's comment answered my question.

The following line solves my issue:

$str = mb_convert_encoding( $str, 'html-entities', 'utf-8' );
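
For comparison, another commonly used option is to leave the string untouched and instead tell the parser which encoding to expect by prepending a meta declaration. This is only a sketch, assuming the input is valid UTF-8 and that an extra meta element in the parsed document is acceptable. The variable $meta is introduced here for illustration:

```php
<?php
// Sample input taken from the question (source file assumed to be saved as UTF-8).
$str = "<p><a href='uploads/root/tr_62.pdf'>Türkiye</a></p>";

// Assumption: declaring the charset up front makes the HTML parser decode the input as UTF-8.
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

$dom = new DOMDocument();
$dom->loadHTML($meta . $str);

$titles = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    $titles[] = $a->nodeValue;
}
echo $titles[0], "\n";
```

The prepended element stays in the DOM, which matters only if the document is later serialized with saveHTML().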