Prevent DOMXPath query from stripping `<br>` tags

I am currently doing something like this:

$data = $xpath->query("//div[contains(concat (' ', normalize-space(@class), ' '), 'StationDisplay-module')] | //div[1]/div/div/div[3]/div/div/div[1]/div[3]/div[1]/div/div[2]/h3/ancestor::a");

This works and returns the text of the element whose class name contains StationDisplay-module. However, it strips the <br> tag.

For example, if the node content is `this is a<br>dummy text`, it returns `this is adummy text`.

There is 1 best solution below

Harshit Vaid On

I am assuming you have some HTML and are extracting text from it with XPath.

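XPath behaves in a way that can look odd with HTML elements: the string value of a node is the concatenation of its text nodes, so child elements such as <br> are left out. I have made a demo, assuming your markup looks something like this.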

$str = '
<tbody>
  <tr>
   <td class="StationDisplay-module">
        <div>
            <div>
                <h3>
                    <a id="test">19-10-2020 @ 17:33 <br> lllll</a>
                </h3>
            </div>
        </div>
        <div>test</div>
   </td>
   <td class="hidden-xs hidden-sm">
    <a href="#" data-identifier="5f8db1c332ea9b22d375b7c0 <br> gjhgjggjhgjh"></a>                                       
   </td>
  </tr>
</tbody>
';
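$doc = new DOMDocument();
$doc->loadHTML($str);
// Convert to SimpleXML so xpath() can be called directly on the document.
$doc = simplexml_import_dom($doc);
$dates = $doc->xpath("//td[contains(concat (' ', normalize-space(@class), ' '), 'StationDisplay-module')] //div[1]/div/h3/a");
$identifiers = $doc->xpath('//td/a[@href]/@data-identifier');

// Pair the results by position; casting to string yields only the text content.
foreach ($dates as $i => $date) {
    echo trim((string) $date) . "\n";              // the <br> element is dropped here
    echo trim((string) $identifiers[$i]) . "\n";   // <br> is kept: it is plain attribute text
}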

Here you can see that for the first <a> tag, where <br> is a nested element, the <br> is dropped from the returned text. For the second <a> tag, where <br> is part of an attribute value, it is returned intact.
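
The difference between the two cases can be seen by inspecting the nodes directly. The following is a minimal sketch using DOMDocument and DOMXPath rather than SimpleXML; the reduced markup and the variable names are assumptions for illustration, not the asker's real page.

```php
<?php
// Assumed sample markup, reduced from the demo above.
$html = '<div><a id="test">19-10-2020 @ 17:33 <br> lllll</a>'
      . '<a href="#" data-identifier="5f8db1c332ea9b22d375b7c0 <br> gjhgjggjhgjh"></a></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$link = $xpath->query('//a[@id="test"]')->item(0);

// The <a> element has three children: a text node, a <br> element, a text node.
foreach ($link->childNodes as $child) {
    echo $child->nodeName, "\n";   // #text, br, #text
}

// textContent joins only the text nodes, so the <br> element does not appear.
echo $link->textContent, "\n";

// Inside an attribute, "<br>" is ordinary character data and survives.
$attr = $xpath->query('//a[@href]/@data-identifier')->item(0);
echo $attr->nodeValue, "\n";
```

In other words, nothing is being "trimmed": the <br> was never text in the first place, so any approach that reads only the text value will skip it.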

So one thing we can do is replace every <br> in the HTML with a placeholder string before parsing, and then swap the placeholder back to <br> when reading the results.

Please check the example below:

$str = '
<tbody>
  <tr>
   <td class="StationDisplay-module">
        <div>
            <div>
                <h3>
                    <a id="test">19-10-2020 @ 17:33 <br> lllll</a>
                </h3>
            </div>
        </div>
        <div>test</div>
   </td>
   <td class="hidden-xs hidden-sm">
    <a href="#" data-identifier="5f8db1c332ea9b22d375b7c0 <br> gjhgjggjhgjh"></a>                                       
   </td>
  </tr>
</tbody>
';
// Swap every <br> for a placeholder that will survive as plain text.
$placeholder = '%@#$@';
$str = str_replace('<br>', $placeholder, $str);
$doc = new DOMDocument();
$doc->loadHTML($str);
$doc = simplexml_import_dom($doc);
$dates = $doc->xpath("//td[contains(concat (' ', normalize-space(@class), ' '), 'StationDisplay-module')] //div[1]/div/h3/a");
$identifiers = $doc->xpath('//td/a[@href]/@data-identifier');

// Pair the results by position and restore the <br> tags on output.
foreach ($dates as $i => $date) {
    echo trim(str_replace($placeholder, '<br>', (string) $date)) . "\n";
    echo trim(str_replace($placeholder, '<br>', (string) $identifiers[$i])) . "\n";
}
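
Note that the placeholder approach only catches the exact string <br>; variants such as <br/>, <br /> or <BR> would need their own handling. As a possible alternative (a sketch, not the method used above), the children of the matched element can be serialised back to HTML with DOMDocument::saveHTML(), which accepts a node argument. The helper name innerHtml and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: serialise an element's children back to HTML markup.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML($child) returns the markup for that single node.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
// Assumed sample markup based on the example in the question.
$doc->loadHTML('<h3><a id="test">this is a<br>dummy text</a></h3>');
$xpath = new DOMXPath($doc);

foreach ($xpath->query('//a[@id="test"]') as $link) {
    echo trim(innerHtml($link)), "\n";   // this is a<br>dummy text
}
```

This keeps any nested markup, not just <br>, so it may return more than wanted if the element contains other tags; text nodes are also returned HTML-escaped (for example & becomes &amp;).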

If you are referring to some other issue, please add more details to your question.

Hope this will help you.