I was writing a tool to convert HTML tables to CSV and noticed some bizarre behavior. Given this code:
$html = <<<HTML
<table>
<tr><td>A</td><td>Rose</td></tr>
</table>
<h1>Leave me behind</h1>
<table>
<tr><td>By</td><td>Any</td></tr>
</table>
<table>
<tr><td>Other</td><td>Name</td></tr>
</table>
HTML;
$dom = new \DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
foreach ($dom->getElementsByTagName('table') as $table) {
    foreach ($table->getElementsByTagName('tr') as $row) {
        echo trim($row->nodeValue) . PHP_EOL;
    }
}
I would expect output like this:
ARose
ByAny
OtherName
But what I get is this:
ARose
ByAny
OtherName
ByAny
OtherName
I get the same result if I omit the first closing tag. It appears DOMDocument is nesting the second and third <table> inside the first.
Indeed, if I use XPath to select only the immediate <tr> children of each table, I get the correct output:
$xpath = new \DOMXPath($dom);
foreach ($dom->getElementsByTagName('table') as $table) {
    foreach ($xpath->query('./tr', $table) as $row) {
        echo trim($row->nodeValue) . PHP_EOL;
    }
}
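The suspected nesting can also be checked directly. The following is a minimal sketch that counts, for each table, how many <table> ancestors it has; the variable name $ancestorCounts is introduced here for illustration only. If the second and third tables really are nested inside the first, they should report at least one ancestor while the first reports none. The exact tree may depend on the libxml2 version PHP was built against.

```php
<?php
$html = <<<HTML
<table>
<tr><td>A</td><td>Rose</td></tr>
</table>
<h1>Leave me behind</h1>
<table>
<tr><td>By</td><td>Any</td></tr>
</table>
<table>
<tr><td>Other</td><td>Name</td></tr>
</table>
HTML;

$dom = new \DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
$xpath = new \DOMXPath($dom);

// $ancestorCounts is a hypothetical name used only in this sketch.
// Each entry is the number of <table> elements above that table in the tree.
$ancestorCounts = [];
foreach ($dom->getElementsByTagName('table') as $table) {
    $ancestorCounts[] = $xpath->query('ancestor::table', $table)->length;
}

foreach ($ancestorCounts as $i => $count) {
    echo "table #$i has $count <table> ancestor(s)" . PHP_EOL;
}
```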
Enclose your $html with <body> and </body>.
Revised Code (Note: I commented out the $stream lines)
Alternatively, change
to
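As a minimal sketch of the <body> wrapping suggested above: with LIBXML_HTML_NOIMPLIED the parser does not add <html> or <body>, so the first element it sees ends up as the document root and later elements are placed under it. Wrapping the markup in <body> gives the three tables a common parent, so they stay siblings. The names $tables and $rows are my own and are not from the original code.

```php
<?php
$tables = <<<HTML
<table>
<tr><td>A</td><td>Rose</td></tr>
</table>
<h1>Leave me behind</h1>
<table>
<tr><td>By</td><td>Any</td></tr>
</table>
<table>
<tr><td>Other</td><td>Name</td></tr>
</table>
HTML;

// Wrap the fragment so that <body>, not the first <table>, becomes the root.
$html = '<body>' . $tables . '</body>';

$dom = new \DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);

// Collect the text of every row, table by table.
$rows = [];
foreach ($dom->getElementsByTagName('table') as $table) {
    foreach ($table->getElementsByTagName('tr') as $row) {
        $rows[] = trim($row->nodeValue);
    }
}

echo implode(PHP_EOL, $rows) . PHP_EOL;
```

With the wrapper in place, each row should be visited once, so the plain getElementsByTagName loop from the question should no longer need the XPath workaround.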