I need to process multiple very old SHTML files that were written with malformed HTML.
As an example, a given page follows this structure:
<!--#include virtual="../includes/header.shtml"-->
<title>Welcome</title>
<div class="fudgeLeft">
<div class="mainContent">
<link rel="stylesheet" href="../css/style.css">
<img src="hockeyflag.jpg" alt="">
<p>text
<p>text
<p>more text
</div>
<!--#include virtual="../includes/footer.shtml"-->
- The header.shtml includes the opening tags of an HTML document up to and including the <body> tag.
- The footer.shtml includes the closing </div>s, </body>, and </html>.
- Notice that each tag between the header and footer appears on its own line and some tags are not closed properly.
[I honestly don't know what the original developer was thinking (or smoking) when he structured these pages.]
Anyways, I have written a script that scrubs these pages using DOMDocument, converts one specific tag, and saves the updated document as a new file.
The problem I am having is that the newly created file has changed more than it should:
<!--#include virtual="../includes/header.shtml"--><title>Welcome</title><div class="fudgeLeft">
<div class="mainContent">
<link rel="stylesheet" href="../css/style.css" />
<img src="hockeyflag.jpg" alt="" />
<p>text</p>
<p>text</p>
<p>more text</p>
</div>
<!--#include virtual="../includes/footer.shtml"--></div>
- Notice that some lines have been glued together (not a big deal), but the unclosed tags have now been closed. As well, an extra closing </div> has been added after the footer include.
So my question is: is there a way to configure DOMDocument to leave the malformed HTML as-is? My goal is to change only the one tag and otherwise keep the ugly document exactly as it currently is.
My script is quite long, but in short:
$doc = new DOMDocument();
@$doc->loadHTMLFile('path-to-shtml-file', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
// convert one tag
$doc->saveHTMLFile('path-to-new-shtml-file');
And I am running PHP 7.
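In case it helps anyone reproduce this without my files, here is a minimal sketch of the round trip. It assumes the markup is loaded from a string with loadHTML() rather than from a file, the sample in $html is a made-up, shortened stand-in for one of my pages, and no tag is converted at all, so any difference in the output comes from parsing and serializing alone. It also uses libxml_use_internal_errors() instead of @ to keep the parse warnings quiet.

```php
<?php
// Hypothetical, shortened stand-in for one of the SHTML pages.
$html = <<<'HTML'
<!--#include virtual="../includes/header.shtml"-->
<title>Welcome</title>
<div class="fudgeLeft">
<div class="mainContent">
<p>text
<p>more text
</div>
<!--#include virtual="../includes/footer.shtml"-->
HTML;

// Collect libxml parse warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

// Nothing is modified here; the document is only parsed and serialized again.
$out = $doc->saveHTML();

echo $out, "\n";

// Compare how many closing tags exist before and after the round trip.
foreach (['</p>', '</div>'] as $tag) {
    printf("%s before: %d, after: %d\n", $tag, substr_count($html, $tag), substr_count($out, $tag));
}
```

If the counts differ, the parser itself is repairing the markup, independently of whatever change I make to the one tag.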