I'm trying to get the HTML structure of multiple websites using Node.js, and I'm having difficulties. I want just the HTML structure of each document, with no content, while preserving classes, IDs, and other attributes.
Example of what I want back:
<head>
<title></title>
</head>
<body>
<h1></h1>
<div>
<div class="something">
<p></p>
</div>
</div>
</body>
Any suggestions on how to do this? Thanks.
If OP tags his question:
Then why not use the TreeWalker API (available in all browsers since 2011)?
You do not want to extract HTML tags...
You want to remove the text nodes.
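A minimal sketch of that idea, assuming a DOM is available (a browser, or in Node.js a DOM library that implements `createTreeWalker`, such as jsdom). The helper name `stripTextNodes` is made up for illustration, and the literal `4` stands in for `NodeFilter.SHOW_TEXT` so the snippet does not depend on a global `NodeFilter`:

```javascript
// Sketch: remove every text node under `root`, keeping elements and attributes.
// `stripTextNodes` is a hypothetical helper name, not part of any library.
function stripTextNodes(root, doc = root.ownerDocument) {
  const SHOW_TEXT = 4; // numeric value of NodeFilter.SHOW_TEXT
  const walker = doc.createTreeWalker(root, SHOW_TEXT);

  // Collect first, remove afterwards: detaching the walker's current node
  // mid-walk would leave it with no path to the next node.
  const textNodes = [];
  while (walker.nextNode()) textNodes.push(walker.currentNode);

  for (const node of textNodes) node.remove();
  return root;
}
```

Calling `stripTextNodes(document.documentElement)` and then reading `document.documentElement.outerHTML` should give the bare structure. Whitespace between tags is also text, so the result comes out on one line, and the contents of `<script>` and `<style>` elements are removed as well, since they are text nodes too.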
If the page has open shadow roots, you also need to recurse into each shadow DOM.
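One way the recursion could look, as a sketch under the same assumptions (a DOM with `createTreeWalker`; `stripTextNodesDeep` is a hypothetical name). It walks elements and text nodes together, and whenever an element exposes a non-null `shadowRoot` (which is only the case for open roots) it repeats the walk inside that root:

```javascript
// Sketch: strip text nodes from `root` and from any open shadow roots below it.
// `stripTextNodesDeep` is a hypothetical helper name.
function stripTextNodesDeep(root, doc = root.ownerDocument) {
  const SHOW_ELEMENT = 1; // numeric value of NodeFilter.SHOW_ELEMENT
  const SHOW_TEXT = 4;    // numeric value of NodeFilter.SHOW_TEXT

  // The walker never yields `root` itself, so check its shadow root here.
  if (root.shadowRoot) stripTextNodesDeep(root.shadowRoot, doc);

  const walker = doc.createTreeWalker(root, SHOW_ELEMENT | SHOW_TEXT);
  const textNodes = [];
  while (walker.nextNode()) {
    const node = walker.currentNode;
    if (node.nodeType === 3) {
      textNodes.push(node); // text node: remove after the walk
    } else if (node.shadowRoot) {
      stripTextNodesDeep(node.shadowRoot, doc); // open shadow root: recurse
    }
  }

  for (const node of textNodes) node.remove();
  return root;
}
```

Closed shadow roots report `shadowRoot` as `null`, so this sketch leaves them untouched. Note also that `outerHTML` does not serialize shadow trees, so their structure would have to be read from each `shadowRoot.innerHTML` separately.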