How to extract a website's HTML tags in DOM and shadowDOM

Question

580 Views Asked by Brad At 20 October 2022 at 20:23

I'm trying to get the HTML structure of multiple websites using Node.js, and I'm having difficulties. I want only the HTML structure of the document, with no content, while preserving classes, IDs, and other attributes.

Example of what I want back:

<head>
  <title></title>
</head>
<body>
  <h1></h1>
  <div>
    <div class="something">
      <p></p>
    </div>
  </div>
</body>

Any suggestions on how to do this? Thanks.

There are 4 best solutions below

Mina On 20 October 2022 at 20:31

One solution is to match the opening and closing tags with the regex /<\/?.*?>/g, which produces an array of all opening and closing tags without the content, and then join the array.

const html = `<html>
<head>
 <title>title</title> 
</head>
<body>
  <h1>header</h1>
  <div>
    <div class="something">
      <p>paragraph</p>
    </div>
  </div>
</body>
</html>`

const result = html.match(/<\/?.*?>/g).join('');

console.log(result)
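
A caveat worth checking before relying on this: a regex does not parse HTML, so some inputs produce broken output. The sketch below reuses the same regex on two made-up inputs; the helper name tagsOnly and both sample strings are assumptions for illustration only.

```javascript
// Same regex as in the answer above, applied to inputs that trip it up.
const tagRegex = /<\/?.*?>/g;

function tagsOnly(html) {
  // match() returns null when nothing matches, so fall back to an empty array.
  return (html.match(tagRegex) || []).join("");
}

// A ">" inside an attribute value ends the match early, and "<" / ">" inside
// a script body are mistaken for a tag.
const tricky = '<p title="a > b">text</p><script>if (x < 1 && y > 2) {}</script>';
console.log(tagsOnly(tricky));
// → <p title="a ></p><script>< 1 && y ></script>

// "." does not match newlines, so a tag split across lines is missed entirely.
const multiline = '<div\n  class="box">text</div>';
console.log(tagsOnly(multiline));
// → </div>
```

For simple, well-behaved markup the regex may be good enough; for arbitrary websites a real HTML parser is the safer route.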

IT goldman On 20 October 2022 at 20:52

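Basically, you want to remove all text nodes, so it is time to traverse the elements.

But first, we load the HTML string using DOMParser. Note that DOMParser is a browser API; Node.js does not provide it, so there you would need a DOM implementation such as jsdom.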

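// Helper object: collects every text node under an element and empties it.
var EnglishCharFixer = {

  // Entry point: blank all text nodes under elem and return elem.
  do_elem: function(elem) {
    var nodes = this.textNodesUnder(elem);
    this.process_text_nodes(nodes);
    return elem;
  },

  // Recursively gather all text nodes below node, in document order.
  textNodesUnder: function(node) {
    var all = [];
    for (node = node.firstChild; node; node = node.nextSibling) {
      if (node.nodeType === 3) { // 3 === Node.TEXT_NODE
        all.push(node);
      } else {
        all = all.concat(this.textNodesUnder(node));
      }
    }
    return all;
  },

  // Replace the content of each text node with an empty string.
  process_text_nodes: function(nodes) {
    for (var index = 0; index < nodes.length; index++) {
      nodes[index].nodeValue = "";
    }
  }

};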


const htmlString = `
<html>
<head>
  <scr` + `ipt>var x=12</scr` + `ipt>
</head>
<body>
  <h1>this is test</h1>
  <div>
    <p>THIS IS TEXT THAT SHOULDN'T BE IN OUTPUT</p>
  </div> 
</body>
</html>
`;

function removeContentKeepStructure(html) {
  const parser = new DOMParser();
  const doc3 = parser.parseFromString(html, "text/html");
  EnglishCharFixer.do_elem(doc3.documentElement);
  var result = doc3.documentElement.outerHTML;
  return result;
}


console.log(removeContentKeepStructure(htmlString))

Ronnie Royston On 21 October 2022 at 01:06

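Using recursion to clear .textContent on each node, children first, and then reading the .outerHTML property works well. The order matters: once the descendants' text is empty, a parent's .textContent is already empty, so the parent is never reset and its child elements are kept.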

<html>
    <head>
        <title>This is <span>the title</span></title>
        <meta http-equiv="X-UA-Compatible" content="IE=edge">
    </head>
    <body class="my-class">
        <main id="rt">
          <h1>This is a header</h1>
          <div>
            <div class="something">
              <p>This is a <span>paragraph</span></p>
            </div>
            <div id="shadow-rt">
                <div>
                    <span id="shadow-dom-child"></span>
                </div>
            </div>
          </div>
        </main>
    </body>
        <script>
            function walkTree(node) {
              if (node === null) {
                return;
              }
              // do something with node
              for (let i = 0; i < node.childNodes.length; i++) {
                walkTree(node.childNodes[i]);
              }
              if(node.textContent){
                node.textContent = "";
              }
            }
            document.getElementById("rt").attachShadow({mode: 'closed'});
            walkTree(document.getElementById("rt"));
            console.log(document.getElementsByTagName("HTML")[0].outerHTML);
        </script>
</html>

Danny '365CSI' Engelman (Accepted Answer) On 21 October 2022 at 11:10

Given how the OP tagged the question (javascript, html, dom, shadow-dom, treewalker), why not use the TreeWalker API (available in all browsers since 2011)?

You do not want to extract HTML tags; you want to remove text nodes:

  function removeTextNodes( root = document.body ) {
    let node,tree = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
    while (node = tree.nextNode()) node.textContent = "";
    return root.outerHTML;
  }

If the page has open shadow roots, you need to recurse into each shadow DOM as well. Closed shadow roots cannot be reached this way, because element.shadowRoot returns null for them.
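
One way to do that recursion is sketched below. Because outerHTML does not include shadow trees, this version builds the tag-only string itself. It is a minimal sketch, not the answer author's code: the names serializeStructure, escapeAttr, and VOID_TAGS are hypothetical, and wrapping each shadow tree in a declarative shadow DOM template is just one possible way to mark where it belongs. It only reads standard DOM properties (nodeType, localName, attributes, childNodes, shadowRoot).

```javascript
// Sketch only: these helper names are hypothetical, not part of any API.
const VOID_TAGS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr",
]);

// Escape the two characters that would break a double-quoted attribute value.
function escapeAttr(value) {
  return String(value).replace(/&/g, "&amp;").replace(/"/g, "&quot;");
}

function serializeStructure(node) {
  // Emit element nodes only (nodeType 1); text and comment nodes are dropped.
  if (node.nodeType !== 1) return "";

  const tag = node.localName;
  const attrs = Array.from(node.attributes)
    .map((attr) => ` ${attr.name}="${escapeAttr(attr.value)}"`)
    .join("");

  // Void elements have no children and no closing tag.
  if (VOID_TAGS.has(tag)) return `<${tag}${attrs}>`;

  let inner = "";

  // element.shadowRoot is non-null only for shadow roots attached in "open" mode.
  if (node.shadowRoot) {
    let shadowInner = "";
    for (const child of node.shadowRoot.childNodes) {
      shadowInner += serializeStructure(child);
    }
    // Assumption: mark the shadow tree using declarative shadow DOM syntax.
    inner += `<template shadowrootmode="open">${shadowInner}</template>`;
  }

  // Then the regular (light DOM) children.
  for (const child of node.childNodes) {
    inner += serializeStructure(child);
  }

  return `<${tag}${attrs}>${inner}</${tag}>`;
}
```

In a browser console, calling serializeStructure(document.documentElement) should return the page structure, including any open shadow trees, without touching the live document.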
