How can I parse HTML into a DOM tree, taking into account its origin location?

I am writing a user script that runs on https://example.net and makes fetch requests for HTML documents from https://example.com that I want to parse into HTML DOM trees.

The fetch API only gives me the raw HTML source. I can parse it myself using DOMParser, but I run into a problem with relative links. Suppose the document from https://example.com contains something like this:

<!DOCTYPE html>
<html>
  <head>
  <body>
    <p> <a href="/foo">hello!</a>

If I obtain the DOM node for that body > p > a element and read its href property, the value I obtain will be https://example.net/foo. This is because DOMParser assigns the source location of the ambient document to the parsing result. I want to assign it the actual source of the document so that relative links resolve correctly.
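The effect can be reproduced without any parsing, because resolving a relative reference is what the URL constructor does with its second argument. A minimal sketch, in which the two base URLs (including their /some/page and /some/doc paths) are made-up stand-ins for the user script's page and the fetched document:

```javascript
// The same relative reference, resolved against two different base URLs.
const rawHref = "/foo";

// What DOMParser effectively does: the ambient page is the base.
const againstAmbient = new URL(rawHref, "https://example.net/some/page").href;

// What is wanted: the fetched document's own URL is the base.
const againstSource = new URL(rawHref, "https://example.com/some/doc").href;

console.log(againstAmbient); // "https://example.net/foo"
console.log(againstSource);  // "https://example.com/foo"
```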

Right now the only workarounds I can think of are:

  • inject a <base> element into the DOM tree, which may interfere with a <base> tag present in the actual HTML source
  • use document.implementation.createHTMLDocument() and then .write(), which gives me a document with a blank source location, where relative links are at least not resolved incorrectly (but will not be resolved at all). Except this doesn't work in a user script: it throws a SecurityError.
  • use Proxy to intercept accesses to the href property, which seems too heavyweight to comfortably fit in a user script
  • include a userland HTML parser and DOM implementation, which again seems too burdensome
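For reference, the first workaround can probably be made to coexist with a <base> already present in the source: instead of adding a second element, rewrite the existing one so that its href is absolute. The following is only a sketch under that assumption; parseWithBase and effectiveBaseUrl are hypothetical names, and the DOM part needs a browser environment.

```javascript
// Resolve the document's own <base href> (if any) against the URL it was
// fetched from; with no <base>, the fetch URL itself is the base.
function effectiveBaseUrl(existingBaseHref, sourceUrl) {
  return existingBaseHref === null
    ? new URL(sourceUrl).href
    : new URL(existingBaseHref, sourceUrl).href;
}

// Hypothetical helper: parse `html` as if it had been loaded from `sourceUrl`.
// Requires a browser environment (DOMParser).
function parseWithBase(html, sourceUrl) {
  const doc = new DOMParser().parseFromString(html, "text/html");
  // Only the first <base> that has an href attribute affects URL resolution.
  let base = doc.querySelector("base[href]");
  const existingHref = base ? base.getAttribute("href") : null;
  if (!base) {
    base = doc.createElement("base");
    doc.head.prepend(base);
  }
  base.setAttribute("href", effectiveBaseUrl(existingHref, sourceUrl));
  return doc;
}
```

When the HTML comes from fetch, passing response.url as sourceUrl (rather than the URL that was requested) should account for redirects. The parsed tree still differs from the source by one added or modified <base> element, so this does not fully remove the interference concern.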

I also realise that parsing HTML from Unicode text obtained by .text() will bypass the HTML encoding detection algorithm. I can live with that myself, because the site I am interested in exclusively uses UTF-8 correctly denoted in headers, but this is also a flaw that should be noted. Ideally, an HTML document ought to be parsed directly from a Blob or even a ReadableStream.
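To make the encoding point concrete: when staying with fetch, the closest approximation I know of is to read the body as bytes and decode it with the charset declared in the Content-Type header. This is a rough sketch only; both function names are hypothetical, and it ignores <meta charset> declarations, so it is not a substitute for the real detection algorithm.

```javascript
// Pull the charset parameter out of a Content-Type header value,
// falling back to UTF-8 when the header or the parameter is absent.
function charsetFromContentType(contentType) {
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || "");
  return match ? match[1].toLowerCase() : "utf-8";
}

// Decode raw response bytes using the declared charset.
function decodeBody(bytes, contentType) {
  // TextDecoder throws a RangeError for encoding labels it does not recognise.
  return new TextDecoder(charsetFromContentType(contentType)).decode(bytes);
}

// Intended use with fetch (inside an async function):
//   const res = await fetch(url);
//   const html = decodeBody(await res.arrayBuffer(), res.headers.get("Content-Type"));
```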

Is there a better way to accomplish what I want?

There are 2 best solutions below

Dheeraj Vepakomma On BEST ANSWER

Instead of using fetch, use XMLHttpRequest, which has the built-in capability to parse HTML into a Document.

You have to explicitly request a document by assigning the string "document" to the responseType property of the XMLHttpRequest object after calling open() but before calling send().

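const xhr = new XMLHttpRequest();
xhr.onload = () => {
  // responseXML holds the parsed Document; its links resolve
  // against the URL the document was loaded from.
  console.log(
    Array.from(xhr.responseXML.links).map(({ href }) => href)
  );
};
xhr.onerror = () => console.error("Request failed");
xhr.open("GET", "https://example.com");
// Must be set after open() and before send().
xhr.responseType = "document";
xhr.send();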

In my tests relative URLs are converted to absolute URLs based on the source document.
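If the rest of the script is written around promises, as fetch-based code usually is, the request can be wrapped so that it can be awaited. This is a sketch rather than a tested drop-in; fetchDocument is a hypothetical name, and the request remains subject to the same cross-origin rules as the fetch call it replaces.

```javascript
// Hypothetical helper: load `url` and resolve with the parsed Document.
// Requires a browser environment (XMLHttpRequest).
function fetchDocument(url) {
  return new Promise((resolve, reject) => {
    const xhr = new XMLHttpRequest();
    xhr.open("GET", url);
    xhr.responseType = "document"; // after open(), before send()
    xhr.onload = () => {
      // response is null when the body could not be parsed as a document.
      if (xhr.response) resolve(xhr.response);
      else reject(new Error(`No document returned for ${url}`));
    };
    xhr.onerror = () => reject(new Error(`Request failed for ${url}`));
    xhr.send();
  });
}

// Intended use (inside an async function):
//   const doc = await fetchDocument("https://example.com/");
//   const hrefs = Array.from(doc.links, (a) => a.href);
```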

Neha Soni On

If you can inject a <base> element into the DOM tree, that would be the easiest approach.
Alternatively, you can use the URL constructor to build an absolute URL from the base URL of the source document. For example:

const base = new URL('https://example.com');
const html = '<!DOCTYPE html><html><body><p><a href="/foo">hello!</a></p></body></html>';
const parser = new DOMParser();
const doc = parser.parseFromString(html, 'text/html');
const link = doc.querySelector('a');
// Read the raw attribute: link.href is already resolved against the ambient document.
const href = link.getAttribute('href');
const absUrl = new URL(href, base).href;
console.log(absUrl); // output: "https://example.com/foo"

In this way, you can ensure that relative links are resolved correctly without having to insert a <base> element into the DOM tree or use a userland HTML parser and DOM implementation.