How can I parse HTML into a DOM tree, taking into account its origin location?

Question

Asked by user3840170 on 26 March 2023 at 11:23

I am writing a user script that runs on https://example.net and makes fetch requests for HTML documents from https://example.com that I want to parse into HTML DOM trees.

The fetch API only gives me the raw HTML source. I can parse it myself using DOMParser, but I run into a problem with relative links. Suppose the document from https://example.com contains something like this:

<!DOCTYPE html>
<html>
  <head>
  <body>
    <p> <a href="/foo">hello!</a>

If I obtain the DOM node for that body > p > a element and read its href property, the value I obtain will be https://example.net/foo. This is because DOMParser assigns the source location of the ambient document to the parsing result. I want to assign it the actual source of the document so that relative links resolve correctly.

Right now the only workarounds I can think of are:

- Inject a <base> element into the DOM tree, which may interfere with a <base> tag present in the actual HTML source.
- Use document.implementation.createHTMLDocument() and then .write(), which gives me a document with a blank source location, where relative links are at least not resolved incorrectly (but will not be resolved at all). Except this doesn't work in a user script: it throws a SecurityError.
- Use Proxy to intercept accesses to the href property, which seems too heavyweight to comfortably fit in a user script.
- Include a userland HTML parser and DOM implementation, which again seems too burdensome.
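
To make the first workaround concrete, here is a rough sketch of how a <base> could be injected while still respecting one that is already present in the source. The helper names resolveBaseHref and rebase are hypothetical, and sourceUrl is assumed to be the URL the HTML was fetched from. An existing <base href> is resolved against the source URL instead of being overwritten.

```javascript
// Hypothetical helper: compute the base URL the document would have had
// if it had been loaded from sourceUrl.
// rawBaseHref is the href attribute of an existing <base>, or null if none.
function resolveBaseHref(rawBaseHref, sourceUrl) {
  if (rawBaseHref === null) {
    return new URL(sourceUrl).href;
  }
  try {
    return new URL(rawBaseHref, sourceUrl).href;
  } catch {
    // Unparseable <base href>: fall back to the source URL.
    return new URL(sourceUrl).href;
  }
}

// Hypothetical helper: rewrite or insert a <base> element so that a
// DOMParser-produced document resolves links against sourceUrl.
function rebase(doc, sourceUrl) {
  const existing = doc.querySelector('base[href]');
  const raw = existing ? existing.getAttribute('href') : null;
  const resolved = resolveBaseHref(raw, sourceUrl);
  if (existing) {
    existing.setAttribute('href', resolved);
  } else {
    const base = doc.createElement('base');
    base.setAttribute('href', resolved);
    doc.head.prepend(base);
  }
  return doc;
}
```

After rebase(doc, 'https://example.com/'), reading the href property of the link in the sample above should give https://example.com/foo. The drawback remains that the tree no longer matches the source exactly.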

I also realise that parsing HTML from Unicode text obtained by .text() will bypass the HTML encoding detection algorithm. I can live with that myself, because the site I am interested in exclusively uses UTF-8 correctly denoted in headers, but this is also a flaw that should be noted. Ideally, an HTML document ought to be parsed directly from a Blob or even a ReadableStream.
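
As a partial mitigation, the response bytes could be decoded manually using the charset declared in the Content-Type header. The sketch below covers only that header case; it does not implement the full HTML encoding sniffing algorithm (no BOM handling beyond what TextDecoder does, no <meta charset> prescan). The helper names charsetFromContentType and decodeHtml are hypothetical.

```javascript
// Hypothetical helper: extract the charset parameter from a Content-Type
// header value, or return null if there is none.
function charsetFromContentType(contentType) {
  const match = /charset\s*=\s*"?([^\s;"]+)/i.exec(contentType || '');
  return match ? match[1].toLowerCase() : null;
}

// Hypothetical helper: decode raw bytes using the declared charset,
// falling back to UTF-8 when the label is missing or not recognised.
function decodeHtml(bytes, contentType) {
  const label = charsetFromContentType(contentType) || 'utf-8';
  let decoder;
  try {
    decoder = new TextDecoder(label);
  } catch {
    decoder = new TextDecoder('utf-8');
  }
  return decoder.decode(bytes);
}

// Possible usage with fetch (inside an async function):
//   const res = await fetch(url);
//   const html = decodeHtml(await res.arrayBuffer(), res.headers.get('content-type'));
```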

Is there a better way to accomplish what I want?

There are 2 best solutions below

Neha Soni On 30 March 2023 at 12:43

If you can inject a <base> element into the DOM tree, that would be the easiest approach.
Otherwise, you can use the URL object to resolve each link against the base URL of the source document. For example:

// URL the HTML was fetched from; relative links are resolved against it.
const base = new URL('https://example.com');
const html = '<!DOCTYPE html><html><body><p><a href="/foo">hello!</a></p></body></html>';
const parser = new DOMParser();
const doc = parser.parseFromString(html, 'text/html');
const link = doc.querySelector('a');
// Read the raw attribute: link.href is already resolved against the ambient document.
const href = link.getAttribute('href');
const absUrl = new URL(href, base).href;
console.log(absUrl); // output: "https://example.com/foo"

This way, you can ensure that relative links are resolved correctly without having to insert a <base> element into the DOM tree or use a userland HTML parser and DOM implementation.
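
The same idea can be applied to every link in the parsed document. The following is only a sketch: absoluteLinks is a hypothetical helper name, and it assumes the source document has no <base> element of its own, since it resolves everything directly against baseUrl.

```javascript
// Hypothetical helper: resolve every <a href> in a parsed document against
// baseUrl, reading the raw attribute rather than the pre-resolved .href property.
function absoluteLinks(doc, baseUrl) {
  const result = [];
  for (const a of doc.querySelectorAll('a[href]')) {
    const raw = a.getAttribute('href');
    try {
      result.push(new URL(raw, baseUrl).href);
    } catch {
      // Skip values that cannot be parsed as a URL.
    }
  }
  return result;
}
```

It could be called as absoluteLinks(doc, 'https://example.com/') on the result of parseFromString().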

**Dheeraj Vepakomma** · Accepted Answer · 2023-04-01T05:11:51.943000

Instead of using fetch, use XMLHttpRequest, which has the built-in capability to parse HTML into a Document.

You have to explicitly request a document by assigning the string "document" to the responseType property of the XMLHttpRequest object after calling open() but before calling send().

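const xhr = new XMLHttpRequest();
xhr.onload = () => {
  // With responseType "document", responseXML holds the parsed Document.
  console.log(
    Array.from(xhr.responseXML.links).map(({ href }) => href)
  );
};
xhr.open("GET", "https://example.com");
// Set after open() and before send(), as described above.
xhr.responseType = "document";
xhr.send();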

In my tests, relative URLs are resolved to absolute URLs based on the location of the source document.
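
If a promise-based interface similar to fetch is preferred, the request could be wrapped along these lines. This is a sketch, not a tested user-script recipe: fetchDocument is a hypothetical helper name, the XHR parameter is only there so that a stand-in constructor can be supplied outside a browser, and the cross-origin request is assumed to be permitted.

```javascript
// Hypothetical helper: request a URL and resolve with the parsed Document.
// XHR defaults to the browser's XMLHttpRequest.
function fetchDocument(url, XHR = XMLHttpRequest) {
  return new Promise((resolve, reject) => {
    const xhr = new XHR();
    xhr.onload = () => {
      // responseXML is null if the response could not be parsed as a document.
      if (xhr.status >= 200 && xhr.status < 300 && xhr.responseXML) {
        resolve(xhr.responseXML);
      } else {
        reject(new Error(`Request for ${url} failed with status ${xhr.status}`));
      }
    };
    xhr.onerror = () => reject(new Error(`Network error while requesting ${url}`));
    xhr.open('GET', url);
    xhr.responseType = 'document'; // after open(), before send()
    xhr.send();
  });
}

// Possible usage:
//   fetchDocument('https://example.com')
//     .then((doc) => console.log(Array.from(doc.links, ({ href }) => href)));
```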
