I am writing a user script that runs on https://example.net and makes fetch requests for HTML documents from https://example.com that I want to parse into HTML DOM trees.
The fetch API only gives me the raw HTML source. I can parse it myself using DOMParser, but I run into a problem with relative links. Suppose the document from https://example.com contains something like this:
<!DOCTYPE html>
<html>
<head>
<body>
<p> <a href="/foo">hello!</a>
If I obtain the DOM node for that body > p > a element and read its href property, the value I obtain will be https://example.net/foo. This is because DOMParser assigns the source location of the ambient document to the parsing result. I want to assign it the actual source of the document so that relative links resolve correctly.
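The mismatch can be illustrated without any DOM at all, since reading href amounts to resolving the attribute value against the document's base URL. The following is only a sketch using the standard URL constructor; the two base URLs assume that both pages sit at the root of their sites, and the variable names are made up for illustration.

```javascript
// The attribute value exactly as written in the fetched HTML.
const hrefAttribute = "/foo";

// Assumed base URL inherited by the DOMParser result:
// the page on which the user script is running.
const ambientBase = "https://example.net/";

// Assumed base URL the link was written against:
// the page the HTML was actually fetched from.
const sourceBase = "https://example.com/";

// What reading a.href yields on the DOMParser result.
const resolvedByDomParser = new URL(hrefAttribute, ambientBase).href;

// What the link was meant to point at.
const intended = new URL(hrefAttribute, sourceBase).href;

console.log(resolvedByDomParser); // https://example.net/foo
console.log(intended);            // https://example.com/foo
```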
Right now the only workarounds I can think of are:
- inject a <base> element into the DOM tree, which may interfere with a <base> tag present in the actual HTML source
- use document.implementation.createHTMLDocument() and then .write(), which gives me a document with a blank source location, where relative links are at least not resolved incorrectly (but will not be resolved at all). Except this doesn't work in a user script: it throws a SecurityError.
- use Proxy to intercept accesses to the href property, which seems too heavyweight to comfortably fit in a user script
- include a userland HTML parser and DOM implementation, which again seems too burdensome
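For reference, the first workaround might look roughly like the sketch below. The helper name ensureBase is hypothetical, and the way it handles a <base> already present in the source (resolving that element's href against the real source URL and placing the new element first in <head>, since the first <base> with an href is the one that takes effect) is just one possible policy, not an established recipe.

```javascript
// Hypothetical helper: make relative URLs in `doc` resolve against `sourceUrl`.
// `doc` is expected to be a Document produced by DOMParser with "text/html".
function ensureBase(doc, sourceUrl) {
  // A <base href> declared by the source itself, if any.
  const existing = doc.querySelector("base[href]");

  const base = doc.createElement("base");
  // Honour the source's own <base>, but resolve it against the real
  // source URL instead of the ambient document's URL.
  base.href = existing
    ? new URL(existing.getAttribute("href"), sourceUrl).href
    : sourceUrl;

  // The first <base> with an href in tree order is the one that applies,
  // so the new element goes at the very start of <head>.
  doc.head.prepend(base);
  return doc;
}

// Possible usage inside the user script (browser only, not executed here):
//   const response = await fetch("https://example.com/");
//   const doc = new DOMParser().parseFromString(await response.text(), "text/html");
//   ensureBase(doc, response.url);
```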
I also realise that parsing HTML from Unicode text obtained by .text() will bypass the HTML encoding detection algorithm. I can live with that myself, because the site I am interested in exclusively uses UTF-8 correctly denoted in headers, but this is also a flaw that should be noted. Ideally, an HTML document ought to be parsed directly from a Blob or even a ReadableStream.
Is there a better way to accomplish what I want?
Instead of using fetch, use XMLHttpRequest, which has the built-in capability to parse HTML into a Document.
You have to explicitly request a document by assigning the string "document" to the responseType property of the XMLHttpRequest object after calling open() but before calling send().
In my tests relative URLs are converted to absolute URLs based on the source document.
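A minimal sketch of this approach, wrapped in a Promise so it can stand in for the fetch call. The helper name fetchDocument is made up for illustration, and the sketch assumes the cross-origin request is permitted in the first place, as it already must be for fetch to work.

```javascript
// Hypothetical helper: request `url` and let the browser parse the response
// into a Document, whose URL is that of the fetched resource.
function fetchDocument(url) {
  return new Promise((resolve, reject) => {
    const xhr = new XMLHttpRequest();
    xhr.open("GET", url);
    // Set after open() and before send(), as described above.
    xhr.responseType = "document";

    xhr.onload = () => {
      // xhr.response is null when the body could not be parsed as a document.
      if (xhr.status >= 200 && xhr.status < 300 && xhr.response) {
        resolve(xhr.response);
      } else {
        reject(new Error(`Request for ${url} failed with status ${xhr.status}`));
      }
    };
    xhr.onerror = () => reject(new Error(`Network error while requesting ${url}`));

    xhr.send();
  });
}

// Possible usage inside the user script (browser only, not executed here):
//   const doc = await fetchDocument("https://example.com/");
//   console.log(doc.querySelector("body > p > a").href);
```

Because the document is produced by the browser from the response itself, the encoding detection concern raised in the question should not apply here either, though that is worth verifying against the target site.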