How to parse HTML with source mapping?


I want to use Python to parse HTML markup, and given one of the resultant DOM tree elements, get the start and end offsets of that element within the original, unmodified markup.

For example, given the HTML markup (with \n EOL chars)

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-gb">
  <head>
    <title>No Longer Human</title>
    <meta content="urn:uuid:6757faf0-eef1-45d9-b2b3-7462350db7ba" name="Adept.expected.resource"/>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <link href="kindle:flow:0002?mime=text/css" rel="stylesheet" type="text/css"/>
  <link href="kindle:flow:0001?mime=text/css" rel="stylesheet" type="text/css"/>
  </head>
  <body class="calibre" aid="0">
  </body>
</html>

(example with BeautifulSoup, but I'm not attached to any parser in particular)

>>> soup = bs4.BeautifulSoup(html_markup)
>>> title_tag = soup.find('title')
>>> get_offsets_in_markup(title_tag)  # <-------- how do I go about doing this?
(109, 139)  # <----- source mapping info I want to get
>>> html_markup[109:139]
'<title>No Longer Human</title>'

I don't see this functionality in the APIs of any of the Python HTML parsers available. Can I hack it into one of the existing parsers? How would I go about doing that? Or is there another, better approach?

I realize that str(soup_element) serializes the element back into markup (and I could hypothetically recurse down the tree, saving the start and end indices as I go), but the markup returned that way, although semantically equivalent to the original, doesn't match the original character for character. None of the available Python parsers reproduce the original markup exactly.

Rustam Garayev On

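You can use a regular expression to find the start and end indexes of the element's serialized form in the original string, then slice the original string with those indexes. This only works when the serialized element matches the original markup character for character (as it does for the title tag here), and it returns the first occurrence if the same markup appears more than once: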

import re
from bs4 import BeautifulSoup
from pathlib import Path

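def get_offsets_in_markup(tag, html_markup):
    # Escape the serialized tag so that regex metacharacters match literally.
    elem = re.search(re.escape(str(tag)), html_markup)
    if elem is None:  # the serialization differs from the original markup
        raise ValueError('tag not found verbatim in the original markup')
    return elem.start(), elem.end()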

html_markup = Path('test.html').read_text()
soup = BeautifulSoup(html_markup, 'lxml')

title_tag = soup.find('title')

indexes = get_offsets_in_markup(title_tag, html_markup)
# -> (109, 139)
given_text = html_markup[indexes[0]:indexes[1]]
# -> <title>No Longer Human</title>

This is what test.html looks like:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-gb">
  <head>
    <title>No Longer Human</title>
    <meta content="urn:uuid:6757faf0-eef1-45d9-b2b3-7462350db7ba" name="Adept.e$
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <link href="kindle:flow:0002?mime=text/css" rel="stylesheet" type="text/css"/>
  <link href="kindle:flow:0001?mime=text/css" rel="stylesheet" type="text/css"/>
  </head>
  <body class="calibre" aid="0">
  </body>
</html>
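
If the serialized element does not match the original markup exactly, searching for it will fail. One possible alternative, sketched here rather than offered as a definitive implementation, is to record offsets while parsing with the standard library's html.parser, whose getpos() reports the line and column of the tag currently being handled. The names OffsetParser and element_spans are hypothetical helpers invented for this sketch, and the sample markup is a shortened stand-in for test.html. It assumes the whole document is fed in a single call and that every element is either explicitly closed or self-closing; unclosed tags such as a bare <br> get no span.

```python
from html.parser import HTMLParser


class OffsetParser(HTMLParser):
    """Collects (tag, start, end) offsets into the original markup."""

    def __init__(self, markup):
        super().__init__()
        self.markup = markup
        # Absolute offset at which each line begins: (line, column) -> offset.
        self.line_starts = [0]
        for i, ch in enumerate(markup):
            if ch == "\n":
                self.line_starts.append(i + 1)
        self.open_tags = []  # stack of (tag, start_offset)
        self.spans = []      # finished (tag, start_offset, end_offset)

    def _offset(self):
        # getpos() returns a 1-based line number and a 0-based column.
        line, col = self.getpos()
        return self.line_starts[line - 1] + col

    def handle_starttag(self, tag, attrs):
        self.open_tags.append((tag, self._offset()))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tag such as <meta ... />: the span is the tag itself.
        start = self._offset()
        self.spans.append((tag, start, start + len(self.get_starttag_text())))

    def handle_endtag(self, tag):
        # The element ends just after the '>' that closes the end tag.
        end = self.markup.index(">", self._offset()) + 1
        # Pop back to the nearest matching open tag, dropping unclosed ones.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                self.spans.append((tag, self.open_tags[i][1], end))
                del self.open_tags[i:]
                break


def element_spans(markup):
    parser = OffsetParser(markup)
    parser.feed(markup)
    parser.close()
    return parser.spans


markup = (
    "<html>\n"
    "  <head>\n"
    "    <title>No Longer Human</title>\n"
    '    <meta name="a"/>\n'
    "  </head>\n"
    "</html>\n"
)

for tag, start, end in element_spans(markup):
    print(tag, (start, end), repr(markup[start:end]))
```

Because the offsets come from the parser's own position tracking, the slices are taken from the unmodified input and never depend on how the element would be re-serialized. Mapping these spans back onto BeautifulSoup elements is left out of the sketch.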