How to parse HTML with source mapping?

463 Views Asked by midrare At 26 September 2021 at 04:23

I want to use Python to parse HTML markup, and given one of the resultant DOM tree elements, get the start and end offsets of that element within the original, unmodified markup.

For example, given the HTML markup (with \n EOL chars)

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-gb">
  <head>
    <title>No Longer Human</title>
    <meta content="urn:uuid:6757faf0-eef1-45d9-b2b3-7462350db7ba" name="Adept.expected.resource"/>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <link href="kindle:flow:0002?mime=text/css" rel="stylesheet" type="text/css"/>
  <link href="kindle:flow:0001?mime=text/css" rel="stylesheet" type="text/css"/>
  </head>
  <body class="calibre" aid="0">
  </body>
</html>

(example with BeautifulSoup, but I'm not attached to any parser in particular)

>>> soup = bs4.BeautifulSoup(html_markup)
>>> title_tag = soup.find('title')
>>> get_offsets_in_markup(title_tag)  # <-------- how do I go about doing this?
(109, 139)  # <----- source mapping info I want to get
>>> html_markup[109:139]
'<title>No Longer Human</title>'

I don't see this functionality in the APIs of any of the Python HTML parsers available. Can I hack it into one of the existing parsers? How would I go about doing that? Or is there another, better approach?

I realize that str(soup_element) serializes the element back into markup (and I can hypothetically recurse down the tree saving the start and end indices as I go), but the markup returned by doing that, although semantically equivalent to the original, doesn't match the original char-for-char. None of the available Python parsers do.

There is 1 solution below

Rustam Garayev On 26 September 2021 at 07:57

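You can use a regular expression to find the element's start and end indexes, then use those indexes to slice the original string. This only works when str(tag) reproduces the original markup character for character, as it does for the title tag in this example: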

import re
from bs4 import BeautifulSoup
from pathlib import Path

def get_offsets_in_markup(tag, html_markup):
    # Escape the serialized tag so characters like "?" or "." match literally.
    elem = re.search(re.escape(str(tag)), html_markup)
    return elem.start(), elem.end()

html_markup = Path('test.html').read_text()
soup = BeautifulSoup(html_markup, 'lxml')

title_tag = soup.find('title')

indexes = get_offsets_in_markup(title_tag, html_markup)
# -> (109, 139)
given_text = html_markup[indexes[0]:indexes[1]]
# -> <title>No Longer Human</title>
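
One caveat with the regex lookup: a serialized tag often contains regex metacharacters such as "?" or ".", so using it as a pattern without escaping can fail to match. The stdlib-only sketch below illustrates this with a shortened version of one of the link tags from the question; the `markup` and `needle` strings are made up for the illustration.

```python
import re

# A shortened link tag; the "?" in the href is a regex metacharacter.
markup = '<head>\n  <link href="kindle:flow:0002?mime=text/css" rel="stylesheet"/>\n</head>'
needle = '<link href="kindle:flow:0002?mime=text/css" rel="stylesheet"/>'

# Unescaped, "2?" means "an optional 2", so the literal "?" is never matched.
unescaped = re.search(needle, markup)

# re.escape makes every character in the needle match literally.
match = re.search(re.escape(needle), markup)
span = (match.start(), match.end())

# str.find locates the same literal text without involving regex at all.
start = markup.find(needle)

print(unescaped, span, start)
```

For a purely literal lookup, str.find is arguably the simpler choice, since no escaping is needed.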

This is what test.html looks like:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-gb">
  <head>
    <title>No Longer Human</title>
    <meta content="urn:uuid:6757faf0-eef1-45d9-b2b3-7462350db7ba" name="Adept.e$
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <link href="kindle:flow:0002?mime=text/css" rel="stylesheet" type="text/css"/>
  <link href="kindle:flow:0001?mime=text/css" rel="stylesheet" type="text/css"/>
  </head>
  <body class="calibre" aid="0">
  </body>
</html>
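
If the serialized tag does not match the original text, a different approach is to record offsets while parsing. The following is a minimal sketch, not a drop-in replacement for BeautifulSoup: it uses the standard library's html.parser.HTMLParser, whose getpos() method reports the line and column of the tag currently being handled, and converts that to an absolute offset. The names `OffsetParser` and `element_spans` are hypothetical, and the sketch assumes reasonably well-formed markup that is fed to the parser in a single call.

```python
from html.parser import HTMLParser


class OffsetParser(HTMLParser):
    """Collect (tag, start, end) offsets of elements in the raw markup."""

    def __init__(self, markup):
        super().__init__()
        self._markup = markup
        # Absolute offset of the first character of each line.
        # HTMLParser counts lines by "\n" only, so do the same here.
        self._line_starts = [0]
        for i, ch in enumerate(markup):
            if ch == "\n":
                self._line_starts.append(i + 1)
        self._open = []   # stack of (tag, start_offset) for unclosed elements
        self.spans = []   # (tag, start, end) for every closed element

    def _offset(self):
        # getpos() returns a 1-based line number and a 0-based column.
        lineno, col = self.getpos()
        return self._line_starts[lineno - 1] + col

    def handle_starttag(self, tag, attrs):
        self._open.append((tag, self._offset()))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tag such as <meta .../>: its span is the tag text itself.
        start = self._offset()
        self.spans.append((tag, start, start + len(self.get_starttag_text())))

    def handle_endtag(self, tag):
        # The current position is the "<" of the end tag; the element ends
        # just after the next ">".
        end = self._markup.index(">", self._offset()) + 1
        # Close the nearest open element with the same (lowercased) name and
        # drop any unclosed elements nested inside it.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                self.spans.append((tag, self._open[i][1], end))
                del self._open[i:]
                break


def element_spans(markup):
    parser = OffsetParser(markup)
    parser.feed(markup)
    parser.close()
    return parser.spans


markup = "<head>\n  <title>No Longer Human</title>\n</head>\n"
for tag, start, end in element_spans(markup):
    print(tag, (start, end), repr(markup[start:end]))
```

Because the offsets come straight from the parser's position in the input, slicing the original string with them returns the original characters rather than a re-serialized version. Elements that are never explicitly closed are not reported by this sketch.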
