Parse self-closing tags missing the '/'


I'm trying to parse some old SGML code using BeautifulSoup4 and build an Element Tree with the data. It's mostly working fine, but some of the tags that should be self-closing aren't marked as such. For example:

<element1>
    <element2 attr="0">
    <element3>Data</element3>
</element1>

When I parse the data, it ends up like:

<element1>
    <element2 attr="0">
        <element3>Data</element3>
    </element2>
</element1>

What I'd like is for the parser to assume that an element with no closing tag is self-closing, instead of treating everything after it as a child and placing the closing tag as late as possible, like so:

<element1>
    <element2 attr="0"/>
    <element3>Data</element3>
</element1>

Can anyone point me to a parser that could do this, or some way to modify an existing one to act this way? I've dug through a few parsers (lxml, lxml-xml, html5lib) but I can't figure out how to get these results.

Ahndwoo On BEST ANSWER

What I ended up doing was reading the DTD and extracting every empty element whose end tag can be omitted (e.g. <!ELEMENT elem_name - o EMPTY >), building a list of those element names, and then using a regex to self-close every tag in the list. The resulting text is then passed to the XML parser.
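The DTD step could be sketched roughly as follows. This is an illustration, not the exact code I ran: the function name find_empty_tags and the sample DTD are made up, and the pattern assumes simple declarations of the form shown above (a single name or a name group like (a|b), no parameter entities or comments inside the declaration).

```python
import re

# Assumed declaration shape: <!ELEMENT name - o EMPTY> or <!ELEMENT (a|b) - o EMPTY>.
# The second minimization flag must be "o", meaning the end tag may be omitted.
EMPTY_DECL = re.compile(
    r'<!ELEMENT\s+(\([^)]*\)|[\w.-]+)\s+[-o]\s+o\s+EMPTY\s*>',
    re.IGNORECASE,
)

def find_empty_tags(dtd_text):
    """Return the names of EMPTY elements whose end tag may be omitted."""
    names = []
    for match in EMPTY_DECL.finditer(dtd_text):
        # A name group such as (elem2|elem3) yields several names at once.
        names.extend(re.findall(r'[\w.-]+', match.group(1)))
    return names

# Hypothetical DTD fragment, for illustration only.
sample_dtd = """
<!ELEMENT elem1 - o EMPTY >
<!ELEMENT (elem2|elem3) - O EMPTY>
<!ELEMENT para - - (#PCDATA)>
"""
empty_tags = find_empty_tags(sample_dtd)
print(empty_tags)  # ['elem1', 'elem2', 'elem3']
```

A real DTD that relies on parameter entities for element names would need those expanded first, which this sketch does not attempt.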

Here's a boiled down version of what I'm doing:

import re
from lxml.html import soupparser
from lxml import etree as ET

empty_tags = ['elem1', 'elem2', 'elem3']

markup = """
<elem1 attr="some value">
<elem2/>
<elem3></elem3>
"""

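for t in empty_tags:
    # Group 1 captures the tag name plus any attributes. The '>' that follows,
    # any whitespace, and an immediately following end tag are replaced by '/>'.
    pattern = r'(<{0}(?:\s+[^>/]*)?)>\s*(?:</{0}>)?\n?'.format(re.escape(t))
    markup = re.sub(pattern, r'\1/>\n', markup)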

tree = soupparser.fromstring(markup)
print(ET.tostring(tree, pretty_print=True).decode("utf-8"))

The output should be:

<elem1 attr="some value"/>
<elem2/>
<elem3/>

(This will actually be enclosed in <html> tags, but the parser adds those in.)

It will leave attributes alone, and won't touch tags that are already self-closed. If the tag has a closing tag, but is empty, it will remove the closing tag and self-close the tag instead, just so it's standardized.
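To tie this back to the markup in the question, here is a small self-contained variant using only the standard library. It is a sketch under assumptions: close_empty_tags is a hypothetical helper name, element2 is assumed to be declared EMPTY in the DTD, and attribute values are assumed not to contain '/' or '>'. Unlike the snippet above, it leaves the whitespace after the tag untouched.

```python
import re
import xml.etree.ElementTree as ET

def close_empty_tags(markup, empty_tags):
    """Self-close every start tag whose name is in empty_tags."""
    for name in empty_tags:
        # Group 1 is the tag name plus attributes; an immediately following
        # end tag, if present, is dropped. Already self-closed tags do not
        # match, because '/' is excluded from the attribute run.
        pattern = r'(<{0}(?:\s+[^>/]*)?)>(?:\s*</{0}>)?'.format(re.escape(name))
        markup = re.sub(pattern, r'\1/>', markup)
    return markup

# The question's example, with element2 assumed to be an EMPTY element.
sgml = '<element1><element2 attr="0"><element3>Data</element3></element1>'
fixed = close_empty_tags(sgml, ['element2'])
print(fixed)

# The result is well-formed XML, so a strict parser accepts it.
root = ET.fromstring(fixed)
child_tags = [child.tag for child in root]
print(child_tags)  # ['element2', 'element3']
```

Running the function a second time on its own output changes nothing, so it is safe to apply to input that is already partly cleaned up.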

It's not a very generic solution but, as far as I can tell, there's no other way to do this without knowing which tags should be closed. Even OpenSP needs the DTD to know which tags it should be closing.