lxml cannot parse any HTML content that contains the character .
The Python code below fails to find the HTML element by XPath. Furthermore, the result of etree.tostring(root) contains many extra whitespace characters.
Code:
from lxml import html, etree
text = """<div id="content">
</div>
"""
root = html.document_fromstring(text)
print(etree.tostring(root))
content = root.xpath("//div[@id='content']")
print(content)
Output:
b'<html><body><p>d i v i d = " c o n t e n t " > \n 1\x14/p></body></html>'
[]
Update: I believe this is due to an lxml bug, which has been fixed in lxml 4.4.3. However, after checking lxml's changelog and commit history between 4.4.2 and 4.4.3, I still don't know the root cause.
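For anyone who cannot upgrade right away, one thing that may be worth trying is to pass lxml UTF-8 bytes together with an explicit parser encoding instead of a Python str. This is only a sketch, not a confirmed fix for the bug above: the garbled output looks like the str buffer being read with the wrong encoding, and the assumption here is that bytes input avoids that code path. The character in `text` is a hypothetical stand-in, since the original one is not reproduced here.

```python
from lxml import html

# Hypothetical stand-in for the problematic character (an assumption,
# not the character from the original report).
text = '<div id="content">\u00e9</div>'

# Assumption: encoding to bytes and naming the encoding explicitly keeps
# lxml from guessing the encoding of a str buffer.
parser = html.HTMLParser(encoding="utf-8")
root = html.document_fromstring(text.encode("utf-8"), parser=parser)

# Same XPath query as in the question.
content = root.xpath("//div[@id='content']")
print(content)
```

If the list is still empty with this variant, the encoding guess is probably not the cause on your platform, and upgrading to lxml 4.4.3 or later remains the safer route.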
ElementTree based working solution below
output
x