Parsel is not able to access nested elements

I am working with Parsel. Unfortunately, I am not able to select an <a> tag that is a child of another <a> tag (I know that nesting <a> inside <a> is not valid HTML). How can I handle this situation with Parsel? I have already solved the problem using Beautiful Soup with html.parser as the backend (Beautiful Soup with lxml does not work either).

from parsel import Selector

html_text = '''
<html>
    <head>
    <base href='http://example.com/' />
    <title>Example website</title>
    </head>
    <body>
    <a href="#">
        <a id="test" href='image1.html'>Name: My image 1 <br /></a>
        <a id="test" href='image2.html'>Name: My image 2 <br /></a>
        <a id="test" href='image3.html'>Name: My image 3 <br /></a>
        <a id="test" href='image4.html'>Name: My image 4 <br /></a>
        <a id="test" href='image5.html'>Name: My image 5 <br /></a>
    </a>
    </body>
    </html>
'''

selector = Selector(text=html_text)
print(selector.xpath('//a/a'))  # the resulting parsel.selector.SelectorList is empty

If I put the <a> tags inside a <div> instead, everything works fine, as in the example below:

from parsel import Selector

html_text = '''
<html>
    <head>
    <base href='http://example.com/' />
    <title>Example website</title>
    </head>
    <body>
    <div>
        <a id="test" href='image1.html'>Name: My image 1 <br /></a>
        <a id="test" href='image2.html'>Name: My image 2 <br /></a>
        <a id="test" href='image3.html'>Name: My image 3 <br /></a>
        <a id="test" href='image4.html'>Name: My image 4 <br /></a>
        <a id="test" href='image5.html'>Name: My image 5 <br /></a>
    </div>
    </body>
    </html>
'''

selector = Selector(text=html_text)
print(selector.xpath('//div/a'))  # the resulting parsel.selector.SelectorList is not empty

Answer by Andrej Kesely (best answer)

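The lxml HTML parser that Parsel uses by default "fixes" the invalid markup and moves the inner <a> tags out of the outer one, so //a/a no longer matches anything. Try specifying type="xml" when instantiating the Selector, so that the document is parsed as XML and the nesting is kept as written: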

from parsel import Selector

html_text = """
<html>
    <head>
    <base href='http://example.com/' />
    <title>Example website</title>
    </head>
    <body>
    <a href="#">
        <a id="test" href='image1.html'>Name: My image 1 <br /></a>
        <a id="test" href='image2.html'>Name: My image 2 <br /></a>
        <a id="test" href='image3.html'>Name: My image 3 <br /></a>
        <a id="test" href='image4.html'>Name: My image 4 <br /></a>
        <a id="test" href='image5.html'>Name: My image 5 <br /></a>
    </a>
    </body>
    </html>
"""

selector = Selector(text=html_text, type="xml")
# To see how Parsel parsed the document, uncomment:
# print(selector.getall()[0])
print(selector.xpath("//a/a"))

Prints:

[
 <Selector query='//a/a' data='<a id="test" href="image1.html">Name:...'>,
 <Selector query='//a/a' data='<a id="test" href="image2.html">Name:...'>,
 <Selector query='//a/a' data='<a id="test" href="image3.html">Name:...'>,
 <Selector query='//a/a' data='<a id="test" href="image4.html">Name:...'>,
 <Selector query='//a/a' data='<a id="test" href="image5.html">Name:...'>
]