
Jsoup doesn't detect closing </div> tags in the list

I have an HTML string with a <div style="position: relative;"> </div> that contains a list of ~200 elements, like this:

 <div style="position: relative;">
    <div class="episode-name" style="position: absolute; top: 5px;left: 10px;font-size: 30px;" data-id="0_0_0_0">1. Some Text</div>
    <div class="episode-name" style="position: absolute; top: 5px;left: 10px;font-size: 30px;" data-id="0_0_0_0">2. Some Text</div>
    <div class="episode-name" style="position: absolute; top: 5px;left: 10px;font-size: 30px;"data-id="0_0_0_0">3. Some Text</div>
    ...
    <div class="episode-name" style="position: absolute; top: 5px;left: 10px;font-size: 30px;"data-id="0_0_0_0">200. Some Text</div>
 </div>

I parse it with:

Document document = Jsoup.parse(html);

I expect to get a document with a list, like this:

<html>
 <head></head>
 <body>
   <div style="position: relative;">
      <div class="episode-name" style="position: absolute; top: 5px;left: 10px;font-size: 30px;" data-id="0_0_0_0">1. Some Text</div>
      <div class="episode-name" style="position: absolute; top: 5px;left: 10px;font-size: 30px;" data-id="0_0_0_0">2. Some Text</div>
      <div class="episode-name" style="position: absolute; top: 5px;left: 10px;font-size: 30px;" data-id="0_0_0_0">3. Some Text</div>
   </div>
 </body>
</html>

But Jsoup doesn't recognize the closing tags of the list elements and creates a document with dozens of nested divs instead of a list, like this:

<html>
 <head></head>
 <body>
  <div style="\&quot;position:" relative;\>
   <div class="\&quot;episode-name\&quot;" style="\&quot;position:" absolute; top: 5px;left: 10px;font-size: 30px;\ data-id="\&quot;0_0_0_0\&quot;">1. Some Unicode&lt;\/div&gt; <-- ORIGINAL </div> tag (is it recognized as text?)
    <div class="\&quot;episode-name\&quot;" style="\&quot;position:" absolute; top: 5px;left: 10px;font-size: 30px;\ data-id="\&quot;0_0_0_0\&quot;">2. Some Unicode&lt;\/div&gt;
     <div class="\&quot;episode-name\&quot;" style="\&quot;position:" absolute; top: 5px;left: 10px;font-size: 30px;\ data-id="\&quot;0_0_0_0\&quot;">3. Some Unicode&lt;\/div&gt;
       </div> <-- closing </div> generated by Jsoup at the end of the document instead of at the end of the item
     </div>
   </div>
  </div>
 </body>
</html>

This messes up the DOM and makes parsing extremely difficult.
How can I get Jsoup to parse this fragment correctly?

There are 2 best solutions below

NikMAX On BEST ANSWER

When I composed the string manually and passed it to Jsoup, I found the root of the problem.
In the HTML string that the network request returns, the opening <div> tags are normal, but the closing </div> tags of the list items are escaped: the list items contain <\/div> instead of </div>.
The list items look like this:

<div class=\"episode-name\" style=\"position: absolute; top: 5px;left: 10px;font-size: 30px;\" data-id=\"0_0_0_0\">1. Text<\/div> <--ESCAPED TAG

instead of this:

<div class=\"episode-name\" style=\"position: absolute; top: 5px;left: 10px;font-size: 30px;\" data-id=\"0_0_0_0\">1. Text</div> <-- NORMAL TAG

So Jsoup interprets those escaped closing tags as text and generates an incorrect DOM structure.

SOLUTION:
Checking for escaped tags before passing the HTML to Jsoup and replacing them with unescaped ones solved the problem. In my example I just use:

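html = html.replace("<\\/div>", "</div>"); // String.replace matches the literal text <\/div>; as a replaceAll regex, "<\\/div>" would match a plain </div>
document = Jsoup.parse(html);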
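
The broken output in the question also shows escaped quotes (\") inside the attributes, which suggests the whole payload is a JSON-escaped string rather than just the closing tags. Below is a minimal sketch, using only the standard library, that undoes both escapes before the string is handed to Jsoup. The class name HtmlUnescaper and the method unescapeJsonHtml are hypothetical names chosen for illustration, and the sketch assumes \/ and \" are the only escapes present; it does not handle \\, \n, or \uXXXX.

```java
public class HtmlUnescaper {

    // Hypothetical helper: undoes the two JSON-style escapes seen in the question.
    // String.replace matches literal text, so no regex escaping is involved.
    public static String unescapeJsonHtml(String raw) {
        return raw
                .replace("\\/", "/")     // \/  becomes  /
                .replace("\\\"", "\"");  // \"  becomes  "
    }

    public static void main(String[] args) {
        // The raw characters of this string are:
        // <div class=\"episode-name\" data-id=\"0_0_0_0\">1. Text<\/div>
        String raw = "<div class=\\\"episode-name\\\" data-id=\\\"0_0_0_0\\\">1. Text<\\/div>";

        // Prints: <div class="episode-name" data-id="0_0_0_0">1. Text</div>
        System.out.println(unescapeJsonHtml(raw));
    }
}
```

The result can then be passed to Jsoup.parse as before. If the response body is actually JSON, decoding it with a JSON parser and reading the HTML field from the result is likely the more robust route, since that handles every escape sequence at once.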

wannaBeDev On

I have no problems parsing that fragment using Jsoup 1.16.1. You should include the relevant part of your code if something is not working for you. This works for me:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Example {

    public static void main(String[] args) {

        String html = " <div style=\"position: relative;\">\n"
                      + "    <div class=\"episode-name\" style=\"position: absolute; top: 5px;left: 10px;font-size: 30px;\" data-id=\"0_0_0_0\">1. Some Text</div>\n"
                      + "    <div class=\"episode-name\" style=\"position: absolute; top: 5px;left: 10px;font-size: 30px;\" data-id=\"0_0_0_0\">2. Some Text</div>\n"
                      + "    <div class=\"episode-name\" style=\"position: absolute; top: 5px;left: 10px;font-size: 30px;\"data-id=\"0_0_0_0\">3. Some Text</div>\n"
                      + " </div>";

        Document document = Jsoup.parseBodyFragment(html);

        System.out.println(document.selectFirst("div.episode-name").text());
    }
}