crawler4j detects lines between the <script> </script> tag as text


 <html>
 <head>
  
 </head>      
 <body> 
  <div style="width: 100%;"> This question already
  </div> 
  <div id="player"> hi crawler4j </div> 
  <script>
 player = new Clappr.Player({source: "http://123.30.215.65/hls/4545780bfa790819/5/3/d836ad614748cdab11c9df291254cf836f21144da20bf08142455a8735b328ca/dnR2MQ==_m.m3u8",
   parentId: '#player',
   width: '100%', height: "100%",
      hideMediaControl: true,
      autoPlay: true
             }); 
 </script>   
 </body>
</html>


For the HTML page shown above, I do the following:

HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
String body = htmlParseData.getHtml();

crawler4j treats the lines between the <script> and </script> tags as text. I want to delete everything between the <script> and </script> tags in the body variable and then call getText(). Could you help me, please?

I want the output to be:

This question already

hi crawler4j

1 Answer

Answer by rzo1:

The HtmlParseData object of crawler4j does not contain the full DOM tree of the fetched HTML page. Instead, it holds the plain HTML as a String, which is what getHtml() returns.

If you want to remove the content between the <script> tags, you can either

  1. Use a regular expression to remove it, as described in this Stack Overflow post
  2. Use JSoup (which is already a dependency of crawler4j) to parse the DOM tree and remove the <script> tags from the resulting tree.
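
As a rough illustration of the first option, here is a minimal sketch that uses only the Java standard library. The class name `ScriptStripper` and the methods `removeScripts` and `visibleText` are hypothetical names chosen for this example; they are not part of crawler4j. The tag stripping is a crude stand-in for a real parser: it does not decode HTML entities and can be fooled by unusual markup, so treat it as a starting point rather than a robust solution.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of crawler4j.
public class ScriptStripper {

    // Matches a whole <script ...>...</script> element,
    // case-insensitively and across line breaks.
    private static final Pattern SCRIPT_BLOCK = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches any remaining tag; a crude substitute for real HTML parsing.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    /** Returns the HTML with every script element (tags and content) removed. */
    public static String removeScripts(String html) {
        return SCRIPT_BLOCK.matcher(html).replaceAll("");
    }

    /** Removes scripts, strips the remaining tags, and collapses whitespace. */
    public static String visibleText(String html) {
        String withoutScripts = removeScripts(html);
        String withoutTags = ANY_TAG.matcher(withoutScripts).replaceAll(" ");
        return withoutTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Shortened version of the page from the question.
        String html = "<html><body><div>This question already</div>"
                + "<div id=\"player\">hi crawler4j</div>"
                + "<script>player = new Clappr.Player({autoPlay: true});</script>"
                + "</body></html>";
        System.out.println(visibleText(html));
    }
}
```

In the crawler, `removeScripts(htmlParseData.getHtml())` would give the HTML without the script content. For the second option, the JSoup equivalent would be roughly `Document doc = Jsoup.parse(body); doc.select("script").remove(); String text = doc.text();`, which is usually the more reliable choice because it works on a parsed tree instead of pattern matching.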