crawler4j detects lines between the <script> </script> tag as text

75 Views Asked by mehmet akif kuş At 26 December 2019 at 07:22

 <html>
 <head>
  
 </head>      
 <body> 
  <div style="width: 100%;"> This question already
  </div> 
  <div id="player"> hi crawler4j </div> 
  <script>
 player = new Clappr.Player({source: "http://123.30.215.65/hls/4545780bfa790819/5/3/d836ad614748cdab11c9df291254cf836f21144da20bf08142455a8735b328ca/dnR2MQ==_m.m3u8",
   parentId: '#player',
   width: '100%', height: "100%",
      hideMediaControl: true,
      autoPlay: true
             }); 
 </script>   
 </body>
</html>

For the HTML page shown above, I do the following:

HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
String body = htmlParseData.getHtml();

crawler4j treats the lines between the <script> and </script> tags as text. I want to delete everything between the <script> and </script> tags in the body variable and then call getText(). Can you help me, please?

I want to print this:

This question already

hi crawler4j

There is 1 best solution below

rzo1 On 22 April 2020 at 11:46

HtmlParseData in crawler4j does not contain the full DOM tree of the fetched HTML page. Instead, the HtmlParseData object holds the page's plain HTML as a String.

If you want to remove the content between the <script> tags, you can either

Use a regular expression to remove it, as described in this Stack Overflow post
Use JSoup (which is already a dependency of crawler4j) to parse the DOM tree and remove the <script> tags from the resulting tree.
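
A minimal sketch of the regular-expression option, using only the Java standard library. The class name ScriptStripper and its two methods are hypothetical names chosen for illustration and are not part of crawler4j. It assumes reasonably well-formed markup; regular expressions are not a full HTML parser, so unusual input (for example a literal "</script>" inside a JavaScript string) can still trip it up.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of crawler4j.
final class ScriptStripper {

    // (?i) ignores case, (?s) lets '.' match line breaks.
    // The reluctant .*? stops at the first closing </script> tag.
    private static final Pattern SCRIPT_BLOCK =
            Pattern.compile("(?is)<script\\b[^>]*>.*?</script\\s*>");

    // Crude tag matcher, only used to turn the remaining markup into text.
    private static final Pattern ANY_TAG = Pattern.compile("(?s)<[^>]+>");

    /** Removes every <script>...</script> block, including the tags themselves. */
    static String removeScripts(String html) {
        return SCRIPT_BLOCK.matcher(html).replaceAll("");
    }

    /** Removes script blocks, then all remaining tags, and collapses whitespace. */
    static String toPlainText(String html) {
        String withoutScripts = removeScripts(html);
        String withoutTags = ANY_TAG.matcher(withoutScripts).replaceAll(" ");
        return withoutTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Shortened version of the page from the question.
        String html = "<html><body><div>This question already</div>"
                + "<div id=\"player\"> hi crawler4j </div>"
                + "<script>\nplayer = new Clappr.Player({autoPlay: true});\n</script>"
                + "</body></html>";
        // prints: This question already hi crawler4j
        System.out.println(toPlainText(html));
    }
}
```

In the crawler from the question, this would be called as ScriptStripper.toPlainText(htmlParseData.getHtml()). For the JSoup option, the usual pattern is to parse the string with Jsoup.parse(html), call select("script").remove() on the resulting document, and then read text(); that route copes better with malformed HTML than a regular expression does.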
