<html>
<head>
</head>
<body>
<div style="width: 100%;"> This question already
</div>
<div id="player"> hi crawler4j </div>
<script>
player = new Clappr.Player({source: "http://123.30.215.65/hls/4545780bfa790819/5/3/d836ad614748cdab11c9df291254cf836f21144da20bf08142455a8735b328ca/dnR2MQ==_m.m3u8",
parentId: '#player',
width: '100%', height: "100%",
hideMediaControl: true,
autoPlay: true
});
</script>
</body>
</html>
For the example page above, I do the following:
HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
String body = htmlParseData.getHtml();
crawler4j treats the lines between the <script> and </script> tags as text.
I want to remove everything between the <script> and </script> tags from the body variable and then call getText().
Can you help me, please?
I want to print this:
This question already
hi crawler4j
The HtmlParseData of crawler4j does not contain the full DOM tree of the fetched HTML page. For this reason, the HtmlParseData object holds the page only as plain HTML in its String representation.
If you want to remove the content between the <script> tags, you can use JSoup (which is already a dependency of crawler4j) to parse the DOM tree and remove the <script> tags from the resulting tree.
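A minimal sketch of the JSoup approach, assuming the jsoup library is on the classpath. The class name ScriptStripper and the method textWithoutScripts are hypothetical helpers introduced for illustration and are not part of crawler4j; the html argument stands for the string returned by htmlParseData.getHtml() in the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ScriptStripper {

    // Hypothetical helper (not part of crawler4j): parses the HTML string,
    // drops every <script> element, and returns the remaining body text.
    static String textWithoutScripts(String html) {
        Document doc = Jsoup.parse(html);   // build a DOM tree from the raw HTML
        doc.select("script").remove();      // detach all <script> elements from the tree
        return doc.body().text();           // whitespace-normalized text of <body>
    }

    public static void main(String[] args) {
        // Shortened stand-in for the page in the question.
        String body = "<html><body>"
                + "<div style=\"width: 100%;\"> This question already </div>"
                + "<div id=\"player\"> hi crawler4j </div>"
                + "<script>player = new Clappr.Player({autoPlay: true});</script>"
                + "</body></html>";
        System.out.println(textWithoutScripts(body));
    }
}
```

Inside the crawler's visit method, the same helper could be called with the body variable from the question. Note that text() collapses whitespace, so the text of both div elements comes back as a single string; to print each on its own line, one option is to iterate over doc.select("div") and print the text() of each element.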