I am trying to extract the text shown on a webpage. I want the equivalent of manually copying the text and pasting it into Notepad (no formatting).
The problem appears to be that the text shown on the webpage is generated by a number of scripts, and despite my best efforts I am unable to access the text these scripts produce.
The following code downloads the webpage and logs the scripts it contains:
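import android.util.Log
import org.jsoup.Jsoup

private fun getWebpageUsingJsoup(url: String): String {
    // Fetch and parse the static HTML; Jsoup does not run any JavaScript
    val document = Jsoup.connect(url).get()
    document.select("script").forEach {
        Log.d(TAG, "Found script $it")
    }
    return document.body().text()
}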
The above code identifies the scripts but does not attempt to execute them:
<body>
    <script src="runtime.js" type="module"></script>
    <script src="polyfills.js" type="module"></script>
    <script src="vendor.js" type="module"></script>
    <script src="main.js" type="module"></script>
</body>
Following a previous post, "How to use ScriptEngineManager in Android?", I added code to execute the scripts:
val engine = ScriptEngineManager().getEngineByName("rhino")
// note: type=module required to avoid the Google Analytics script
document.select("script[type=module]").forEach {
    val script = Jsoup.parseBodyFragment(it.data())
    Log.d(TAG, "Attempting to execute script $it")
    val returnValue = engine.eval(script.html())
    Log.d(TAG, "Engine returns\n$returnValue")
}
Disappointingly, the output for the four scripts is:
<html>
<head/>
<body/>
</html>
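A likely explanation, offered as a hedged sketch rather than a confirmed diagnosis: each script tag points at an external file through its src attribute, so it has no inline source. In that case Jsoup's data() returns an empty string, and parsing an empty fragment produces only an empty document shell. The sample HTML, the base URL and the helper name describeScripts below are illustrative assumptions, not taken from the real page.

```kotlin
import org.jsoup.Jsoup

// Hypothetical page mirroring the body shown above; the base URL is made up.
val html = """<body><script src="main.js" type="module"></script></body>"""

// For each module script, pair its inline source with the absolute URL of its src.
fun describeScripts(html: String, baseUri: String): List<Pair<String, String>> =
    Jsoup.parse(html, baseUri)
        .select("script[type=module]")
        .map { it.data() to it.absUrl("src") }

fun main() {
    for ((inline, src) in describeScripts(html, "https://example.com/")) {
        // An inline length of 0 means there is no script text to hand to the engine.
        println("inline length=${inline.length}, external source=$src")
    }
}
```

Even if the files were downloaded from those URLs, Rhino provides no browser DOM, so scripts that build the page content would probably still not produce the visible text.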
I am fresh out of ideas and appear to be no closer to obtaining the text shown on the webpage.
Can anyone suggest how to obtain the text?
For completeness, these are the dependencies used to support Jsoup and ScriptEngineManager():
implementation 'org.jsoup:jsoup:1.15.4'
implementation 'io.apisense:rhino-android:1.0'
While this is not (yet) a complete answer, I have managed to get the website text.
The process starts with a WebView with JavaScript enabled.
When the page loads, the onPageFinished() method invokes evaluateJavascript() to extract document.body.textContent.
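The steps above might look roughly like the following sketch. The function name loadPageText and the onText callback are illustrative and not from my actual code. It assumes the WebView is created and used on the main thread and that the app holds the INTERNET permission.

```kotlin
import android.annotation.SuppressLint
import android.webkit.WebView
import android.webkit.WebViewClient

// Illustrative helper (hypothetical name): loads a page and hands back its text.
@SuppressLint("SetJavaScriptEnabled")
fun loadPageText(webView: WebView, url: String, onText: (String) -> Unit) {
    // The page content is built by scripts, so JavaScript must be enabled
    webView.settings.javaScriptEnabled = true
    webView.webViewClient = object : WebViewClient() {
        override fun onPageFinished(view: WebView?, finishedUrl: String?) {
            // The callback receives the result as a JSON-encoded value
            view?.evaluateJavascript("document.body.textContent") { raw ->
                onText(raw)
            }
        }
    }
    webView.loadUrl(url)
}
```

One caveat: onPageFinished() can fire before the page's scripts have finished rendering, so on some pages the text may be incomplete at that point and a short delay or retry might be needed.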
The solution is not perfect because it removes ALL formatting information, including the line feeds that are included when manually copying and pasting the webpage.
Can anyone suggest improvements to retain the formatting?
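One direction that may help, as an untested sketch: read document.body.innerText instead of textContent, since innerText follows the rendered layout and includes line breaks for block elements and br tags. The value passed to the evaluateJavascript() callback is also JSON-encoded, so a string arrives wrapped in quotes with line feeds escaped, and it needs decoding. The names EXTRACT_TEXT_JS and decodeJsResult are my own and purely illustrative.

```kotlin
import org.json.JSONTokener

// Script to pass to evaluateJavascript(); innerText keeps rendered line breaks.
const val EXTRACT_TEXT_JS = "(function() { return document.body.innerText; })();"

/**
 * Decodes the raw value handed to the evaluateJavascript() callback.
 * Returns the plain string with real line feeds, or null when the script
 * did not produce a string (for example the literal "null") or the input
 * cannot be parsed.
 */
fun decodeJsResult(raw: String?): String? =
    raw?.let { runCatching { JSONTokener(it).nextValue() as? String }.getOrNull() }

// Intended use inside onPageFinished(), assuming `view` is the WebView:
//   view.evaluateJavascript(EXTRACT_TEXT_JS) { raw ->
//       val text = decodeJsResult(raw)
//   }
```

Whether the result matches a manual copy and paste exactly will depend on the page's markup and styling, so treat this as a starting point.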