How to Webscrape from React Client with <script>s Being Run?


I would like to web-scrape a website that the user enters into an input field, collect all the links on the page, and then recursively explore the site. However, since most websites nowadays rely on JS, I would like to first run all of the <script>s on the page before querying for the links (<a> elements or document.links).

From the client side, and with React, I couldn't find anything that does this; perhaps I'm just not searching for the right terms. In general, every approach runs into the browser's security restrictions, i.e. the same-origin policy.

Is there no client-side JS package for creating a safe virtual DOM — not talking about React's here! — for executing the other website's <script>s?

So far, what I've tried:

  • Puppeteer: only works on the server side, as far as I know.
  • jsdom: creates the virtual DOM I was looking for, but also only works on the server side.
  • <iframe>: due to the same-origin policy, you just can't access its inner HTML if the src is on another domain.
  • DOMParser: here the <script>s in the parsed document are marked as non-executable by the parser itself, so the page's JS never runs (a sketch of this attempt follows the list).
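For reference, a minimal sketch of that last attempt (run inside an async function; the URL is a placeholder, and the fetch itself is already blocked for most cross-origin sites unless they send permissive CORS headers):

const response = await fetch('https://example.com'); // placeholder URL
const html = await response.text();
const doc = new DOMParser().parseFromString(html, 'text/html');
console.log(doc.links.length); // only the links in the static HTML; none of the page's JS has run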

There is 1 answer below

Hermanboxcar

Interesting dilemma.

This may just be a framework for your application, with some details still to be whittled out, but here is an idea: create a Node.js application and use Express to serve a page on a port. Use this library: https://www.npmjs.com/package/node-iframe to load the site you want to scrape into an iframe. Because the page is fetched and re-served through your own backend, it also helps you get around CORS and frame-blocking restrictions, so you can load just about any site. Then use client-side code to get the body of the iframe:

// Assumes the iframe element has id "frame" and that its src is served from your own
// origin (e.g. via node-iframe), so the browser allows access to its document.
let frameObj = document.getElementById('frame');
let frameContent = frameObj.contentWindow.document.body.innerHTML;

(code courtesy of https://www.tutorialspoint.com/How-to-get-the-body-s-content-of-an-iframe-in-JavaScript)
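For the server side, a minimal sketch of how the framed page could be served. This assumes the middleware-style API shown in the node-iframe README (app.use(createIframe) plus res.createIframe); verify the exact calls against the package docs:

const express = require('express');
const createIframe = require('node-iframe'); // with CommonJS builds you may need .default

const app = express();
app.use(createIframe); // adds res.createIframe to responses (per the package README)

// Re-serves the target page from your own origin so the browser can read the frame.
app.get('/iframe', (req, res) => {
    res.createIframe({ url: req.query.url }); // e.g. /iframe?url=https://example.com
});

app.listen(3000);

The page you serve to the user would then embed something like <iframe id="frame" src="/iframe?url=https://example.com">, which is the element the snippet above reads.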

Be sure to do this after a delay, to ensure that the JS has loaded everything in before you read the innerHTML. Then relay the text to your backend in some way; one option is a POST request from the frontend carrying the innerHTML to another route like /receive, and you can catch that using something like this:

const bodyParser = require('body-parser');
app.use(bodyParser.json()); // needed so req.body is populated from the JSON POST

app.post('/receive', (req, res) => {
    const receivedVariable = req.body.variable; // the innerHTML sent from the frontend
    console.log('Received Variable:', receivedVariable);
    res.send('success');
});
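The matching frontend request isn't shown above; a minimal sketch, assuming the same /receive route and a JSON body whose variable field carries the innerHTML:

// Sends the captured innerHTML to the /receive route shown above.
fetch('/receive', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ variable: frameContent }),
});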

Keep in mind that you will need three packages: node-iframe, body-parser, and express to achieve this.

I know that this is not in any way a "conventional" web scraper, and it may be slow compared to other options, but I believe it is relatively risk-free, as it runs the website in an iframe in the browser instead of executing the site's scripts on the backend, which would be like using eval in a scraper.

For your recursive aspect: you can scrape by getting the innerHTML every second or so using setInterval on the frontend, and in that way record the innerHTML changes as the user interacts with the website in the iframe.
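Putting that together with the original goal of collecting links, a rough sketch of the polling loop (assuming the iframe id "frame" and the /receive route from above):

// Every second, read the framed document, collect its links, and ship them to the backend.
setInterval(() => {
    const frameDoc = document.getElementById('frame').contentWindow.document;
    const links = Array.from(frameDoc.links, (a) => a.href); // includes links added by the page's JS
    fetch('/receive', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ variable: frameDoc.body.innerHTML, links }),
    });
}, 1000);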