Scrape a Dynamic Website using Java with Selenium?

I'm trying to scrape https://www.rspca.org.uk/findapet#onSubmitSetHere to get a list of all pets for adoption.

I've built web scrapers before using crawler4j but the websites were static.

Since https://www.rspca.org.uk/findapet#onSubmitSetHere is not a static website, how can I scrape it? Is it possible? What technologies should I use and how?

Update:

When you fill in the search form (Select type of pet and Enter postcode/town or county) in the UI, the results are then displayed below the search box.

In the screenshot (not reproduced here), the search bar is highlighted in red and the results are highlighted in black.

I'm trying to scrape the results and also the content of each result.

I've looked at the requests the browser makes to retrieve the results, but from Chrome DevTools it isn't obvious which request fetches them.

tgdavies On

You could use Selenium to extract information from the DOM once a browser has rendered it, but I think a simpler solution is to use "developer tools" to find the request that the browser makes when the "search" button is clicked, and try to reproduce that.

In this case that makes a POST to https://www.rspca.org.uk/findapet?p_p_id=petSearch2016_WAR_ptlPetRehomingPortlets&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&_petSearch2016_WAR_ptlPetRehomingPortlets_action=search

The body of the POST request contains a lot of parameters, including animalType and location. The content-type of the request is application/x-www-form-urlencoded.

To see these parameters, open the "Network" tab in Chrome DevTools, click the "findapet" request (it's the first one in the list when I do this), and open the "Payload" tab. It shows both the query string parameters and the form parameters; animalType and location are among the form parameters.

The response contains HTML.

I would try making a request to that endpoint and then parsing the HTML in the response.
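
As a rough sketch of that approach (not tested against the live site), the request could be built with the `java.net.http` client that ships with Java 11+. The endpoint URL and the `animalType` and `location` parameter names come from the request described above; the class name, the method names and the example values are placeholders I made up. The real form sends many more parameters, and the server may reject the request unless some of them (or cookies from a previous page load) are included, so copy whatever the "Payload" tab shows.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper class; the name is for illustration only.
public class PetSearchSketch {

    // Endpoint observed in the browser's Network tab.
    static final String SEARCH_URL =
        "https://www.rspca.org.uk/findapet"
        + "?p_p_id=petSearch2016_WAR_ptlPetRehomingPortlets"
        + "&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view"
        + "&_petSearch2016_WAR_ptlPetRehomingPortlets_action=search";

    // Encodes the fields as application/x-www-form-urlencoded.
    static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
            .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
            .collect(Collectors.joining("&"));
    }

    // Builds the POST request without sending it.
    static HttpRequest buildRequest(Map<String, String> fields) {
        return HttpRequest.newBuilder(URI.create(SEARCH_URL))
            .header("Content-Type", "application/x-www-form-urlencoded")
            .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
            .build();
    }

    // Sends the request and returns the response body, which should be HTML.
    static String fetchResultsHtml(Map<String, String> fields)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
            client.send(buildRequest(fields), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Calling `fetchResultsHtml` with a map containing `animalType` and `location` (plus whatever other fields the form sends) should return the results markup. You can then hand that string to an HTML parser such as jsoup and select the result elements, much as you would for a static page.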