How to scrape the price from dynamically updated webpages?


I have a problem when trying to scrape a price from dynamically updated web pages: the lion's share of the HTML is never received with approaches like URLConnection, Jsoup, or HtmlUnit. I don't know much about web scraping, but I guess the problem is that online shops such as Auchan and Silpo use JavaScript and AJAX to load the main product information, so the document I download is missing the data that gets filled in later (whether because of redirects or delayed requests). So the question is: how can I scrape the price from the links above?

I have already tried several approaches:

  1. UrlConnection

        URL url;
        try {
            url = new URL("https://auchan.ua/govjadina-v-kartofel-nom-pjure-so-svekloj-hipp-6440-220-g-297668/");
            URLConnection con = url.openConnection();
            // Read the raw response body line by line and dump it to a file for inspection
            try (BufferedReader br = new BufferedReader(
                         new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8));
                 FileWriter fileWriter = new FileWriter("output.html")) {
                String line;
                while ((line = br.readLine()) != null) {
                    fileWriter.write(line + "\n");
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    

    Runs fine, but returns HTML without the price data.

  2. Jsoup

    Document document = null;
    String link = "https://auchan.ua/govjadina-v-kartofel-nom-pjure-so-svekloj-hipp-6440-220-g-297668/";
    try {
        document = Jsoup.connect(link).get();
    } catch (IOException e) {
        e.printStackTrace();
    }
    if (document != null) {
        try (FileWriter fileWriter = new FileWriter("output.html")) {
            fileWriter.write(document.toString());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

Returns the same.

  3. HtmlUnit

    String link = "https://auchan.ua/govjadina-v-kartofel-nom-pjure-so-svekloj-hipp-6440-220-g-297668/";
    WebClient webClient = new WebClient(BrowserVersion.CHROME);
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webClient.setAjaxController(new NicelyResynchronizingAjaxController());
    webClient.waitForBackgroundJavaScriptStartingBefore(5000);

    HtmlPage htmlPage = null;
    try {
        htmlPage = webClient.getPage(link);
        webClient.waitForBackgroundJavaScript(5000);
    } catch (IOException e) {
        e.printStackTrace();
    }
    if (htmlPage!=null){
        try (FileWriter fileWriter = new FileWriter("output.html")) {
            fileWriter.write(Jsoup.parse(htmlPage.asXml()).toString());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

Returns a little more, including some JavaScript tags, but still nothing useful. Also, the code above throws so many exceptions that they don't even fit in the console.
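
Most of that noise seems to be HtmlUnit itself logging every CSS/script error it hits on the page rather than my own code failing. A minimal sketch of how that logging can apparently be silenced (assuming the classic com.gargoylesoftware HtmlUnit used above):

    import java.util.logging.Level;
    import java.util.logging.Logger;
    import org.apache.commons.logging.LogFactory;

    // Route Commons Logging (which HtmlUnit uses) to a no-op logger
    LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log",
            "org.apache.commons.logging.impl.NoOpLog");
    // And mute HtmlUnit's own java.util.logging output
    Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);

    WebClient webClient = new WebClient(BrowserVersion.CHROME);
    webClient.getOptions().setCssEnabled(false);   // skip CSS processing errors entirely
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webClient.getOptions().setPrintContentOnFailingStatusCode(false);

That quiets the console, but it doesn't change what HTML comes back.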

I also tried to set up agents like this:

    java.net.URLConnection conn = url.openConnection();
    conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");

and this:

    System.setProperty("http.agent", "");

1 Answer

Answered by Rob Evans

You need to use Chrome's Dev Tools to view the HTTP requests and responses.

The page loads up tons of JavaScript. This in turn churns out a whole load of HTTP requests and waits for the responses; the first one that looks interesting is:

https://auchan.ua/graphql, which is a POST request with an important HTTP header, referer: https://auchan.ua/govjadina-v-kartofel-nom-pjure-so-svekloj-hipp-6440-220-g-297668/. The response body for that request is: {"data":{"urlResolver":{"type":"PRODUCT","id":297668}}}
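
If you want to replay that request from Java rather than from the browser, something along these lines should work. This is only a sketch: the GraphQL payload has to be copied verbatim from the request body you see in DevTools (the placeholder string below is not the real query).

    import java.io.*;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    String productUrl = "https://auchan.ua/govjadina-v-kartofel-nom-pjure-so-svekloj-hipp-6440-220-g-297668/";
    // Placeholder -- paste the exact urlResolver query body captured in DevTools
    String payload = "{\"query\":\"<urlResolver query copied from DevTools>\"}";

    HttpURLConnection con = (HttpURLConnection) new URL("https://auchan.ua/graphql").openConnection();
    con.setRequestMethod("POST");
    con.setRequestProperty("Content-Type", "application/json");
    con.setRequestProperty("Referer", productUrl);   // the header that matters here
    con.setDoOutput(true);
    try (OutputStream os = con.getOutputStream()) {
        os.write(payload.getBytes(StandardCharsets.UTF_8));
    }

    // Expecting something like {"data":{"urlResolver":{"type":"PRODUCT","id":297668}}}
    StringBuilder body = new StringBuilder();
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
        String line;
        while ((line = in.readLine()) != null) {
            body.append(line);
        }
    }
    System.out.println(body);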

Taking that product ID value and searching for it in the subsequent responses, I found several of them contain it. The responses are full of escaped Unicode characters, but if you open the URLs in a browser the content is rendered.

The URL that starts with auchan.ua/graphql/?query=query%20getProductDetail... looked promising, and sure enough its special_price field matches what's displayed on the page. So you'd need to find a way of generating/extracting these URLs from the initial page source.

link to product details
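
To make that concrete, here's a rough sketch of fetching that getProductDetail URL and pulling out special_price. The detailUrl value is a placeholder you have to copy from DevTools, and the regex is only a crude illustration; check the actual shape of the JSON you get back.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;
    import java.nio.charset.StandardCharsets;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Placeholder -- paste the full auchan.ua/graphql/?query=query%20getProductDetail... URL from DevTools
    String detailUrl = "<getProductDetail URL copied from DevTools>";

    URLConnection con = new URL(detailUrl).openConnection();
    con.setRequestProperty("Referer",
            "https://auchan.ua/govjadina-v-kartofel-nom-pjure-so-svekloj-hipp-6440-220-g-297668/");

    StringBuilder json = new StringBuilder();
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
        String line;
        while ((line = in.readLine()) != null) {
            json.append(line);
        }
    }

    // Crude extraction: look for the special_price field in the raw JSON text
    Matcher m = Pattern.compile("\"special_price\"\\s*:\\s*([0-9.]+|null)").matcher(json);
    if (m.find()) {
        System.out.println("special_price = " + m.group(1));
    }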

You may also find this response I gave useful for processing JSON data.
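
The general pattern with a proper JSON library looks like the sketch below; it assumes org.json is on the classpath, and the field names under the top-level data object have to be taken from the response you actually captured.

    import org.json.JSONObject;

    // The response seen above has a top-level "data" object; pretty-print it
    // to discover the nesting, then drill down to the price fields you need.
    JSONObject root = new JSONObject(json.toString());
    JSONObject data = root.getJSONObject("data");
    System.out.println(data.toString(2));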

The second shop you linked to requires a username/password, but the process for getting the data will likely be very similar: use the dev tools to view the HTTP requests, work out where the price info is coming from (find the value in one of the responses), then try to recreate the same request from the initial URL and the response it returns.

Good luck!