How to crawl recursively in Colly?


I am trying to implement a crawler in Colly that can crawl several articles from a news source. I have given two start URLs. The behaviour I want is that, through pagination and outlinks, Colly keeps crawling for webpages (potentially indefinitely).

However, the behaviour I am observing is that Colly stops visiting after one level: it does not paginate beyond the first page. I am certain the links, XPath expressions, etc. are accurate. I also set the max depth to 10 when initializing the collector, but it doesn't seem to work as expected.

Here is my code:

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {
    // Instantiate default collector

    c := colly.NewCollector(
        // Restrict crawling to the target domain
        colly.AllowedDomains("www.zerohedge.com"),

        // Cache responses to prevent multiple download of pages
        // even if the collector is restarted
        colly.CacheDir("./cache"),
        colly.MaxDepth(10),
        colly.Async(true),
    )

    // On every article-title link with an href attribute, call the callback
    c.OnXML("//h2[contains(@class,'Article_title')]//a[@href]", func(e *colly.XMLElement) {
        link := e.Attr("href")
        // fmt.Println(link)
        // start scraping the page under the link found
        e.Request.Visit(link)
    })

    c.OnXML("//div[contains(@class, 'SimplePaginator')]//a[@href]", func(e *colly.XMLElement) {
        link := e.Attr("href")
        // fmt.Println(link)
        // start scraping the page under the link found
        e.Request.Visit(link)
    })

    c.OnXML("//header[contains(@class, 'ArticleFull')]//h1/text()", func(e *colly.XMLElement) {
        // fmt.Println(e.Text)
        // start scraping the page under the link found
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL.String())
        // Note: do not call r.Visit(r.URL.String()) here; the URL is
        // already being requested, and colly filters re-visits anyway.
    })

    c.Visit("https://www.zerohedge.com/covid-19")
    c.Visit("https://www.zerohedge.com/medical")

    c.Wait()
}

Any help in this would be appreciated.

1 Answer

Answered by namefree nargrom:

After a bit of investigation, I found out that the SimplePaginator element does not always exist. The website uses some trickery to stop crawlers.
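If the paginator element is missing from some pages, one workaround is to synthesize the next-page URL yourself instead of relying on the `SimplePaginator` link, and pass the result to `e.Request.Visit`. Here is a minimal sketch; it assumes the listing pages accept a `page` query parameter, which is an assumption about the site's URL scheme that you should verify against the real paginator links:

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// nextPageURL builds the next listing page by incrementing the "page"
// query parameter. The parameter name "page" is an assumption about the
// site's URL scheme; adjust it to match the real paginator links.
func nextPageURL(current string) (string, error) {
	u, err := url.Parse(current)
	if err != nil {
		return "", err
	}
	q := u.Query()
	page := 1 // a bare listing URL is treated as page 1
	if p := q.Get("page"); p != "" {
		if n, err := strconv.Atoi(p); err == nil {
			page = n
		}
	}
	q.Set("page", strconv.Itoa(page+1))
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	next, err := nextPageURL("https://www.zerohedge.com/covid-19")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(next) // https://www.zerohedge.com/covid-19?page=2
}
```

You could call a helper like this from the handler that matches the article list (which does exist on every listing page) and visit the result, so crawling continues even when the paginator element is absent.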