I am trying to get all decrees of the Federal Supreme Court of Switzerland, available at: https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&type=simple_query&query_words=&lang=de&top_subcollection_aza=all&from_date=&to_date=&x=12&y=12 Unfortunately, no API is provided. The CSS selector for the data I want to retrieve is .para
I am aware of http://relevancy.bger.ch/robots.txt.
User-agent: *
Disallow: /javascript
Disallow: /css
Disallow: /hashtables
Disallow: /stylesheets
Disallow: /img
Disallow: /php/jurivoc
Disallow: /php/taf
Disallow: /php/azabvger
Sitemap: http://relevancy.bger.ch/sitemaps/sitemapindex.xml
Crawl-delay: 2
To me it looks like the URL I am looking at is allowed to be crawled; is that correct? In any case, the Federal Court explains that these rules are aimed at big search engines and that individual crawling is tolerated.
I can retrieve the data for a single decree (following https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/):
library(rvest)

url <- 'https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&type=highlight_simple_query&page=1&from_date=&to_date=&sort=relevance&insertion_date=&top_subcollection_aza=all&query_words=&rank=1&azaclir=aza&highlight_docid=aza%3A%2F%2F18-12-2017-6B_790-2017&number_of_ranks=113971'
webpage <- read_html(url)
decree_html <- html_nodes(webpage, '.para')   # the decree text nodes
decree1_data <- html_text(decree_html)        # plain text of one decree
However, since rvest extracts data from only one specific page and my data is spread over multiple pages, I tried Rcrawler (https://github.com/salimk/Rcrawler), but I do not know how to crawl the site structure of www.bger.ch to get all URLs of the decrees.
I checked out the following posts, but could still not find a solution:
I don't do error handling below since that's beyond the scope of this question.
Let's start with the usual suspects:
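(A minimal set for the sketches below; the exact packages are my assumption: httr/rvest for fetching and parsing, dplyr/purrr for wrangling.)

library(httr)
library(rvest)
library(dplyr)
library(purrr)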
We'll define a function that will get us a page of search results by page number. I've hard-coded the search parameters since you provided the URL.
In this function, we fetch the results page, pull out the links to the individual decrees, build a data frame from them, and note whether there appears to be a further page of results. It's pretty straightforward:
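A sketch of such a function. The name get_page(), the result-link selector "div.ranklist_content a" and the has_next heuristic are placeholders I've made up, so inspect the results page and adjust them; the query parameters mirror the search URL from the question.

# Fetch one page of search results and return a data frame of decree links.
get_page <- function(pg_num = 1) {

  res <- GET(
    url = "https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php",
    query = list(
      lang = "de",
      type = "simple_query",
      query_words = "",
      top_subcollection_aza = "all",
      from_date = "",
      to_date = "",
      page = pg_num
    )
  )

  stop_for_status(res)  # fail loudly on HTTP errors

  pg <- content(res, as = "parsed", encoding = "UTF-8")

  # links to the individual decree documents (placeholder selector -- adjust)
  links <- html_nodes(pg, "div.ranklist_content a")

  xdf <- tibble(
    title = html_text(links, trim = TRUE),
    link  = xml2::url_absolute(html_attr(links, "href"), "https://www.bger.ch/")
  )

  # remember which page this was and whether it looked like there are more
  attr(xdf, "page") <- pg_num
  attr(xdf, "has_next") <- nrow(xdf) > 0

  xdf
}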
Make a helper function since I can't stand typing attr(...) and it reads better in use:
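Something along these lines, reading back the attribute set by the get_page() sketch above:

has_next <- function(x) attr(x, "has_next")  # did the last page look non-empty?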
Now, make a scraping loop. I stop at 6 just because; you should remove that logic to scrape everything. Consider doing this in batches, since internet connections are unstable things:
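A sketch of that loop, reusing the hypothetical get_page() and has_next() from above:

pgs <- list()
nxt <- TRUE
pg_num <- 0

while (nxt) {
  pg_num <- pg_num + 1
  pg_df <- get_page(pg_num)
  if (!has_next(pg_df)) nxt <- FALSE   # no more results: stop
  if (pg_num == 6) nxt <- FALSE        # demo cut-off -- remove to scrape everything
  pgs <- append(pgs, list(pg_df))
  Sys.sleep(2)                         # honour the Crawl-delay: 2 from robots.txt
}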
Turn the list of data frames into one big one. NOTE: You should do validity tests before this since web scraping is fraught with peril. You should also save off this data frame to an RDS file so you don't have to do it again.
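For example (the data frame and file names are arbitrary):

search_df <- bind_rows(pgs)                     # one row per decree link
saveRDS(search_df, "bger-search-results.rds")   # cache so you don't have to re-crawl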
With all the links in hand, we'll get the documents.
Define a helper function. NOTE: we aren't parsing here; do that separately. We'll store the inner content <div> HTML text so you can parse it later:
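A sketch, where get_document() and the "div.content" selector are my assumptions -- the question's .para nodes should sit inside whatever container you end up targeting:

# Fetch one decree page and return the raw HTML of its content <div>.
# Parsing the .para nodes out of it is left for a separate step.
get_document <- function(url) {
  res <- GET(url)
  stop_for_status(res)
  content(res, as = "parsed", encoding = "UTF-8") %>%
    html_node("div.content") %>%   # placeholder selector -- adjust after inspection
    as.character()
}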
Here's how to use it. Again, remove head() but also consider doing it in batches:
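Roughly like this, using the search_df built earlier (the names are again mine):

sample_df <- head(search_df, 3) %>%   # drop head() to fetch everything
  mutate(decree_html = map_chr(link, ~{
    Sys.sleep(2)                      # keep honouring the crawl delay
    get_document(.x)
  }))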
You still need error and validity checking in various places and may need to re-scrape pages if there are server errors or parsing issues. But this is how to build a site-specific crawler of this nature.