I'm trying to scrape a number of webpages using newspaper3k and my program is throwing 503 Exceptions. Can anyone help me identify the reason for this and help me get around it? To be exact, I'm not looking to catch these exceptions but to understand why they are occurring and prevent them if possible.
from newspaper import Article
dates = list()
titles = list()
urls = ['https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-06-29',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-06-02',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/fec-mps-hearing-may-21',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-05-06',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/fec-fsr-hearing-may-21',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-03-04',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/fec-2019-20-reserve-bank-annual-review',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2020/speech2020-12-02',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2020/speech2020-10-28',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2020/speech2020-10-22',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2020/speech2020-10-19',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2020/speech2020-09-14']
for url in urls:
    speech = Article(url)
    speech.download()
    speech.parse()
    dates.append(speech.publish_date)
    titles.append(speech.title)
Here's my Traceback:
---------------------------------------------------------------------------
ArticleException Traceback (most recent call last)
<ipython-input-5-217a6cafe26a> in <module>
20 speech = Article(url)
21 speech.download()
---> 22 speech.parse()
23 dates.append(speech.publish_date)
24 titles.append(speech.title)
/opt/anaconda3/lib/python3.8/site-packages/newspaper/article.py in parse(self)
189
190 def parse(self):
--> 191 self.throw_if_not_downloaded_verbose()
192
193 self.doc = self.config.get_parser().fromstring(self.html)
/opt/anaconda3/lib/python3.8/site-packages/newspaper/article.py in throw_if_not_downloaded_verbose(self)
529 raise ArticleException('You must `download()` an article first!')
530 elif self.download_state == ArticleDownloadState.FAILED_RESPONSE:
--> 531 raise ArticleException('Article `download()` failed with %s on URL %s' %
532 (self.download_exception_msg, self.url))
533
ArticleException: Article `download()` failed with 503 Server Error: Service Temporarily Unavailable
for url: https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-06-29
on URL https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-06-29
Here is how you can troubleshoot the 503 Server Error: Service Temporarily Unavailable error with the Python package Requests.

Why are we getting a 503 Server Error?
Let's look at the content being returned by the server.
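A minimal sketch of that inspection, using Requests directly. The marker strings checked below (`cf-content`, `challenge-form`, `cf-browser-verification`) are ones commonly seen in Cloudflare challenge pages; treat them as heuristics, not an exhaustive list.

```python
# Fetch the page with Requests and inspect what the server actually returns.
import requests

# Strings often present in a Cloudflare challenge page (heuristic, not a spec).
CLOUDFLARE_MARKERS = ("cf-content", "challenge-form", "cf-browser-verification")


def looks_like_cloudflare(html: str) -> bool:
    """Heuristic: does the response body look like a Cloudflare challenge page?"""
    return any(marker in html for marker in CLOUDFLARE_MARKERS)


if __name__ == "__main__":
    url = ("https://www.rbnz.govt.nz/research-and-publications/"
           "speeches/2021/speech2021-06-29")
    resp = requests.get(url)
    print(resp.status_code)                  # the 503 seen in the traceback
    print(looks_like_cloudflare(resp.text))  # True if a challenge page came back
```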
If we look at the returned text, we can see that the website is asking your browser to complete a challenge-form. If you look at the additional data points (e.g. cf-content) in the text, you can see that the website is protected by Cloudflare. Bypassing this protection is extremely difficult. Here is one of my recent answers on the complexity of bypassing this protection:
Can't scrape product title from a webpage
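If the 503 is coming from simpler bot filtering rather than a full Cloudflare browser challenge, sending a browser-like User-Agent sometimes helps (newspaper3k exposes this via `Config.browser_user_agent`, which you can pass to `Article`). The sketch below shows the idea with plain Requests; the User-Agent string is an illustrative desktop-Chrome value, and this will not defeat a real Cloudflare challenge.

```python
# Sketch: retry the request with a browser-like User-Agent header.
# This can avoid 503s from naive bot filtering, but NOT a Cloudflare challenge.
import requests


def browser_headers() -> dict:
    # Illustrative desktop-Chrome User-Agent; any current browser string works.
    return {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/91.0.4472.124 Safari/537.36"
        )
    }


if __name__ == "__main__":
    url = ("https://www.rbnz.govt.nz/research-and-publications/"
           "speeches/2021/speech2021-06-29")
    resp = requests.get(url, headers=browser_headers())
    print(resp.status_code)  # still 503 if Cloudflare's challenge is active
```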