I am trying to get the content of a few websites using Scrapy, but all of them return a 403 (Forbidden) response code. The same websites work fine when I make the request with requests.get() as below:
import requests
url = "https://www.name_of_website.com/"
headers = {
"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
}
response = requests.get(url, headers=headers)
print(response.status_code)
Moreover, the websites are accessible from Chrome as usual. I tried using the same headers Chrome sends by setting them in Scrapy's DEFAULT_REQUEST_HEADERS, and it still failed.
I am not sure why Scrapy fails while a plain requests.get() works. I see this behavior with many websites. I also tried the scrapy-fake-useragent middleware, with no success.
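For reference, scrapy-fake-useragent was enabled in settings.py roughly as its README shows (middleware paths, priorities, and the provider name are taken from that README):

```python
# settings.py: enable scrapy-fake-useragent, replacing the built-in
# user-agent and retry middlewares (per the package README).
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
    "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 400,
    "scrapy_fake_useragent.middleware.RetryUserAgentMiddleware": 401,
}
FAKEUSERAGENT_PROVIDERS = [
    "scrapy_fake_useragent.providers.FakeUserAgentProvider",
]
```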
Any clue or solution will be highly appreciated.
I have seen similar questions here, but they didn't help, so I am looking for fresh thoughts from experts in this area.
Thanks
Edit (in reply to @ewoks and @Lakshmanrao Simhadri):
I am trying the following URLs for research purposes; the status code returned by the requests.get() snippet above is noted next to each:
https://www.fastcompany.com/ - 403
https://www.ft.com/ - 200
https://www.theinformation.com/ - 200
https://www.pcmag.com/ - 403
https://www.thestreet.com/ - 403
None of them worked with Scrapy.
My Scrapy code is as simple as below:
import scrapy

class TheinformationSpider(scrapy.Spider):
    name = "theinformation"
    allowed_domains = ["www.theinformation.com"]
    start_urls = ["https://www.theinformation.com/"]

    def parse(self, response):
        print(response)
Right now I am only looking at the response code. The settings I updated are as below:
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Linux; x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "http://www.google.com",
}
I am getting the following output while crawling:
2024-03-08 15:15:54 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.theinformation.com/> (referer: http://www.google.com)
2024-03-08 15:15:54 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.theinformation.com/>: HTTP status code is not handled or not allowed
2024-03-08 15:15:54 [scrapy.core.engine] INFO: Closing spider (finished)
Total articles scraped by "theinformation" = 0, null data = 0
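(Side note for anyone reproducing this: the "Ignoring response" line comes from Scrapy's HttpError spider middleware, which drops non-2xx responses before they reach the spider. To still inspect the 403 body in parse(), the status can be allowed via the standard HTTPERROR_ALLOWED_CODES setting:)

```python
# settings.py: let 403 responses through to the spider's parse()
# instead of being dropped by the HttpError spider middleware.
HTTPERROR_ALLOWED_CODES = [403]
```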
I tried passing the exact headers Chrome sends in the Scrapy request, but it still failed. With a proxy, I am able to get a response. Please have a look at the solution below and let me know your thoughts.
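For completeness, the proxy was wired in through Scrapy's built-in HttpProxyMiddleware, which honors the standard proxy environment variables (it can also be set per request via request.meta["proxy"]). The proxy address below is a placeholder:

```shell
# Placeholder proxy address; substitute a working HTTP(S) proxy.
# Scrapy's built-in HttpProxyMiddleware picks these variables up.
export http_proxy="http://user:pass@proxy.example.com:8080"
export https_proxy="http://user:pass@proxy.example.com:8080"
scrapy crawl theinformation
```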