Getting 403 error with Scrapy request, though python 'get' request works fine


I am trying to fetch the content of a few websites using Scrapy, but all of them return a 403 (Forbidden) response code. The same websites work fine when I make the request with requests.get, as below:

import requests
url = "https://www.name_of_website.com/"
headers = {
    "User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
}
response = requests.get(url, headers=headers)
print(response.status_code)

Moreover, the websites are accessible in Chrome as usual. I tried giving Scrapy the same headers Chrome sends via DEFAULT_REQUEST_HEADERS, and it still failed.

I am not sure why Scrapy fails while a plain requests.get() works. I see this behavior with many websites. I also tried scrapy-fake-useragent as a middleware, with no success.

Any clue or solution will be highly appreciated.

I have seen similar questions here, but they didn't help, so I'm looking for fresh thoughts from experts in this area.

Thanks

Edit (answering @ewoks and @Lakshmanarao Simhadri):

I am testing the following URLs (for research purposes); the response code each one returns via requests.get is listed alongside:

https://www.fastcompany.com/        -  403
https://www.ft.com/                 -  200
https://www.theinformation.com/     -  200
https://www.pcmag.com/              -  403
https://www.thestreet.com/          -  403

None of them worked with Scrapy.

My Scrapy code is as simple as below:

class TheinformationSpider(scrapy.Spider):
    name = "theinformation"
    allowed_domains = ["www.theinformation.com"]
    start_urls = ["https://www.theinformation.com/"]

    def parse(self, response):
        print(response)

Right now I am just looking at the response code.

The settings I updated are as below:

DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Linux; x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "http://www.google.com",
}
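One thing worth double-checking alongside the above: Scrapy also has a standalone USER_AGENT setting (whose default identifies the client as Scrapy). Setting it explicitly in settings.py rules out the default UA leaking through, and disabling robots.txt enforcement rules out a separate failure mode — a settings.py sketch, not a guaranteed fix:

```python
# settings.py sketch: replace Scrapy's default "Scrapy/x.y" user agent
# globally, in addition to (or instead of) DEFAULT_REQUEST_HEADERS.
USER_AGENT = (
    "Mozilla/5.0 (Linux; x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
)

# Rules out requests being filtered by robots.txt rather than by the server.
ROBOTSTXT_OBEY = False
```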

I am getting the following output while crawling:

2024-03-08 15:15:54 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.theinformation.com/> (referer: http://www.google.com)
2024-03-08 15:15:54 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.theinformation.com/>: HTTP status code is not handled or not allowed
2024-03-08 15:15:54 [scrapy.core.engine] INFO: Closing spider (finished)
Total articles scrapped by "theinformation" = 0, null data = 0

Answer by Lakshmanarao Simhadri:

I tried passing the exact headers from Chrome in a Scrapy request, but it still failed. Using a proxy, I was able to get a response. Please have a look at the solution below and let me know your thoughts.

from urllib.parse import urlencode
import scrapy

# Get your own api_key from ScrapeOps or another proxy vendor
API_KEY = "api_key"

def get_scrapeops_url(url):
    payload = {'api_key': API_KEY, 'url': url}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url

class FastCompany(scrapy.Spider):
    name = "fastcompany"

    def start_requests(self):
        urls = ["https://www.fastcompany.com/"]
        for url in urls:
            proxy_url = get_scrapeops_url(url)
            yield scrapy.Request(url=proxy_url, callback=self.parse)

    def parse(self, response):
        print(response)
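As a quick sanity check on the URL-building helper above (no network or API key needed), the proxy URL it returns should embed both the key and the percent-encoded target URL as query parameters:

```python
from urllib.parse import urlencode, urlparse, parse_qs

API_KEY = "api_key"  # placeholder; substitute your own key

def get_scrapeops_url(url):
    # Same helper as in the answer: wraps the target URL
    # in a query string for the proxy endpoint.
    payload = {'api_key': API_KEY, 'url': url}
    return 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)

proxy_url = get_scrapeops_url("https://www.fastcompany.com/")
qs = parse_qs(urlparse(proxy_url).query)
print(qs["url"][0])  # the original target URL, round-tripped intact
```

Because urlencode percent-encodes the target URL, the proxy receives it unambiguously even though it contains "://" and "/" characters.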