Error 403 Forbidden while scraping with Python

I need to scrape a website, but I've tried many times with my Python script and it didn't work. I tried with different headers and it didn't work; I also tried with HTTrack and wget and I keep facing the same problem, the 403 Forbidden error. Is there any way to bypass this: a header, proxies, some browser configuration?

I did manage to get the website's HTML with the Python script below, but it doesn't go deeper than the top level and the images come out blurred. If someone knows how to improve the script, I would be glad.

EDIT: Here is the script's code:

import urllib.parse

import cfscrape
from bs4 import BeautifulSoup

base_url = "website_url"  # placeholder: the real site URL goes here

# cfscrape wraps requests and solves Cloudflare's JS challenge
scraper = cfscrape.create_scraper()
response = scraper.get(base_url)

soup = BeautifulSoup(response.content, 'lxml')
# Write with an explicit encoding so non-ASCII pages don't crash the save
with open('index.html', 'w', encoding='utf-8') as f:
    f.write(str(soup))

for img in soup.find_all('img'):
    src = img.get('src')
    if not src:
        continue
    # Resolve relative paths like /img/foo.jpg against the page URL;
    # requesting them as-is is a common source of failed downloads
    img_url = urllib.parse.urljoin(base_url, src)
    name = urllib.parse.unquote(img_url.split('/')[-1])
    img_response = scraper.get(img_url)
    with open(name, 'wb') as img_file:  # close the handle instead of leaking it
        img_file.write(img_response.content)

1 Answer

Answered by Jonathan:

Since I don't have enough info, I can't solve this problem for you outright, but here are some possible problems you might be running into and how to solve them.

For context, my experience here comes from trying to make a Kahoot bot with the requests-HTML module (which allows dynamic JavaScript rendering).

The way sending a request works is that it needs the minimum headers, generally just the device headers and basic auth. I noticed that the site you mentioned as an example requires a sign-in, and part of that auth is generally a cookie. As you probably know, a cookie is a piece of saved data a website keeps about your device; when you log into a site, it generally assigns you a validation cookie, which tells the site you are authorized to be on it (the cookie stays with your browser for a set amount of time). Sadly, there is no other way to bypass this if that is your issue: you would have to create a session first and use it to sign in with your login credentials (though some sites have anti-bot systems in place, like CAPTCHAs). If there is a CAPTCHA during log-in, there is no chance you will be able to scrape data without access to the API (though I don't have much experience with web APIs).

I would also recommend trying an alternative such as urllib3 to see which request types work (GET, POST, HEAD, etc.). It's possible that the specific request you are sending is simply rejected outright, a choice made by the site's creators.

Even if something like the basic request works, you then have to find the data you want to grab and check whether the site is dynamically loaded. To make the session idea concrete, here is a minimal sketch using the requests library; the login URL and form field names are hypothetical placeholders that you would replace with the site's real ones:
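
import requests

session = requests.Session()

# Browser-like headers; many servers answer a bare library
# User-Agent with 403 Forbidden.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html,application/xhtml+xml",
})

# Hypothetical login endpoint and field names: inspect the site's
# real sign-in form to find the actual ones.
login = session.post(
    "https://example.com/login",
    data={"username": "your_user", "password": "your_password"},
)
login.raise_for_status()

# The session keeps whatever auth cookies the server set, so
# later requests reuse them automatically.
page = session.get("https://example.com/some-protected-page")
print(page.status_code)

You can probe which request types the server accepts the same way, e.g. by comparing the status codes of session.request("HEAD", url) and a plain GET.

In terms of dynamically rendered JavaScript, the most likely scenario for your problem, there are several topics you need to learn basic knowledge about to solve it: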

  • Inspect Element
  • Element ID/Path
  • Requests_HTML

Each of these components is crucial for scraping elements off the page: everything shown on the page is part of an element, so whatever you want to grab is one too. Once you render the JavaScript using requests_HTML, you have to make sure you send the cookie (if needed) and any other validation headers (there are plenty of lists online; just search: headers needed to scrape websites). Though I haven't used it before, you can also use Selenium and WebDriver to log in, at least as I understand the documentation. To do that, you grab the XPath I mentioned and then send data to fill out that path (either via an encoded URL based on the variable or by editing it directly). Minimal sketches of both ideas follow; every URL, XPath, and credential in them is a hypothetical placeholder:
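
First, rendering the page with requests-HTML ("website_url" kept as the placeholder from your script):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("website_url")  # placeholder from the original script

# Executes the page's JavaScript; downloads Chromium on first run.
r.html.render()

# After rendering, dynamically inserted elements can be found too.
for img in r.html.find('img'):
    print(img.attrs.get('src'))

And the Selenium log-in idea:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/login")  # hypothetical login page

# Locate the form fields by XPath (use Inspect Element to find the real ones).
driver.find_element(By.XPATH, '//input[@name="username"]').send_keys("your_user")
driver.find_element(By.XPATH, '//input[@name="password"]').send_keys("your_password")
driver.find_element(By.XPATH, '//button[@type="submit"]').click()

# The driver now holds the logged-in session, cookies included.
html = driver.page_source
driver.quit()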

I hope that this helps you figure this out!