I'm trying to automate a web search via Python.
The website is behind hCaptcha but I'm using a 2captcha solver.
Although I've replicated the web browser's behavior, I'm still being asked to solve the hCaptcha again.
Here's what I've tried:
import httpx
import trio
from twocaptcha import TwoCaptcha

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:122.0) Gecko/20100101 Firefox/122.0',
    'Referer': 'https://iapps.courts.state.ny.us/nyscef/CaseSearch?TAB=courtDateRange',
    'Origin': 'https://iapps.courts.state.ny.us'
}
API_KEY = 'hidden'


async def solve_captcha():
    solver = TwoCaptcha(API_KEY)
    return solver.hcaptcha(
        sitekey='600d5d8e-5e97-4059-9fd8-373c17f73d11',
        url='https://iapps.courts.state.ny.us/'
    )['code']


async def main():
    async with httpx.AsyncClient(base_url='https://iapps.courts.state.ny.us/nyscef/', headers=headers, follow_redirects=True) as client:
        r = await client.post('CaseSearch?TAB=courtDateRange')
        print('[*] - Solving CAPTCHA!')
        cap = await solve_captcha()
        print('[*] - CAPTCHA Solved')
        # Court: Chautauqua County Supreme Court
        data = {
            'selCountyCourt': '4667226',
            'txtFilingDate': '02/14/2024',
            'g-recaptcha-response': cap,
            'h-captcha-response': cap,
            'btnSubmit': 'Search',
        }
        r = await client.post('https://iapps.courts.state.ny.us/nyscef/CaseSearch?TAB=courtDateRange', data=data)
        with open('r.html', 'w') as f:
            f.write(r.text)


if __name__ == "__main__":
    trio.run(main)
I adjusted your code to solve captchas repeatedly whenever they appeared. After going through 10 captchas in a row, I concluded that the website had detected the scraping and would keep serving captchas indefinitely. For that reason I have created a different solution that works and also saves the money you would otherwise spend on 2captcha fees.
This solution uses Selenium and requires undetected_chromedriver. The driver is open source and can be installed with pip (the PyPI package name is undetected-chromedriver). Using this chromedriver lets you go undetected by almost all modern detection methods. It also saves time by avoiding captchas, and saves money by not paying for 2captcha's services. Here is the code that scrapes your desired page:
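A minimal sketch of this approach, assuming `pip install undetected-chromedriver selenium` has been run and Chrome is installed. The helper names `fetch_search_results` and `save_html` are made up for illustration, and the element names (`selCountyCourt`, `txtFilingDate`, `btnSubmit`) are assumed to match the POST field names from the question; check them against the actual page source before relying on this.

```python
from pathlib import Path

SEARCH_URL = "https://iapps.courts.state.ny.us/nyscef/CaseSearch?TAB=courtDateRange"


def save_html(html: str, path: str = "page.html") -> Path:
    """Write the page source to disk and return the path that was written."""
    out = Path(path)
    out.write_text(html, encoding="utf-8")
    return out


def fetch_search_results(court_value: str, filing_date: str) -> str:
    """Fill in the search form in a real Chrome window and return the result HTML."""
    # Third-party imports live here so save_html() stays usable without a browser.
    import undetected_chromedriver as uc
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import Select, WebDriverWait

    driver = uc.Chrome()
    try:
        driver.get(SEARCH_URL)
        wait = WebDriverWait(driver, 30)

        # ASSUMPTION: the form controls carry the same names as the POST fields.
        court = wait.until(
            EC.presence_of_element_located((By.NAME, "selCountyCourt"))
        )
        Select(court).select_by_value(court_value)
        driver.find_element(By.NAME, "txtFilingDate").send_keys(filing_date)
        driver.find_element(By.NAME, "btnSubmit").click()

        # ASSUMPTION: submitting triggers a full page load, so the old
        # <select> element goes stale once the results page has replaced it.
        wait.until(EC.staleness_of(court))
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

With the values from the question, `save_html(fetch_search_results("4667226", "02/14/2024"))` would open the browser, submit the search, and write the result to the default output file.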
This saves the full HTML of your desired page to "page.html".
NOTE: if you initially get an error about the Chromedriver version not being supported, close your browser and run the module with no browser open.
If you wish to run your prior code and see that it runs into captchas indefinitely, here is the code I used to determine that:
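The retry loop described above can be sketched roughly as follows. This is an illustration, not the exact script: `count_captcha_rounds`, `submit`, and `solve` are hypothetical names, and the check for the string `h-captcha` in the response is an assumption about how a challenge page can be recognised.

```python
from typing import Callable

# ASSUMPTION: a response that still asks for a captcha embeds the hCaptcha
# widget, whose markup contains this string.
CAPTCHA_MARKER = "h-captcha"


def count_captcha_rounds(
    submit: Callable[[str], str],
    solve: Callable[[], str],
    max_rounds: int = 10,
) -> int:
    """Return how many submissions were answered with yet another captcha.

    `submit` takes a solved token, posts the search form, and returns the
    response HTML. `solve` returns a fresh token (e.g. from 2captcha).
    Both are stand-ins supplied by the caller.
    """
    rounds = 0
    for _ in range(max_rounds):
        html = submit(solve())
        if CAPTCHA_MARKER not in html:
            break  # real content came back, so stop paying for solves
        rounds += 1
    return rounds
```

Here `submit` could wrap the `client.post(...)` call from the question and `solve` could wrap `solver.hcaptcha(...)['code']`. A return value equal to `max_rounds` means every attempt was met with another captcha.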