I am making a POST request to an API; it works fine in Postman but fails in Scrapy. This is the 400 status error Scrapy logs:
2024-03-19 03:02:08 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <400 https://somesite.com/api/search> Set-Cookie: sUniqueID=20240319100208-50.106.12.146-dj56d4dide; expires=Sun, 19-Mar-2034 10:02:08 GMT; path=/; secure; HttpOnly
2024-03-19 03:02:08 [scrapy.core.engine] DEBUG: Crawled (400) <POST https://somesite.com/api/search> (referer: None)
2024-03-19 03:02:08 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://somesite.com/api/search>: HTTP status code is not handled or not allowed
2024-03-19 03:02:08 [scrapy.core.engine] INFO: Closing spider (finished)
2024-03-19 03:02:08 [scrapy.statscollectors]
This is my Scrapy code:
from scrapy.loader import ItemLoader
from scrapy.http import FormRequest
import scrapy
from some_scraper.items import SomeItem
from scrapy_playwright.page import PageMethod
class SomeScraperSpider(scrapy.Spider):
    name = 'someScraper'

    def start_requests(self):
        url = 'https://somesite.com/api/search'
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)',
            'Content-Type': 'application/json',
            'Host': 'somesite.com',
            'Accept': '*/*',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive',
        }
        frmdata = {
            'token': 'eg7t4q6p6pdv59m1cn22e58vmphiv2',
            'cols': ['colmX', 'colmY'],
            'max': '80',
        }
        yield scrapy.FormRequest(url=url,
                                 callback=self.parse_categories,
                                 headers=headers,
                                 formdata=frmdata,
                                 meta={'playwright': True})
settings.py:
BOT_NAME = "some_scraper"
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
SPIDER_MODULES = ["some_scraper.spiders"]
NEWSPIDER_MODULE = "some_scraper.spiders"
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
ROBOTSTXT_OBEY = False
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
main.py:
from scrapy.crawler import CrawlerProcess
from some_scraper.spiders.src import SomeScraperSpider

def main():
    process = CrawlerProcess(settings={
        'COOKIES_DEBUG': True,
        'COOKIES_ENABLED': True,
        'PLAYWRIGHT_LAUNCH_OPTIONS': {
            # note: the string 'False' is truthy, so these must be Python booleans
            'headless': False,
            'timeout': 2000 * 1000,
        },
        'PLAYWRIGHT_BROWSER_TYPE': 'firefox',
        'PLAYWRIGHT_SETTINGS': {
            'acceptsCookies': True,
        },
    })
    process.crawl(SomeScraperSpider)
    process.start()

if __name__ == '__main__':
    main()
In Postman I make this POST request:
POST https://somesite.com/api/search
In the Postman body, I put in:
{"token": "eg7t4q6p6pdv59m1cn22e58vmphiv2",
"cols": ["colmX", "colmY"],
"max": "80"}
The Headers in Postman show:
Authorization: Bearer eg7t4q6p6pdv59m1cn22e58vmphiv2
Cookie: __RequestVerificationToken=UzaS9tX-IM3KTLlNTsQw9nklSnxplY4ehkAKnIjZw5aJ2BjEZ8oGB7bi6IKQOjdxf-izYKUq2_-g-JxK1_QjuvhOFDuBELzSwlyfol_UcBg1; fb_SessionId=66cuv1enskencmluva229q5ohnpj43; sUniqueID=20240208013723-50.106.12.146-m5mf01hk0k
Postman-Token: <calculated when request is sent>
Content-Type: application/json
Content-Length: <calculated when request is sent>
Host: <calculated when request is sent>
User-Agent: PostmanRuntime/7.36.3
Accept: */*
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
In both Postman and Scrapy, the token is added to the request body before being sent, not to the headers. But even when I add the token and cookie to the Scrapy headers, I get the same 400 result.
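For reference, this is roughly the header-plus-body shape I am trying to reproduce from Postman (a sketch, not my actual spider code; urllib.request is used only to build the request object offline for inspection, nothing is sent):

```python
import json
import urllib.request

token = "eg7t4q6p6pdv59m1cn22e58vmphiv2"
payload = {"token": token, "cols": ["colmX", "colmY"], "max": "80"}

# Headers mirroring the Postman request, with the bearer token added:
headers = {
    "Authorization": f"Bearer {token}",
    "Content-Type": "application/json",
    "Accept": "*/*",
}

# Build (but do not send) the request, to inspect what would go on the wire.
req = urllib.request.Request(
    "https://somesite.com/api/search",
    data=json.dumps(payload).encode("utf-8"),
    headers=headers,
    method="POST",
)
print(req.get_method(), req.full_url)
print(req.get_header("Authorization"))
```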