I am learning web scraping by fetching product listings from Etsy and saving them to a JSON file. However, the number of products I can retrieve differs between a search query page and an Etsy-defined category page. When I scrape a category page, such as https://www.etsy.com/uk/c/electronics-and-accessories/gadgets?ref=catnav-11049, I get 47 results (still fewer than the 64 shown on the page). When I scrape a search query page, such as https://www.etsy.com/uk/search?q=gadgets&ref=search_bar, I get only 7 results.
I suspect this is because items are loaded differently on search query pages and category pages. To work around it, I ran Splash in Docker to render the JavaScript and configured Scrapy to send the search query page requests through Splash. However, I can still scrape only a limited number of products from the search query page, even though I can see more products when I view it in a browser.
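For context, a scrapy-splash setup of this kind usually relies on a handful of entries in the project's settings.py. The following is only a sketch based on the conventions in the scrapy-splash README. The localhost URL is an assumption (Splash started with Docker on its default port 8050), so it may not match every setup.

```python
# settings.py (sketch): Splash-related settings as described in the scrapy-splash README.
# Assumption: Splash is running locally, e.g. via
#   docker run -p 8050:8050 scrapinghub/splash
SPLASH_URL = 'http://localhost:8050'

# Middlewares that rewrite requests so they go through Splash.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Avoids storing duplicate Splash arguments with every request.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Duplicate filter that takes the Splash arguments into account.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```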
import logging

import scrapy
from scrapy_splash import SplashRequest


class EtsySpider(scrapy.Spider):
    name = 'etsy'
    start_urls = ['https://www.etsy.com/c/electronics-and-accessories/gadgets?']
    item_id = 0  # running counter used as a local id for each scraped item

    def start_requests(self):
        # Send every start URL through Splash so the rendered HTML is returned.
        for url in self.start_urls:
            self.log(f"Splash request: {url}", level=logging.INFO)
            yield SplashRequest(url, self.parse, endpoint='render.html', args={'wait': 2})

    def parse(self, response):
        # Each matching div is treated as one product card.
        for product in response.css('div.js-merch-stash-check-listing'):
            yield {
                'id': self.item_id,
                # default='' avoids an AttributeError when the title node is missing
                'name': product.css('h3.wt-text-caption::text').get(default='').strip(),
                'category': response.css('#global-enhancements-search-query::attr(value)').get(),
                'price': product.css('span.currency-value::text').get(),
                # ::attr(href) with .get() returns None instead of raising KeyError
                'item_link': product.css('a.listing-link::attr(href)').get(),
                'img_link': product.css('img.wt-width-full::attr(src)').get(),
                'listing_id': product.css('div.js-merch-stash-check-listing.v2-listing-card::attr(data-listing-id)').get(),
            }
            self.item_id += 1

        # Pagination, disabled until a single page scrapes completely:
        # next_buttons = response.css('a.wt-btn.wt-btn--small.wt-action-group__item.wt-btn--icon')
        # next_page_button = next_buttons[-1]
        # next_page_link = next_page_button.css('::attr(href)').get()
        # if next_page_link is not None:
        #     yield response.follow(next_page_link, callback=self.parse)
This is my spider class for scraping the site. Note that I have commented out the section at the bottom that follows the next-page link, because I wanted to successfully scrape one entire page first.
I'm not sure whether the mistake is in the way I'm scraping or in the CSS selectors I'm using. Can anyone explain why I might be seeing this behavior and suggest how to scrape all products from the Etsy search query page?
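To narrow this down, one check would be to save the HTML that Scrapy actually receives and count how many elements carry the class the spider selects on. If that count already matches the 7 or 47 items, the missing products are not in the downloaded markup at all, rather than being missed by the selectors. Below is a minimal sketch using only the standard library. The helper names (ClassCounter, count_listing_cards) and the sample markup are hypothetical and only illustrate the idea.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags of a given name whose class attribute contains a given class."""

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # The class attribute may be absent or empty, so fall back to ''.
        classes = (dict(attrs).get('class') or '').split()
        if self.class_name in classes:
            self.count += 1


def count_listing_cards(html, tag='div', class_name='js-merch-stash-check-listing'):
    """Return how many <tag> elements in html carry class_name."""
    parser = ClassCounter(tag, class_name)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical sample markup: two of the three divs carry the target class.
sample_html = """
<div class="js-merch-stash-check-listing v2-listing-card" data-listing-id="1"></div>
<div class="v2-listing-card" data-listing-id="2"></div>
<div class="js-merch-stash-check-listing" data-listing-id="3"></div>
"""
print(count_listing_cards(sample_html))  # prints 2
```

In the spider, the same function could be given response.text inside parse, and the result compared with the number of product cards visible in the browser for the same URL.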