Amazon product data scraping: inconsistent data


I am learning web scraping. As part of a mini-project, I am scraping Amazon.com for product reviews (review titles, review descriptions, and user names) using the BeautifulSoup and requests libraries. The product URL is: https://www.amazon.com/Modern-Natural-Branch-Scratching-Lifestyle/dp/B07X1T4G5T/ref=cm_cr_arp_d_product_top?ie=UTF8&th=1

  1. I am getting inconsistent data: 14 user names, but only 8 review titles and descriptions. The user names include duplicates, and one of them does not appear on the page.
  2. One review title that is on the page is also missing from the results.

The sample code is this:

from bs4 import BeautifulSoup
import requests

product_url='https://www.amazon.com/dp/B07X1T4G5T/ref=syn_sd_onsite_desktop_0?ie=UTF8&pd_rd_plhdr=t&aref=3iDDM1P0Nh&th=1'

response=requests.get(product_url)
soup=BeautifulSoup(response.text,'html.parser')

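# Every <span class="a-profile-name"> on the page, not only those inside reviews
user_name_tags=soup.find_all('span',class_='a-profile-name')
print('number of users')
print(len(user_name_tags))

review_title_tags=soup.find_all('a',{'data-hook':'review-title'})
print('number of review titles')
print(len(review_title_tags))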

I have tried different tags to extract the data and checked the results manually against the page. I am looking for an explanation of why this happens and, if possible, a workaround using the BeautifulSoup and requests libraries.

There are 2 best solutions below

Mitchell On

Looks like a solid start to the program. Looking at the URL provided, these are the tags you are after, each fetched with .find():

  • Title: .find('a', {'data-hook': 'review-title'})
  • Description: .find('span', {'data-hook': 'review-body'})
  • User: .find('span', {'class': 'a-profile-name'})

It might also be easier to first use .find_all() to pull all the reviews using reviews = soup.find_all('div', {'data-hook': 'review'}). You can then use a for loop to move through all the reviews pulling title, description, and user name.
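The per-review loop described above could be sketched roughly like this. It is only a sketch: SAMPLE_HTML is a made-up, simplified stand-in for the review markup (not real Amazon HTML), and extract_reviews and text_or_none are hypothetical helper names. It assumes the data-hook values listed above are present; the live page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the review markup (not real Amazon HTML).
SAMPLE_HTML = """
<div data-hook="review">
  <span class="a-profile-name">Alice</span>
  <a data-hook="review-title">Great scratcher</a>
  <span data-hook="review-body">My cat loves it.</span>
</div>
<div data-hook="review">
  <span class="a-profile-name">Bob</span>
  <a data-hook="review-title">Too small</a>
  <span data-hook="review-body">Fine for kittens only.</span>
</div>
<span class="a-profile-name">Alice</span>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag else None


def extract_reviews(html):
    """Collect one dict per review container, searching only inside it."""
    soup = BeautifulSoup(html, "html.parser")
    reviews = []
    for review in soup.find_all("div", {"data-hook": "review"}):
        reviews.append({
            "user": text_or_none(review.find("span", {"class": "a-profile-name"})),
            "title": text_or_none(review.find("a", {"data-hook": "review-title"})),
            "description": text_or_none(review.find("span", {"data-hook": "review-body"})),
        })
    return reviews


reviews = extract_reviews(SAMPLE_HTML)
for r in reviews:
    print(r)
```

Because each .find() is scoped to a single review container, the stray "Alice" span outside the containers is ignored, and user, title, and description stay aligned even when one of them is missing from a review.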

JohanW28 On

For a beginner project, Amazon may not be the best option. The Python requests library only fetches the static HTML response and does not run JavaScript, while Amazon loads part of its data dynamically. I checked the HTML behind the link you provided and your code is working as written: the same user name really does appear more than once in <span> elements with the class "a-profile-name".
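One way to confirm this is to count how often each name occurs. The sketch below runs on a made-up HTML snippet (SAMPLE_HTML and the "top-reviews-widget" id are assumptions for illustration, not taken from the real page), where one name appears both in a separate widget and in the review itself.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical markup: "Alice" appears once in a widget and once in her review.
SAMPLE_HTML = """
<div id="top-reviews-widget"><span class="a-profile-name">Alice</span></div>
<div data-hook="review"><span class="a-profile-name">Alice</span></div>
<div data-hook="review"><span class="a-profile-name">Bob</span></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Page-wide search: picks up every matching span, wherever it sits.
names = [t.get_text(strip=True)
         for t in soup.find_all("span", class_="a-profile-name")]

counts = Counter(names)
duplicates = {name: n for name, n in counts.items() if n > 1}

print(len(names), "tags,", len(counts), "distinct names")
print("repeated:", duplicates)
```

If the number of tags is larger than the number of distinct names, the extra matches are coming from elsewhere on the page rather than from additional reviews.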

You might have better luck looking through the XHR and Fetch requests that your browser sends to Amazon's servers after you visit the page. To do this, open your browser's developer tools, click on the Network tab, reload the page, and filter by Fetch/XHR. The responses can contain many types of data, including HTML, JavaScript, and JSON. Looking through them may give you a better way to scrape the user names and review titles.

Good Luck!