Amazon product data scraping: inconsistent data

69 Views Asked by ml ds At 23 January 2024 at 04:37

I am learning web scraping. As part of a mini-project, I am scraping Amazon.com for product reviews (review titles, review descriptions, and user names) using the BeautifulSoup and requests libraries. The product URL is: https://www.amazon.com/Modern-Natural-Branch-Scratching-Lifestyle/dp/B07X1T4G5T/ref=cm_cr_arp_d_product_top?ie=UTF8&th=1

I am getting inconsistent data: 14 user names, but only 8 review titles and descriptions. The user names include duplicates, and one user name is not on the page.
One review title is also missing from the page.

The sample code is this:

from bs4 import BeautifulSoup
import requests

product_url = 'https://www.amazon.com/dp/B07X1T4G5T/ref=syn_sd_onsite_desktop_0?ie=UTF8&pd_rd_plhdr=t&aref=3iDDM1P0Nh&th=1'

# Fetch the page and parse the returned HTML
response = requests.get(product_url)
soup = BeautifulSoup(response.text, 'html.parser')

# Every profile-name span anywhere on the page
user_name_tags = soup.find_all('span', class_='a-profile-name')
print('number of users:', len(user_name_tags))

# Every review-title link anywhere on the page
review_title_tags = soup.find_all('a', {'data-hook': 'review-title'})
print('number of review titles:', len(review_title_tags))

I have tried different tags to extract the data and validated the results manually against the page. I am looking for an explanation of why this happens and, if possible, a workaround using the BeautifulSoup and requests libraries.


There are 2 best solutions below

Mitchell On 23 January 2024 at 05:16

Looks like a solid start to the program. Looking at the URL provided, the tags you are looking for can be found with .find() as follows:

Title: .find('a', {'data-hook': 'review-title'})
Description: .find('span', {'data-hook': 'review-body'})
User: .find('span', {'class': 'a-profile-name'})

It might also be easier to first use .find_all() to pull all the reviews: reviews = soup.find_all('div', {'data-hook': 'review'}). You can then use a for loop to move through the reviews, pulling the title, description, and user name from each one.
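A minimal sketch of that loop, run against a small inline HTML sample rather than the live page. The markup in SAMPLE_HTML, the top-reviewer-widget class, and the helper names text_or_none and extract_reviews are all assumptions made for illustration; only the three selectors come from the answer above. It also shows how a page-wide search for a-profile-name can return more names than there are reviews.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for a fetched review page.
# The last block imitates a widget outside the review list that repeats a name.
SAMPLE_HTML = """
<div data-hook="review">
  <span class="a-profile-name">Alice</span>
  <a data-hook="review-title">Great scratcher</a>
  <span data-hook="review-body">My cat loves it.</span>
</div>
<div data-hook="review">
  <span class="a-profile-name">Bob</span>
  <a data-hook="review-title">Sturdy</a>
  <span data-hook="review-body">Holds up well.</span>
</div>
<div class="top-reviewer-widget">
  <span class="a-profile-name">Alice</span>
</div>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def extract_reviews(html):
    """Collect one dict per review container, searching only inside it."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for review in soup.find_all("div", {"data-hook": "review"}):
        rows.append({
            "user": text_or_none(review.find("span", {"class": "a-profile-name"})),
            "title": text_or_none(review.find("a", {"data-hook": "review-title"})),
            "body": text_or_none(review.find("span", {"data-hook": "review-body"})),
        })
    return rows


reviews = extract_reviews(SAMPLE_HTML)

# Page-wide search, as in the question: also picks up names outside reviews.
page_wide_names = BeautifulSoup(SAMPLE_HTML, "html.parser").find_all(
    "span", class_="a-profile-name"
)

print("names found page-wide:", len(page_wide_names))
print("reviews found:", len(reviews))
for row in reviews:
    print(row)
```

Because each field is looked up inside its own review container, the user, title, and body stay aligned per review, and a missing field shows up as None instead of shifting the lists out of step.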

JohanW28 On 23 January 2024 at 18:32

For a beginner project, Amazon may not be the best option. The Python requests library only fetches the HTML that the server returns and does not run JavaScript, while Amazon loads its data dynamically. I checked the HTML behind the link you provided and your code is working as written: there are in fact multiple mentions of the same user name within <span> elements that have "a-profile-name" as their class.

You might have better luck looking through the XHR and Fetch requests that your browser sends to Amazon's servers after you visit the page. To do this, open your browser's developer tools, click on the Network tab, reload the page, and filter by Fetch and XHR. The responses to these requests can contain many types of data, including HTML, JavaScript, and JSON. Looking through them may give you a better way to scrape the user names and review titles.

Good Luck!
