I am learning web scraping. As a part of the mini-project, I am scraping the Amazon.com website for product reviews, review titles, review descriptions, and user names using BeautifulSoup and requests libraries. The product URL is: https://www.amazon.com/Modern-Natural-Branch-Scratching-Lifestyle/dp/B07X1T4G5T/ref=cm_cr_arp_d_product_top?ie=UTF8&th=1
- I am getting inconsistent data: 14 user names, but only 8 review titles and descriptions. The user names include duplicates, and one user name is not on the page at all.
- One review title is also missing from the page.
The sample code is this:
from bs4 import BeautifulSoup
import requests
product_url='https://www.amazon.com/dp/B07X1T4G5T/ref=syn_sd_onsite_desktop_0?ie=UTF8&pd_rd_plhdr=t&aref=3iDDM1P0Nh&th=1'
response=requests.get(product_url)
soup=BeautifulSoup(response.text,'html.parser')
user_name_tags=soup.find_all('span',class_='a-profile-name')
print('number of users', len(user_name_tags))
review_title_tags=soup.find_all('a',{'data-hook':'review-title'})
print('number of review titles', len(review_title_tags))
I have tried different tags to extract the data and validated the results manually against the page. I am looking for an explanation of why this happens and, if possible, a workaround using the BeautifulSoup and requests libraries.
Looks like a solid start to the program. Looking at the URL provided, the tags you are looking for can be found with .find() as follows:
.find('a', {'data-hook': 'review-title'})
.find('span', {'data-hook': 'review-body'})
.find('span', {'class': 'a-profile-name'})
It might also be easier to first use .find_all() to pull all the reviews using reviews = soup.find_all('div', {'data-hook': 'review'}). You can then use a for loop to move through all the reviews, pulling the title, description, and user name from each one.
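The loop described above could look roughly like the sketch below. It runs against a small made-up HTML snippet (SAMPLE_HTML is hypothetical and much simpler than the real page), and the helper names text_or_none and extract_reviews are my own. The extra a-profile-name span outside the review blocks is an assumption about the page layout, included to show how a page-wide find_all() could return more user names than titles.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real product page.
# The first a-profile-name span sits outside any review block (an assumed layout).
SAMPLE_HTML = """
<div>
  <span class="a-profile-name">Sidebar User</span>
  <div data-hook="review">
    <span class="a-profile-name">Alice</span>
    <a data-hook="review-title"><span>Great scratcher</span></a>
    <span data-hook="review-body"><span>My cat loves it.</span></span>
  </div>
  <div data-hook="review">
    <span class="a-profile-name">Bob</span>
    <span data-hook="review-body"><span>Arrived damaged.</span></span>
  </div>
</div>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def extract_reviews(html):
    """Collect user name, title, and body from each review container."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Search inside each review block so the three fields stay aligned.
    for review in soup.find_all("div", {"data-hook": "review"}):
        rows.append({
            "user": text_or_none(review.find("span", {"class": "a-profile-name"})),
            "title": text_or_none(review.find("a", {"data-hook": "review-title"})),
            "body": text_or_none(review.find("span", {"data-hook": "review-body"})),
        })
    return rows


reviews = extract_reviews(SAMPLE_HTML)
for row in reviews:
    print(row)
```

Because every lookup is scoped to one review container, a review with a missing title yields None for that field instead of shifting the lists out of step, and user names that appear elsewhere on the page are ignored. On the real page you would pass response.text to extract_reviews instead of SAMPLE_HTML.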