The goal is to scrape the information from each job card to build a database. To do this, I'm trying to follow these steps:
- Get the maximum number of existing pages
- Get the ID from each job card so that I can access each one by modifying the base URL
- Save the data to a CSV file with pandas, or create a SQL database (see the sketch after this list)
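For the saving step, something like this minimal sketch is what I have in mind (the column names, example values, and file name are just placeholders):

import pandas as pd

# Placeholder lists standing in for the data the scraper should collect
emp_title_list = ['example title 1', 'example title 2']
emp_id_list = ['OFFER-ID-1', 'OFFER-ID-2']

# Build a DataFrame and write it to a CSV file
df = pd.DataFrame({'title': emp_title_list, 'offer_id': emp_id_list})
df.to_csv('iefp_offers.csv', index=False, encoding='utf-8')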
So far I have tried to get the title from each job card on the first page (10 cards), but the code returns either an empty list or an error message.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd
#Instantiate the webdriver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
#Define url
url = 'https://iefponline.iefp.pt/IEFP/pesquisas/search.do'
# load the web page
driver.get(url)
# maximum time to wait when locating elements, in seconds
driver.implicitly_wait(15)
# collect the data inside the main results block
contents = driver.find_element(By.ID, 'resultados-pesquisa')
# find all job cards (their full class name on the page is 'offer-card horizontal')
emp_offers = contents.find_elements(By.CLASS_NAME, 'offer-card')
emp_title_list = []
emp_id_list = []
for emp_offer in emp_offers:
    offer_title = emp_offer.get_attribute('title')
    emp_title_list.append(offer_title)
    offer_id = emp_offer.find_element(By.XPATH, './/div[contains(@class, "offer-code")]/span[2]').text
    emp_id_list.append(offer_id)
print(emp_title_list)
print(emp_id_list)
# Close the WebDriver
driver.quit()
['', '', '', '', '', '', '', '', '', '']
[None, None, None, None, None, None, None, None, None, None]
or
"DevTools listening on ws://127.0.0.1:65003/devtools/browser/327b17e5-a97d-4d84-9ae0-c1c03122286a
Traceback (most recent call last):
  File "c:\Users\dbelt\Documents\scrape\selenium_iefp.py", line 36, in <module>
    offer_id = emp_offer.find_element(By.XPATH, './/div[contains(@class, "offer-code")]/span[2]').text
  File "C:\Users\dbelt\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py", line 416, in find_element
    return self._execute(Command.FIND_CHILD_ELEMENT, {"using": by, "value": value})["value"]
  File "C:\Users\dbelt\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py", line 394, in _execute
    return self._parent.execute(command, params)
  File "C:\Users\dbelt\anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 344, in execute
    self.error_handler.check_response(response)
  File "C:\Users\dbelt\anaconda3\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 229, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":".//div[contains(@class, "offer-code")]/span[2]"}
(Session info: chrome=117.0.5938.134); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
The result depends on whether I try to get the information with get_attribute or with XPath.
I also noticed that when I copy the XPath from this site, the path is extremely long compared with other websites.
Lastly, many class names have spaces in them, and I don't know the best way to use find_elements with this kind of class name.
It appears that emp_title_list ends up as a list of empty strings because the element you are selecting by its class has no "title" attribute; you need to locate the element that actually carries the title inside the card with that class. Regarding the XPath issue, it seems that your XPath expression does not match the actual structure of the page.
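For example, here is a minimal sketch of that idea, reusing emp_offers, emp_title_list and emp_id_list from your code; the nested tag names and classes below are assumptions and need to be checked against the card's real HTML:

for emp_offer in emp_offers:
    # Hypothetical: assume the visible title lives in a nested <a> element of the card;
    # inspect the card's HTML and adjust this selector to the real structure.
    title_element = emp_offer.find_element(By.CSS_SELECTOR, 'a')
    emp_title_list.append(title_element.get_attribute('title') or title_element.text)
    # Guard the offer-code lookup with find_elements so a missing element
    # does not raise NoSuchElementException for cards laid out differently.
    code_spans = emp_offer.find_elements(By.CSS_SELECTOR, 'div.offer-code span')
    emp_id_list.append(code_spans[-1].text if code_spans else None)

Using find_elements here returns an empty list instead of raising, which makes it easier to spot which cards deviate from the expected layout.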
If you want to locate elements using their class attribute and the elements have multiple class names separated by spaces, replace the spaces with dots (.) in the value you pass to By.CLASS_NAME (or, equivalently, use By.CSS_SELECTOR with a dot-separated selector such as '.offer-card.horizontal'). Here's the corrected code: