The goal is to scrape the information from each job card to build a database. To do this, I'm trying to follow these steps:
- Get the maximum number of existing pages
- Get the ID from each job card so that I can access each one by modifying the base URL
- Save the data to a CSV file with pandas, or create a SQL database (see the sketch after this list)
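For the saving step, something like this minimal sketch is what I have in mind (the column names, example values, and file name are just placeholders):

import pandas as pd

# Placeholder lists standing in for the data the scraper should collect
emp_title_list = ['example title 1', 'example title 2']
emp_id_list = ['OFFER-ID-1', 'OFFER-ID-2']

# Build a DataFrame and write it to a CSV file
df = pd.DataFrame({'title': emp_title_list, 'offer_id': emp_id_list})
df.to_csv('iefp_offers.csv', index=False, encoding='utf-8')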
So far I have tried to get the title from each job card on the first page (10 cards), but the code returns either an empty list or an error message.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd
#Instantiate the webdriver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
#Define url
url = 'https://iefponline.iefp.pt/IEFP/pesquisas/search.do'
# load the web page
driver.get(url)
# maximum time to wait when locating elements, in seconds
driver.implicitly_wait(15)
# collect the data inside the main results block
contents = driver.find_element(By.ID, 'resultados-pesquisa')
# find all job cards (their full class name on the page is 'offer-card horizontal')
emp_offers = contents.find_elements(By.CLASS_NAME, 'offer-card')
emp_title_list = []
emp_id_list = []
for emp_offer in emp_offers:
    offer_title = emp_offer.get_attribute('title')
    emp_title_list.append(offer_title)
    offer_id = emp_offer.find_element(By.XPATH, './/div[contains(@class, "offer-code")]/span[2]').text
    emp_id_list.append(offer_id)
print(emp_title_list)
print(emp_id_list)
# Close the WebDriver
driver.quit()
['', '', '', '', '', '', '', '', '', '']
[None, None, None, None, None, None, None, None, None, None]
or
"DevTools listening on ws://127.0.0.1:65003/devtools/browser/327b17e5-a97d-4d84-9ae0-c1c03122286a
Traceback (most recent call last):
  File "c:\Users\dbelt\Documents\scrape\selenium_iefp.py", line 36, in <module>
    offer_id = emp_offer.find_element(By.XPATH, './/div[contains(@class, "offer-code")]/span[2]').text
  File "C:\Users\dbelt\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py", line 416, in find_element
    return self._execute(Command.FIND_CHILD_ELEMENT, {"using": by, "value": value})["value"]
  File "C:\Users\dbelt\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py", line 394, in _execute
    return self._parent.execute(command, params)
  File "C:\Users\dbelt\anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 344, in execute
    self.error_handler.check_response(response)
  File "C:\Users\dbelt\anaconda3\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 229, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":".//div[contains(@class, "offer-code")]/span[2]"}
(Session info: chrome=117.0.5938.134); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
The result depends on whether I try to get the information with get_attribute or with XPath.
I also noticed that when I copy the XPath from this site, the path is extremely long compared with other websites.
Lastly, many class names have spaces in them, and I don't know the best way to use find_elements with this kind of class name.
It appears that emp_title_list ends up as a list of empty strings because the element you are selecting by its class has no "title" attribute; you need to locate the element that actually carries the title inside the card with that class. Regarding the XPath issue, it seems that your XPath expression does not match the actual structure of the page.
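For example, here is a minimal sketch of that idea, reusing emp_offers, emp_title_list and emp_id_list from your code; the nested tag names and classes below are assumptions and need to be checked against the card's real HTML:

for emp_offer in emp_offers:
    # Hypothetical: assume the visible title lives in a nested <a> element of the card;
    # inspect the card's HTML and adjust this selector to the real structure.
    title_element = emp_offer.find_element(By.CSS_SELECTOR, 'a')
    emp_title_list.append(title_element.get_attribute('title') or title_element.text)
    # Guard the offer-code lookup with find_elements so a missing element
    # does not raise NoSuchElementException for cards laid out differently.
    code_spans = emp_offer.find_elements(By.CSS_SELECTOR, 'div.offer-code span')
    emp_id_list.append(code_spans[-1].text if code_spans else None)

Using find_elements here returns an empty list instead of raising, which makes it easier to spot which cards deviate from the expected layout.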
If you want to locate elements using their class attribute and the elements have multiple class names separated by spaces, replace the spaces with dots (.) in the value you pass to By.CLASS_NAME (or, equivalently, use By.CSS_SELECTOR with a dot-separated selector such as '.offer-card.horizontal'). Here's the corrected code: