My task: I need to create a driver for each page, each with its own proxy, and parse those pages in parallel across the drivers. Then, after collecting the links from the pages, I need to destroy all of the drivers and create new ones, again each with its own unique proxy, so that they can parse products from the different pages in parallel.
Problem: Every driver that gets created ends up with the same IP address, even though each one is given a different proxy from the pool.
Code:
import concurrent.futures
import sys
import time

from selenium.webdriver import ActionChains
from selenium.webdriver.common.keys import Keys
from seleniumbase import Driver

sys.argv.append("-n")  # signal SeleniumBase that drivers run in parallel threads

pages = 3
pool_proxies_for_pages = ['proxy0', 'proxy1', 'proxy2']
pool_proxies_for_products = ['proxy5', 'proxy6', 'proxy7']

def create_undetected_webdriver(proxy):
    driver = Driver(
        uc=True,
        proxy=proxy,
        agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
    )
    return driver

def parsing_products(page, list_for_pool, links):
    driver = create_undetected_webdriver(list_for_pool[page])
    # go through the products
    for link in links:
        driver.get(link)
    driver.quit()

def parsing_pages(page, proxy):
    driver = create_undetected_webdriver(proxy)
    url = f'https://www.ebay.com/e/_electronics/shop-all-ebay-refurbished-cell-phones?_pgn={page}'
    driver.get(url)
    # scroll down so all product links load
    ActionChains(driver).send_keys(Keys.END).perform()
    time.sleep(7)
    # collect the product links
    elements = driver.find_elements("xpath", "//a[@tabindex='-1']")
    links = []
    for i in elements:
        link = i.get_attribute("href")
        links.append(link)
    # destroy this driver
    driver.quit()
    # parse the products with a fresh driver on a different proxy
    parsing_products(page, pool_proxies_for_products, links)

with concurrent.futures.ThreadPoolExecutor(max_workers=pages) as executor:
    for page in range(pages):
        executor.submit(parsing_pages, page, pool_proxies_for_pages[page])
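One quick way to confirm the symptom is to load a plain-text IP-echo service in each driver; for example, with a small helper like this (api.ipify.org is just one such service, and check_ip is a hypothetical name):

def check_ip(driver):
    # Prints the public IP address that this driver's proxy exposes.
    driver.get("https://api.ipify.org/")
    print(driver.find_element("tag name", "body").text)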
I'd be glad to hear any suggestions.
Proxy-with-auth with SeleniumBase uses the solution from https://stackoverflow.com/a/35293284/7058266: essentially, a zip file is created that contains the proxy credentials to be used for proxying, and SeleniumBase then loads that zip file into Chrome as an extension. The default setting assumes that only a single proxy is used, so if the zip file already exists it gets overwritten, which saves space/memory. That's why all of your drivers end up with the same IP: each new driver overwrites the zip file, and concurrently launched drivers all load whichever credentials were written last. If you need multiple simultaneous proxies, there's an arg that you need to set: multi_proxy=True, which creates a uniquely-named zip file for each test that uses a proxy. Here's a sample script that uses that (using the pytest format):
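(A minimal sketch, not a verbatim script: it assumes placeholder proxy strings, the parameterized library for per-test parameters, and BaseCase.get_new_driver() with its undetectable, proxy, and multi_proxy args.)

from parameterized import parameterized
from seleniumbase import BaseCase
BaseCase.main(__name__, __file__)  # or run with: pytest -n3 this_file.py

# Placeholder proxies -- swap in real "user:pass@host:port" strings.
proxy_list = [["proxy0"], ["proxy1"], ["proxy2"]]

class MultiProxyTests(BaseCase):
    @parameterized.expand(proxy_list)
    def test_own_proxy_per_test(self, proxy_string):
        # multi_proxy=True keeps each proxy-auth extension zip
        # uniquely named, so parallel tests don't overwrite it.
        self.get_new_driver(
            undetectable=True, proxy=proxy_string, multi_proxy=True
        )
        self.open("https://api.ipify.org/")
        print(self.get_text("body"))  # the IP seen through this proxy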
If you're not using pytest multithreading via pytest-xdist, then you should see https://github.com/seleniumbase/SeleniumBase/issues/2478#issuecomment-1981699298 for preventing thread resource conflicts.
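Applied to the Driver-based code in the question, the fix reduces to adding that one arg (a sketch; everything else stays the same):

import sys
from seleniumbase import Driver

sys.argv.append("-n")  # thread-safety hint when running outside of pytest

def create_undetected_webdriver(proxy):
    # multi_proxy=True stops parallel drivers from overwriting
    # each other's proxy-auth extension zip files.
    return Driver(uc=True, proxy=proxy, multi_proxy=True)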