I'm trying to scrape a website through AWS Lambda. The page itself loads fine, and Chromium and Puppeteer start without problems.
However, content that should be loaded afterwards by a third-party JS script never appears. For example, a plugin example.js renders advertisements into <div class="advertisments"></div>, which already lives in the page content, and the rendering is triggered by typing into an input. Let's say the behaviour is:
- The user types 2 letters, "re", into the input.
- example.js loads certain ads into the .advertisments div.
For some reason my Chromium either disallows this or just ignores it. My workflow is:
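// Small helper: resolves after the given number of milliseconds
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
const page = await pupBrowser.newPage()
// goto needs a full URL including the protocol, and must be awaited
await page.goto('https://example.com', { waitUntil: 'networkidle2' })
const input = await page.$('.someInput')
await input.type('re')
await delay(2000)
// Puppeteer has no waitForElement; waitForSelector is the matching call
await page.waitForSelector('.advertisments')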
For some reason the .advertisments div stays empty with Puppeteer and Chromium on Lambda, while in local Chrome it works.
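Since the .advertisments div is already part of the page, waiting for the selector alone resolves immediately and cannot tell "still empty" from "rendered". One way to make that distinction explicit is to wait until the container has child elements. This is a minimal sketch: the helper name waitForRenderedChildren and the 10 second default timeout are hypothetical, and it assumes the ads are inserted as child elements of the div.

```javascript
// Sketch only: `page` is a Puppeteer Page; the helper name is hypothetical.
async function waitForRenderedChildren(page, selector, timeoutMs = 10000) {
  // The callback runs inside the browser and is polled until it returns true,
  // i.e. until the container exists and holds at least one child element.
  await page.waitForFunction(
    (sel) => {
      const el = document.querySelector(sel);
      return el !== null && el.children.length > 0;
    },
    { timeout: timeoutMs },
    selector
  );
}
```

Used as `await waitForRenderedChildren(page, '.advertisments')` after typing into the input, it rejects with a timeout error when nothing is rendered, which makes the failure visible instead of silently returning an empty div.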
My arguments are:
[
'--allow-running-insecure-content',
'--autoplay-policy=user-gesture-required',
'--disable-background-timer-throttling',
'--disable-component-update',
'--disable-domain-reliability',
'--disable-features=AudioServiceOutOfProcess,IsolateOrigins,site-per-process',
'--disable-ipc-flooding-protection',
'--disable-print-preview',
'--disable-dev-shm-usage',
'--disable-setuid-sandbox',
'--disable-site-isolation-trials',
'--disable-speech-api',
'--disable-web-security',
'--disk-cache-size=33554432',
'--enable-features=SharedArrayBuffer',
'--hide-scrollbars',
'--ignore-gpu-blocklist',
'--in-process-gpu',
'--mute-audio',
'--no-default-browser-check',
'--no-first-run',
'--no-pings',
'--no-sandbox',
'--no-zygote',
'--use-gl=angle',
'--use-angle=swiftshader',
'--window-size=1920,1080',
'--start-maximized'
]
As the commenters suggested, I came to a similar conclusion.
Run from my local PC, the page behaves differently than on Lambda. As I found out later, some websites simply block traffic from bots, and they treat Lambda IP addresses as such.
Simply put, if someone else runs into similar trouble: use the Puppeteer stealth plugin, otherwise analyze the website.
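The stealth plugin is published as puppeteer-extra-plugin-stealth and is used through the puppeteer-extra wrapper. Below is a minimal sketch, assuming both packages are bundled with the Lambda function. The helper name launchStealthBrowser is hypothetical, and on Lambda the Chromium executablePath and the argument list would have to be passed in through launchOptions. The plugin hides some headless fingerprints but does not change the IP address, so it may not help against IP-based blocking.

```javascript
// Sketch only: assumes puppeteer-extra and puppeteer-extra-plugin-stealth
// are installed; the helper name is hypothetical.
async function launchStealthBrowser(launchOptions = {}) {
  // Required lazily so the module still loads where the packages are absent.
  const puppeteer = require('puppeteer-extra');
  const StealthPlugin = require('puppeteer-extra-plugin-stealth');

  // Register the plugin before launching; it applies to all new pages.
  puppeteer.use(StealthPlugin());

  // launchOptions is forwarded unchanged (e.g. { args, executablePath, headless }).
  return puppeteer.launch(launchOptions);
}
```

The returned browser can replace pupBrowser in the workflow from the question.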
Since the website I'm working with is hosted on Shopify, it seems that there is no simple workaround.
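For the "analyze the website" part, one option is to log what the page does differently on Lambda: failed requests, HTTP error responses, and console output. This is a rough sketch, assuming a Puppeteer Page object. The helper name attachDiagnostics and the log format are hypothetical.

```javascript
// Sketch only: `page` is a Puppeteer Page; the helper name is hypothetical.
function attachDiagnostics(page, log = console.log) {
  // Requests that never completed (blocked, aborted, DNS errors, ...).
  page.on('requestfailed', (request) => {
    const failure = request.failure();
    const reason = failure ? failure.errorText : 'unknown';
    log(`FAILED ${request.url()} ${reason}`);
  });

  // Responses with an error status, e.g. a 403 returned to suspected bots.
  page.on('response', (response) => {
    if (response.status() >= 400) {
      log(`HTTP ${response.status()} ${response.url()}`);
    }
  });

  // Messages the page and its third-party scripts write to the console.
  page.on('console', (msg) => log(`CONSOLE ${msg.type()} ${msg.text()}`));

  // Uncaught exceptions thrown inside the page.
  page.on('pageerror', (err) => log(`PAGEERROR ${err.message}`));
}
```

Calling `attachDiagnostics(page)` right after `newPage()` and comparing the Lambda logs with a local run should show whether example.js or its ad requests are being rejected.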