How can I get the images that appear after searching on Google?


I'm running a Google search from Python, and I want to collect the photos of the products that appear in the results.

import requests
from parsel import Selector

params = {
    "q": "tutku migros",
    "hl": "tr",     # language
    "gl": "tr",     # country of the search, US -> USA
    #"tbm": "shop"   # google search shopping tab
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
selector = Selector(text=html.text)
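# select the result blocks by their CSS class
results = selector.css(".LicuJb")
# collect the src attribute of every <img> inside those blocks
a = results.css("img::attr(src)").getall()
print(a)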

This is the output I got:

['data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==', 'data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==', 'data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==', 'data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==', 'data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==']

I expected to get the product photos shown in the search results. What I got is different, and every entry is identical.

rpm On

We can save the raw HTML and take a look at it to get a better idea of what's going on here. I added this to the end of your script:

# save html to file (explicit UTF-8 so non-ASCII text is written reliably)
with open('out.html', 'w', encoding='utf-8') as f:
    f.write(html.text)

If we open out.html and search for the LicuJb class, we see that parsel is returning exactly what the HTML contains, so the scraping code itself is working. Then why do we see different images when we open that page in a web browser? The page runs some JavaScript, which eventually replaces the placeholder image sources with real image data. Python's requests library only fetches the static HTML, so the JavaScript never runs and the placeholders never get replaced. This article explains the issue a little more.
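
As a quick check that the returned values really are placeholders, the sketch below decodes the data URI from the question's output and reads the image dimensions from the GIF header. The helper name `gif_size` is made up for this illustration; it assumes the input is a base64-encoded GIF data URI.

```python
import base64
import struct

# The value that appeared five times in the question's output.
PLACEHOLDER = (
    "data:image/gif;base64,"
    "R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="
)


def gif_size(data_uri):
    """Return (width, height) of a base64-encoded GIF data URI.

    Hypothetical helper for illustration only.
    """
    # Everything after the first comma is the base64 payload.
    _, _, payload = data_uri.partition(",")
    raw = base64.b64decode(payload)
    if raw[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF image")
    # Bytes 6-9 of a GIF hold width and height as little-endian 16-bit ints.
    width, height = struct.unpack("<HH", raw[6:10])
    return width, height


print(gif_size(PLACEHOLDER))  # (1, 1)
```

A 1x1 pixel image is a typical stand-in that a page swaps out once its scripts have run, which matches the behaviour described above.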

The solution is to use a Python library that lets the JavaScript run, such as Selenium. Rather than just fetching the static HTML, Selenium drives a real web browser, so it can execute the JavaScript of dynamic web pages. (This also means it takes much longer.) Here's how you might get the images you're looking for using Selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By
from urllib import parse

options = webdriver.ChromeOptions()
options.add_argument('--headless') # don't display browser window
with webdriver.Chrome(options=options) as driver:
    params = {
        "q": "tutku migros",
        "hl": "tr",     # language
        "gl": "tr",     # country of the search, US -> USA
        #"tbm": "shop"   # google search shopping tab
    }
    url = f'https://www.google.com/search?{parse.urlencode(params)}'

    driver.get(url)
    # find all elements whose class name starts with (^=) LicuJb
    results = driver.find_elements(By.CSS_SELECTOR, "div[class^='LicuJb']")

    image_data = []
    for result in results:
        image = result.find_element(By.TAG_NAME, 'img')
        src = image.get_attribute('src')
        image_data.append(src)

print(image_data)
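
The src values collected this way may be ordinary URLs or data: URIs with the image embedded in them; which one you get depends on the page. If you want to save the embedded ones as files, a minimal sketch could look like the following. The function name `decode_data_uri` is hypothetical, and the sketch handles only the common `data:<mime>[;base64],<payload>` form.

```python
import base64
from urllib.parse import unquote_to_bytes


def decode_data_uri(src):
    """Return (mime_type, raw_bytes) for a data: URI, or None otherwise.

    Hypothetical helper for illustration only.
    """
    if not src or not src.startswith("data:"):
        return None  # a regular URL; download it separately instead
    header, _, payload = src[len("data:"):].partition(",")
    mime_type = header.split(";")[0] or "text/plain"
    if header.endswith(";base64"):
        return mime_type, base64.b64decode(payload)
    # Non-base64 data URIs carry percent-encoded bytes.
    return mime_type, unquote_to_bytes(payload)


# Example with the placeholder value from the question.
sample = (
    "data:image/gif;base64,"
    "R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="
)
mime_type, data = decode_data_uri(sample)
print(mime_type, len(data))
```

The returned bytes can then be written with `open(path, 'wb')`, using the MIME type to pick a file extension.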