Trying to scrape an image using Nokogiri but it returns a link that I was not expecting


I'm doing a scraping exercise and trying to scrape the poster from a website using Nokogiri.

This is the link that I want to get: https://a.ltrbxd.com/resized/film-poster/5/8/6/7/2/3/586723-glass-onion-a-knives-out-mystery-0-460-0-690-crop.jpg?v=ce7ed2a83f

But instead I got this: https://s.ltrbxd.com/static/img/empty-poster-500.825678f0.png

Why?

This is what I tried:

require 'nokogiri'
require 'open-uri'

url = "https://letterboxd.com/film/glass-onion-a-knives-out-mystery/"
serialized_html = URI.open(url).read

html = Nokogiri::HTML.parse(serialized_html)

title = html.search('.headline-1').text.strip
overview = html.search('.truncate p').text.strip
poster = html.search('.film-poster img').attribute('src').value

{
  title: title,
  overview: overview,
  poster_url: poster,
}


Answer by javiyu (best answer)

It has nothing to do with your Ruby code.

If you run something like this in your terminal:

curl https://letterboxd.com/film/glass-onion-a-knives-out-mystery/ 

You can see that the returned HTML does not contain the image you are looking for. You can see it in your browser because, after the initial load, some JavaScript runs and loads more resources. Nokogiri, like curl, only sees the initial HTML, so it finds the placeholder image instead.

The AJAX call that loads the image you are looking for is: https://letterboxd.com/ajax/poster/film/glass-onion-a-knives-out-mystery/std/500x750/?k=0c10a16c

Play with the network inspector of your browser and you will be able to identify the different parts of the website and how each one loads.

Answer by Jan Vítek

Nokogiri does not execute JavaScript. However, the link has to be somewhere in the page, or at least there has to be a link to some API that returns it.

The first place I would look is the data attributes of the image element or its parent. In this case, however, the link was hidden in an inline script, along with some other interesting data about the movie.

First, download the web page using curl or wget and open the file in a text editor to see what Nokogiri sees. Then search for something you know about the value you want. I searched for ce7ed2a83f, part of the image URL, and found the JSON.
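The same search can be done from Ruby instead of a text editor. This is a minimal sketch using only the standard library; the helper name find_with_context and the sample string are made up for illustration, and in practice you would read the file saved by curl.

```ruby
# Return each occurrence of `needle` in `html` together with some surrounding
# text, so you can see which tag or script the value lives in.
def find_with_context(html, needle, radius: 60)
  return [] if needle.empty?

  hits = []
  offset = 0
  while (index = html.index(needle, offset))
    start = [index - radius, 0].max
    hits << html[start, (index - start) + needle.length + radius]
    offset = index + needle.length
  end
  hits
end

# Hypothetical stand-in for the downloaded page,
# e.g. the result of File.read('page.html').
sample = '<script type="application/ld+json">' \
         '{"image":"https://example.com/poster.jpg?v=ce7ed2a83f"}</script>'

find_with_context(sample, 'ce7ed2a83f', radius: 30).each { |hit| puts hit }
```

No hits means the value is not in the static HTML at all and is fetched later by a separate request.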

Then the data can be extracted like this:

require 'nokogiri'
require 'open-uri'
require 'json'

url = "https://letterboxd.com/film/glass-onion-a-knives-out-mystery/"
serialized_html = URI.open(url).read
html = Nokogiri::HTML.parse(serialized_html)

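# The JSON-LD script holds structured data about the film, including the poster URL.
script = html.at_css('script[type="application/ld+json"]')

# Keep only the JSON object, dropping anything wrapped around it in the script body.
data_str = script.text.gsub("\n", '')[/\{.*\}/]
data = JSON.parse(data_str)
puts data['image']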
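The extraction above raises an error if the script tag is missing or its body is not valid JSON. A slightly more defensive variant might look like the sketch below. The helper name parse_json_ld is hypothetical, and the wrapped sample assumes the JSON is surrounded by a CDATA-style comment, which is a guess at why the regular expression is needed in the first place.

```ruby
require 'json'

# Parse the text of a JSON-LD <script> tag, tolerating comment or CDATA
# wrappers around the JSON object. Returns nil when no valid object is found.
def parse_json_ld(script_text)
  json_text = script_text.to_s[/\{.*\}/m]
  return nil if json_text.nil?

  JSON.parse(json_text)
rescue JSON::ParserError
  nil
end

# Hypothetical script body; the wrapper comments are an assumption.
wrapped = "/* <![CDATA[ */\n" \
          "{\"image\":\"https://example.com/poster.jpg\",\"name\":\"Example\"}\n" \
          "/* ]]> */"

data = parse_json_ld(wrapped)
puts data['image'] if data
```

With Nokogiri, the script text could be passed in as html.at_css('script[type="application/ld+json"]')&.text, so a missing tag simply yields nil.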