Find Favicons in HTML using Perl


I'm trying to look for favicons (and variants) for a given URL using Perl (I'd like to avoid using an external service such as Google's favicon finder). There's a CPAN module, WWW::Favicon, but it hasn't been updated in over a decade -- a decade in which now important variants such as "apple-touch-icon" have come to replace the venerable "ico" file.

I thought I had found the solution in WWW::Mechanize, since it can list all of the links on a given page, including <link> tags in the document head. However, I cannot find a clean way to use the "find_link" method to search by the rel attribute.

For example, I tried using 'rel' as the search term, hoping maybe it was in there despite not being mentioned in the documentation, but it doesn't work. This code returns an error about an invalid "link-finding parameter."

my $results = $mech->find_link( 'rel' => "apple-touch-icon" );
use Data::Dumper;
say STDERR Dumper $results;

I also tried using other link-finding parameters, but none of them seem to be suited to searching out a rel attribute.

The only way I could figure out how to do it is by iterating through all links and looking for a rel attribute like this:

my $results = $mech->find_all_links();

foreach my $result (@{ $results }) {
    my $attrs = $result->attrs();

    # Skip links that have no rel attribute (e.g. most <a> tags).
    next unless defined $attrs->{'rel'};

    if ($attrs->{'rel'} =~ /^apple-touch-icon/) {
        say STDERR "I found it: " . $result->url();
    }

    # Add tests for other types of icons here.
    # E.g. "mask-icon" and "shortcut icon".
}

That works, but it seems messy. Is there a better way?


There are 2 best solutions below

brian d foy On BEST ANSWER

Here's how I'd do it with Mojo::DOM. Once you fetch an HTML page, use dom to do all the parsing. From that, use a CSS selector to find the interesting nodes:

link[rel*=icon i][href]

This CSS selector looks for link tags that have both the rel and href attributes. Additionally, I require that the value in rel contain (*=) "icon", case insensitively (the i). If you want to assume that all nodes will have the href, just leave off [href].

Once I have the list of links, I extract just the value in href and turn that list into an array reference (although I could do the rest with Mojo::Collection methods):

use v5.10;

use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new->max_redirects(3);

my $results = $ua->get( shift )
    ->result
    ->dom
    ->find( 'link[rel*=icon i][href]' )
    ->map( attr => 'href' )
    ->to_array
    ;

say join "\n", @$results;

That works pretty well so far:

$ perl mojo.pl https://www.perl.org
https://cdn.perl.org/perlweb/favicon.ico

$ perl mojo.pl https://www.microsoft.com
https://c.s-microsoft.com/favicon.ico?v2

$ perl mojo.pl https://leanpub.com/mojo_web_clients
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-57x57-b83f183ad6b00aa74d8e692126c7017e.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-60x60-6dc1c10b7145a2f1156af5b798565268.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-72x72-5037b667b6f7a8d5ba8c4ffb4a62ec2d.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-76x76-57860ca8a817754d2861e8d0ef943b23.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-114x114-27f9c42684f2a77945643b35b28df6e3.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-120x120-3819f03d1bad1584719af0212396a6fc.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-144x144-a79479b4595dc7ca2f3e6f5b962d16fd.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-152x152-aafe015ef1c22234133158a89b29daf5.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-16x16-c1207cd2f3a20fd50de0e585b4b307a3.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-32x32-e9b1d6ef3d96ed8918c54316cdea011f.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-96x96-842fcd3e7786576fc20d38bbf94837fc.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-128x128-e97066b91cc21b104c63bc7530ff819f.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-196x196-b8cab44cf725c4fa0aafdbd237cdc4ed.png

Now, the problem comes if you find more interesting cases that you can't easily write a single selector for. Suppose not all of the rel values have "icon" in them. You can get a little more fancy by specifying multiple selectors separated by commas, which also means you don't have to use the experimental case insensitivity flag:

link[rel*=icon][href], link[rel*=ICON][href]

or different values in rel:

link[rel="shortcut icon"][href], link[rel="apple-touch-icon-precomposed"][href]

Line up as many of those as you like.
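
As a minimal sketch of the comma-separated form, here is the same idea run against an inline HTML string instead of a fetched page. The sample markup and its file names are made up for illustration; only Mojo::DOM's find, map, and to_array are used, as in the code above.

```perl
use strict;
use warnings;
use feature 'say';

use Mojo::DOM;

# Hypothetical sample markup, standing in for a fetched page.
my $html = <<'HTML';
<html><head>
<link rel="stylesheet" href="/site.css">
<link rel="shortcut icon" href="/favicon.ico">
<link rel="apple-touch-icon-precomposed" href="/touch.png">
<link rel="mask-icon" href="/mask.svg">
</head><body></body></html>
HTML

# A comma-separated selector list matches a node if any one selector matches.
my $selector = join ', ',
    'link[rel="shortcut icon"][href]',
    'link[rel="apple-touch-icon-precomposed"][href]';

my $hrefs = Mojo::DOM->new($html)
    ->find($selector)
    ->map( attr => 'href' )
    ->to_array;

say for @$hrefs;
```

Because these selectors compare the whole rel value exactly, the mask-icon and stylesheet links are skipped; add another selector to the list if you want them.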

But, you could also filter your results without the selectors. Use Mojo::Collection's grep to pick out the nodes that you want:

my %Interesting = ...;
my $results = $ua->get( shift )
    ->result
    ->dom
    ->find( '...' )
    ->grep( sub { exists $Interesting{ $_->attr('rel') } } )
    ->map( attr => 'href' )
    ->to_array
    ;
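
To make the grep idea concrete, here is one possible filled-in version that runs against an inline HTML string. The contents of %Interesting, the 'link[href]' selector, and the sample markup are assumptions chosen for illustration, not the values elided above. It also lowercases rel and tolerates a missing rel, which the short form does not.

```perl
use strict;
use warnings;
use feature 'say';

use Mojo::DOM;

# Hypothetical set of rel values to keep; adjust to taste.
my %Interesting = map { $_ => 1 } (
    'icon', 'shortcut icon', 'apple-touch-icon', 'mask-icon',
);

# Hypothetical sample markup, standing in for a fetched page.
my $html = <<'HTML';
<link rel="stylesheet" href="/site.css">
<link rel="Shortcut Icon" href="/favicon.ico">
<link rel="apple-touch-icon" href="/touch.png">
<link rel="icon">
<link href="/no-rel.png">
HTML

my $hrefs = Mojo::DOM->new($html)
    ->find('link[href]')    # broad selector; the filtering happens in grep
    ->grep( sub { exists $Interesting{ lc( $_->attr('rel') // '' ) } } )
    ->map( attr => 'href' )
    ->to_array;

say for @$hrefs;
```

The hash lookup keeps the list of accepted rel values in one place, which may be easier to maintain than a long selector list.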

I have a lot more examples of Mojo::DOM in Mojo Web Clients, and I think I'll go add this example now.

Polar Bear On

The problem is very easy to solve with:

  • load the web page with any HTTP client module
  • define a $regex covering the favicon variations you care about
  • look for <link rel="$regex" href="icon_address" ...>

Note: the script falls back to a default YouTube URL when no argument is given.

use strict;
use warnings;
use feature 'say';

use HTTP::Tiny;

my $url = shift || 'https://www.youtube.com/';

my $icons = get_favicon($url);

say for @{$icons};

sub get_favicon {
    my $url = shift;
    
    my @lookup = (
                    'shortcut icon',
                    'apple-touch-icon',
                    'image_src',
                    'icon',
                    'alternative icon'
                );
                
    my $re      = join('|',@lookup);
    my $html    = load_page($url);
    # Matches only when rel comes immediately before href, both double-quoted.
    my @icons   = ($html =~ /<link rel="(?:$re)" href="(.*?)"/gmsi);
    
    return \@icons;
}

sub load_page {
    my $url = shift;
    
    my $response = HTTP::Tiny->new->get($url);
    my $html;

    if ($response->{success}) {
        $html = $response->{content};
    } else {
        say 'ERROR:  Could not extract webpage';
        say 'Status: ' . $response->{status};
        say 'Reason: ' . $response->{reason};
        exit 1;
    }

    return $html;
}

Run as script.pl

https://www.youtube.com/s/desktop/8259e7c9/img/favicon.ico
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_32.png
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_48.png
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_96.png
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_144.png
https://www.youtube.com/img/desktop/yt_1200.png

Run as script.pl "http://www.microsoft.com/"

https://c.s-microsoft.com/favicon.ico?v2

Run as script.pl "http://finance.yahoo.com/"

https://s.yimg.com/cv/apiv2/default/icons/favicon_y19_32x32_custom.svg
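
The pattern in get_favicon expects rel to come immediately before href, with both in double quotes. As a hedged sketch of how that could be relaxed while staying with core Perl and regexes, the hypothetical helper below (extract_icon_hrefs is not part of the script above) first captures each <link> tag and then pulls rel and href out separately. It is still not a real HTML parser, and it will miss unquoted attribute values or values containing ">".

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: return href values from <link> tags whose rel
# mentions "icon" or "image_src", regardless of attribute order or quote style.
sub extract_icon_hrefs {
    my ($html) = @_;
    my @hrefs;

    # Walk every <link ...> tag, then inspect its attributes separately.
    while ( $html =~ /<link\b([^>]*)>/gi ) {
        my $attrs = $1;

        # Branch reset (?|...) lets either quote style land in capture group 1.
        my ($rel)  = $attrs =~ /(?:^|\s)rel\s*=\s*(?|"([^"]*)"|'([^']*)')/i;
        my ($href) = $attrs =~ /(?:^|\s)href\s*=\s*(?|"([^"]*)"|'([^']*)')/i;

        next unless defined $rel && defined $href;
        push @hrefs, $href if $rel =~ /\b(?:icon|image_src)\b/i;
    }

    return @hrefs;
}

# Hypothetical sample markup, standing in for a fetched page.
my $html = <<'HTML';
<link href="/a.ico" rel="icon" type="image/x-icon">
<link rel='apple-touch-icon' sizes="180x180" href='/touch.png'>
<link rel="stylesheet" href="/site.css">
HTML

say for extract_icon_hrefs($html);
```

The returned href values may be relative, so they would still need to be resolved against the page URL before fetching.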