Web Scraping : Yahoo provides dirtyurl instead of normal url

1k Views Asked by T90 At 14 November 2014 at 16:14

I'm using mechanize to get the top results from a Yahoo search and scrape data from them, but Yahoo provides only dirty URLs, which cause errors in further processing. Is there any way to obtain the original link?

Example: for the result stackoverflow.com, I get the following tag:

<a dirtyhref="http://r.search.yahoo.com/_ylt=A0SO8zEuKGZUteYAEHRXNyoA;_ylu=X3oDMTEzODh2cDk0BHNlYwNzcgRwb3MDMQRjb2xvA2dxMQR2dGlkA1ZJUDI0NF8x/RV=2/RE=1416009903/RO=10/RU=http%3a%2f%2fstackoverflow.com%2f/RK=0/RS=a.mWRIy6IMjJQysgixByd8053hE-" id="link-1" class="yschttl spt" href="http://r.search.yahoo.com/_ylt=A0SO8zEuKGZUteYAEHRXNyoA;_ylu=X3oDMTEzODh2cDk0BHNlYwNzcgRwb3MDMQRjb2xvA2dxMQR2dGlkA1ZJUDI0NF8x/RV=2/RE=1416009903/RO=10/RU=http%3a%2f%2fstackoverflow.com%2f/RK=0/RS=a.mWRIy6IMjJQysgixByd8053hE-" target="_blank" data-bk="5054.1"> <b>Stack Overflow</b> - Official Site </a>

So here http://r.search.yahoo.com/_ylt=A0SO8zEuKGZUteYAEHRXNyoA;_ylu=X3oDMTEzODh2cDk0BHNlYwNzcgRwb3MDMQRjb2xvA2dxMQR2dGlkA1ZJUDI0NF8x/RV=2/RE=1416009903/RO=10/RU=http%3a%2f%2fstackoverflow.com%2f/RK=0/RS=a.mWRIy6IMjJQysgixByd8053hE-

represents http://stackoverflow.com


Mikk On 14 November 2014 at 22:14 BEST ANSWER

Assuming that you can easily isolate the content of dirtyhref (you can use BeautifulSoup to parse the link, http://www.crummy.com/software/BeautifulSoup/bs4/doc/), you can use the urlparse module to get only the path (https://docs.python.org/2/library/urlparse.html#urlparse.urlparse). You will then have it in a string like:

dirty_href = "/_ylt=A0SO8zEuKGZUteYAEHRXNyoA;_ylu=X3oDMTEzODh2cDk0BHNlYwNzcgRwb3MDMQRjb2xvA2dxMQR2dGlkA1ZJUDI0NF8x/RV=2/RE=1416009903/RO=10/RU=http%3a%2f%2fstackoverflow.com%2f/RK=0/RS=a.mWRIy6IMjJQysgixByd8053hE-"

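It looks to me as though the fields are separated by /, so you can split on it. Because the path begins with /, the first element of the result is an empty string: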

fields = dirty_href.split('/')

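Assuming that the field you are interested in is always the sixth, take everything after the first = (splitting only once keeps the value intact even if it contains another =):

dirty_url = fields[5].split('=', 1)[1]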

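Finally, you can use unquote from the urllib module (https://docs.python.org/2/library/urllib.html#urllib.unquote) to decode the percent-escapes; in Python 3 the same function is urllib.parse.unquote:

>>> import urllib
>>> urllib.unquote(dirty_url)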
'http://stackoverflow.com/'

Alternatively, rather than assuming that the URL is always in the sixth field, you can loop over the fields and pick the one that starts with RU=.
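As a rough sketch, the steps above could be put together like this in Python 3, where urlparse and unquote both live in urllib.parse. The function name extract_target_url is just my own choice for illustration, and the whole thing assumes Yahoo keeps the RU= path field seen in the question, which may well change:

```python
from urllib.parse import urlparse, unquote


def extract_target_url(redirect_url):
    """Return the decoded RU= target of a redirect URL, or None if absent.

    Hypothetical helper: it relies on the '/'-separated 'KEY=value' path
    layout shown in the question, not on any documented Yahoo format.
    """
    # Keep only the path part, e.g. '/_ylt=.../RV=2/.../RU=http%3a.../RK=0/...'
    path = urlparse(redirect_url).path
    for field in path.split('/'):
        if field.startswith('RU='):
            # Drop the 'RU=' prefix, then decode the percent-escapes.
            return unquote(field[len('RU='):])
    return None


dirty_href = (
    "http://r.search.yahoo.com/_ylt=A0SO8zEuKGZUteYAEHRXNyoA;"
    "_ylu=X3oDMTEzODh2cDk0BHNlYwNzcgRwb3MDMQRjb2xvA2dxMQR2dGlkA1ZJUDI0NF8x"
    "/RV=2/RE=1416009903/RO=10/RU=http%3a%2f%2fstackoverflow.com%2f"
    "/RK=0/RS=a.mWRIy6IMjJQysgixByd8053hE-"
)
print(extract_target_url(dirty_href))
```

Returning None when no RU= field is found lets the caller decide whether to fall back to the plain href or skip the link.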
