I'm using mechanize to get the top results from Yahoo search and scrape data from them, but Yahoo provides only dirty URLs, which cause errors in further processing. Is there any way to obtain the original link?
Example: for the result stackoverflow.com, I get the following tag:
<a dirtyhref="http://r.search.yahoo.com/_ylt=A0SO8zEuKGZUteYAEHRXNyoA;_ylu=X3oDMTEzODh2cDk0BHNlYwNzcgRwb3MDMQRjb2xvA2dxMQR2dGlkA1ZJUDI0NF8x/RV=2/RE=1416009903/RO=10/RU=http%3a%2f%2fstackoverflow.com%2f/RK=0/RS=a.mWRIy6IMjJQysgixByd8053hE-" id="link-1" class="yschttl spt" href="http://r.search.yahoo.com/_ylt=A0SO8zEuKGZUteYAEHRXNyoA;_ylu=X3oDMTEzODh2cDk0BHNlYwNzcgRwb3MDMQRjb2xvA2dxMQR2dGlkA1ZJUDI0NF8x/RV=2/RE=1416009903/RO=10/RU=http%3a%2f%2fstackoverflow.com%2f/RK=0/RS=a.mWRIy6IMjJQysgixByd8053hE-" target="_blank" data-bk="5054.1"> <b>Stack Overflow</b> - Official Site </a>
which represents http://stackoverflow.com.
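Assuming that you can easily isolate the content of `dirtyhref` (you can use `BeautifulSoup` to parse the link, http://www.crummy.com/software/BeautifulSoup/bs4/doc/), you can use the `urlparse` package to get only the path (https://docs.python.org/2/library/urlparse.html#urlparse.urlparse). That leaves you with the path as a single string, starting with `/_ylt=` and ending with the `RS=` field.

It looks to me like the fields are separated by `/`, so you can split the path on `/` into a list, say `fields`. Assuming that the field you are interested in is always the sixth (index 5, since the leading `/` produces an empty first element), take that element and drop its `RU=` prefix.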
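The steps above can be sketched as follows. This is a minimal sketch, assuming Python 3, where `urlparse` lives in `urllib.parse` (on Python 2 the import would be `from urlparse import urlparse`). The variable names are illustrative, and the URL is the `dirtyhref` value from the question.

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

# The dirtyhref value from the question, split across lines for readability.
dirty = ("http://r.search.yahoo.com/_ylt=A0SO8zEuKGZUteYAEHRXNyoA;"
         "_ylu=X3oDMTEzODh2cDk0BHNlYwNzcgRwb3MDMQRjb2xvA2dxMQR2dGlkA1ZJUDI0NF8x"
         "/RV=2/RE=1416009903/RO=10/RU=http%3a%2f%2fstackoverflow.com%2f"
         "/RK=0/RS=a.mWRIy6IMjJQysgixByd8053hE-")

# Keep only the path component of the redirect URL.
path = urlparse(dirty).path

# Fields are separated by "/"; the leading "/" yields an empty first element.
fields = path.split("/")

# Assumption: the RU= field is always the sixth element (index 5).
ru_field = fields[5]

# Drop the "RU=" prefix; the remainder is still percent-encoded.
encoded_url = ru_field[len("RU="):]
print(encoded_url)
```

At this point `encoded_url` still contains escapes such as `%3a` and `%2f`, so it needs decoding before it can be used as a link.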
Finally, you can use `unquote` from the `urllib` package (https://docs.python.org/2/library/urllib.html#urllib.unquote) to decode the percent-escaped URL.

You can also avoid assuming that the URL will always be in the sixth field by looping over `fields` and checking which one starts with `RU=`.
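Putting the decoding step and the `RU=` search together might look like this. Again a sketch assuming Python 3 (`urllib.parse.unquote`; on Python 2 it would be `urllib.unquote`). The helper name `extract_target` is hypothetical, not part of any library.

```python
from urllib.parse import urlparse, unquote  # Python 2: urlparse.urlparse, urllib.unquote


def extract_target(dirty_url):
    """Return the decoded RU= target of a redirect URL, or None if absent.

    Hypothetical helper: scans every path field instead of assuming
    that the target always sits in the sixth position.
    """
    for field in urlparse(dirty_url).path.split("/"):
        if field.startswith("RU="):
            # Strip the prefix, then decode the percent-escapes.
            return unquote(field[len("RU="):])
    return None


# The dirtyhref value from the question.
dirty = ("http://r.search.yahoo.com/_ylt=A0SO8zEuKGZUteYAEHRXNyoA;"
         "_ylu=X3oDMTEzODh2cDk0BHNlYwNzcgRwb3MDMQRjb2xvA2dxMQR2dGlkA1ZJUDI0NF8x"
         "/RV=2/RE=1416009903/RO=10/RU=http%3a%2f%2fstackoverflow.com%2f"
         "/RK=0/RS=a.mWRIy6IMjJQysgixByd8053hE-")

print(extract_target(dirty))  # http://stackoverflow.com/
```

Note that the decoded result keeps the trailing slash from the encoded value. Returning `None` when no `RU=` field is found lets the caller decide how to handle links that are not redirects.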