python urllib, urllib2: how to get "#" (sharp/fragment) links


Okay, my dear helpers, here is the question: I cannot get ' http://example.com/#sharplink '. By the way, the site was getting into an infinite redirect loop, so I used a redirect handler, and that needs the cookie library enabled.

Here is my code:

import urllib2, urllib, cookielib


# Note: this only changes the User-Agent used by urllib's openers,
# not the urllib2 opener built below.
urllib.FancyURLopener.version = 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.3) Gecko/2008092814 (Debian-3.0.1-1)'

class MyHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        m = req.get_method()
        if (code in (301, 302, 303, 307) and m in ('GET', 'HEAD')
                or code in (301, 302, 303) and m == 'POST'):
            newurl = newurl.replace(' ', '%20')
            # Drop body-related headers when the redirect turns the request into a GET.
            newheaders = dict((k, v) for k, v in req.headers.items()
                              if k.lower() not in ('content-length', 'content-type'))
            return urllib2.Request(newurl,
                                   headers=newheaders,
                                   origin_req_host=req.get_origin_req_host(),
                                   unverifiable=True)
        else:
            raise urllib2.HTTPError(req.get_full_url(), code, msg, headers, fp)


cj = cookielib.CookieJar()

opener = urllib2.build_opener(MyHTTPRedirectHandler, urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

req = urllib2.Request('http://example.com/goto/#sharplink')

response = urllib2.urlopen(req)

f = open('bet', 'wb')
f.write(response.read())
f.close()

But every time I only get the ' http://example.com/goto ' page, never the page behind the '#' link. Please help!

1 Answer

dorian:

The fragment part of a URL ("sharplink") is not sent to the web server; it is conventionally used to point at a specific section of the page the link refers to. So it doesn't matter whether you request http://example.com/goto/ or http://example.com/goto/#sharplink: the server sees exactly the same request.
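You can see this yourself with the standard urlparse module: the fragment is parsed into its own component and never becomes part of the HTTP request line. A minimal illustration (Python 2, using the URL from the question):

import urlparse

url = 'http://example.com/goto/#sharplink'
parts = urlparse.urlsplit(url)

# The fragment lives in its own field, separate from the path.
print parts.path      # '/goto/'
print parts.fragment  # 'sharplink'

# urldefrag() splits the fragment off explicitly, which is effectively
# what happens before the request reaches the server.
print urlparse.urldefrag(url)  # ('http://example.com/goto/', 'sharplink')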

If you expect the pages to be different, then most likely the site uses an AJAX framework which encodes state in the fragment part of the URL. As urllib and friends do not execute JS, you'd need to use something like phantomjs to get the content of the page.
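For instance, here is a rough sketch using Selenium to drive PhantomJS (this assumes you have installed the selenium package and the phantomjs binary; the URL and output filename are simply taken from the question):

from selenium import webdriver

# PhantomJS runs the page's JavaScript, so content loaded based on the
# '#sharplink' fragment ends up in the rendered DOM.
driver = webdriver.PhantomJS()
driver.get('http://example.com/goto/#sharplink')

html = driver.page_source  # the DOM after the scripts have run

with open('bet', 'w') as f:
    f.write(html.encode('utf-8'))

driver.quit()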