Can't access web archived rewritten URLs using javascript


I'm trying to read some href properties with JavaScript on a web-archived page, but the result I get differs from what is actually in the page. It's easier to explain with a concrete example, so take the following web-archived page:

https://web.archive.org/web/20090101161506/http://www.sapo.pt/

There's a "mail" button on the right side.

If you inspect it with the browser you get the following element:

<a 
  onclick="lk('/mail','tools');toolTabSwap('mailBox',false);return(false);" 
  href="https://web.archive.org/web/20090101161506/http://mail.sapo.pt/" 
  id="bmailBox" 
  class="widget_btn"
>
    <div>Mail</div>
</a>

Notice that the href points to web.archive.org.

However, when I read this href value using JavaScript, the console shows something different: the original URL without the web.archive.org prefix.

Now, I know that web archives rewrite the original hrefs to point towards their archived version, and it's clearly what's happening here. What I don't understand is why I can't get the rewritten URL via javascript, and instead it's showing me the href from before it's been rewritten.

What's going on here? How can I get the href to the web-archived version rather than the original link?

Edit:

Something I want to clarify: I'm not looking to follow the URL, I just want to use JS to get the full URL from the web archive, including the web.archive.org domain and the timestamp.

For the example link, I want to find a way to get this:

https://web.archive.org/web/20090101161506/http://mail.sapo.pt/

But all I'm getting is this:

http://mail.sapo.pt/

Answer by jexroid:

When you open the URL (https://web.archive.org/web/20090101161506/http://mail.sapo.pt/), the server responds with a 302 HTTP status code.

Status code 302 is a temporary redirect, which means the https://web.archive.org/web/20090101161506 URL redirects you to http://mail.sapo.pt/. Anything that actually requests the URL (the browser, or JavaScript issuing a request) follows the redirect, reaches http://mail.sapo.pt/, receives a 200 HTTP status code, and that is what you see. The scenario looks like this:

request --> web.archive.org (redirect user to)--> mail.sapo.pt

You might ask why websites use redirection. URL redirection is done for various reasons:

  • to shorten URLs
  • to prevent broken links when web pages are moved
  • to allow multiple domain names belonging to the same owner to refer to a single website
  • to guide navigation into and out of a website

If you want the href exactly as it is written on the element, I suggest reading the attribute with jQuery.

Import jQuery from a CDN or by whatever method you prefer:

  • Add the CDN script tag to the head of your HTML file:
<head>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.7.1/jquery.min.js"></script>
</head>
  • Read the href attribute of the <a> element. This only reads a value from the DOM and does not request the URL, so no redirect is involved:
var hrefValue = $('#bmailBox').attr('href');
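
If the value you read still comes back as the original URL (http://mail.sapo.pt/), one possible workaround is to rebuild the archived URL from the address of the archived page itself. The following is only a minimal sketch, not something from the Wayback Machine's own API: the helper names `parseWaybackUrl` and `toWaybackUrl` are hypothetical, and the URL layout `https://web.archive.org/web/<timestamp>/<original URL>` is assumed from the URLs shown in the question.

```javascript
// Assumed layout, inferred from the URLs in the question:
//   https://web.archive.org/web/<timestamp>/<original URL>
// The optional two-letter modifier after the timestamp (for example "id_")
// is also an assumption about how archived URLs may look.
const WAYBACK_PATTERN =
  /^https?:\/\/web\.archive\.org\/web\/(\d+)(?:[a-z]{2}_)?\/(.+)$/;

// Hypothetical helper: split an archived URL into its timestamp and the
// original URL. Returns null when the URL does not match the assumed layout.
function parseWaybackUrl(url) {
  const match = WAYBACK_PATTERN.exec(url);
  if (!match) {
    return null;
  }
  return { timestamp: match[1], original: match[2] };
}

// Hypothetical helper: given an absolute original href and the address of the
// archived page it was found on, rebuild the archived form of that href.
function toWaybackUrl(originalHref, archivedPageUrl) {
  const page = parseWaybackUrl(archivedPageUrl);
  if (!page) {
    // Not on an archived page (under the assumed layout): nothing to rebuild.
    return originalHref;
  }
  if (parseWaybackUrl(originalHref)) {
    // The href already points at web.archive.org: leave it unchanged.
    return originalHref;
  }
  return `https://web.archive.org/web/${page.timestamp}/${originalHref}`;
}
```

In the browser console on the page from the question, a call could look like `toWaybackUrl(document.getElementById('bmailBox').href, window.location.href)`, assuming `window.location.href` still reports the web.archive.org address there. The sketch reuses the timestamp of the current page, so the result names the same snapshot time as the page you are on.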