download/mirror a website behind Cloudflare for archiving


I'm trying to back up (download/mirror) a website for archival purposes. The site is apparently behind Cloudflare. My usual tool for this would be wget, but it fails on me (even when sending a __cfduid cookie header). Example of a non-working wget command:

wget -U "Mozilla/5.0 (X11; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0" --header="Accept: text/html" --header="Cookie: __cfduid=someverylongcfduid" --mirror --convert-links --adjust-extension --page-requisites --no-parent -w 1m www.domain.tld
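
For what it's worth, I believe __cfduid is only a tracking cookie; the cookie Cloudflare actually sets after a passed browser check is cf_clearance, and it is validated against the User-Agent (and IP) that solved the challenge. A sketch of the same command using that cookie instead, where the value is a placeholder to be copied from a real browser session using the same User-Agent:

wget -U "Mozilla/5.0 (X11; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0" --header="Accept: text/html" --header="Cookie: cf_clearance=PASTE_FROM_BROWSER" --mirror --convert-links --adjust-extension --page-requisites --no-parent -w 1m www.domain.tld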

So I thought I'd return to my trusty friend httrack, but it fails too (even when using exported cookies). Example of a non-working httrack command:

httrack -F "Mozilla/5.0 (X11; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0" --mirror -b1 -s0 -%c1 -c1 --referer "https://www.domain.tld/" "https://www.domain.tld/"

I do not want to pound the website, so limiting connections and waiting is quite OK. I'd rather have it run longer/slower and be a good netizen along the way.

Currently I'm confronted with either 301 (Moved Permanently) or 403 (Forbidden) responses, and I'm assuming this is due to Cloudflare. The site is heavy on JavaScript :-( Does anyone have tips, advice, or a solution for getting such a website archived?

There is 1 answer below.

fireindark707 answered:

I think you should try using Selenium.
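
A minimal sketch of what that could look like, assuming Python with Firefox/geckodriver; the start URL, the one-minute delay, and the file-naming scheme are placeholders, not details from the original question:

import time
from urllib.parse import urlparse
from selenium import webdriver
from selenium.webdriver.common.by import By

# A real (non-headless) browser is more likely to pass Cloudflare's
# JavaScript challenge than a bare HTTP client like wget.
driver = webdriver.Firefox()

start_url = "https://www.domain.tld/"  # placeholder domain
to_visit = [start_url]
seen = set()

while to_visit:
    url = to_visit.pop()
    if url in seen:
        continue
    seen.add(url)
    driver.get(url)
    time.sleep(60)  # throttle to about one page per minute, like wget -w 1m

    # Save the rendered DOM (after JavaScript has run), not the raw response.
    name = urlparse(url).path.strip("/").replace("/", "_") or "index"
    with open(name + ".html", "w", encoding="utf-8") as f:
        f.write(driver.page_source)

    # Queue same-site links for later rounds.
    for a in driver.find_elements(By.CSS_SELECTOR, "a[href]"):
        href = a.get_attribute("href")
        if href and href.startswith(start_url):
            to_visit.append(href)

driver.quit()

Note that this only captures the rendered HTML; page requisites (CSS, images, scripts) would still need to be fetched separately.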