download/mirror a website behind Cloudflare for archiving


I'm trying to back up (download/mirror) a website for archival purposes. The site is apparently behind Cloudflare. My usual tool for this would be wget, but it fails on me (even when sending a __cfduid cookie header). Example of a non-working wget command:

wget -U "Mozilla/5.0 (X11; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0" --header="Accept: text/html" --header="Cookie: __cfduid=someverylongcfduid" --mirror --convert-links --adjust-extension --page-requisites --no-parent -w 1m www.domain.tld
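
For what it's worth, I believe __cfduid is only a tracking cookie; the cookie Cloudflare actually sets after a passed browser check is cf_clearance, and it is validated against the User-Agent (and IP) that solved the challenge. A sketch of the same command using that cookie instead, where the value is a placeholder to be copied from a real browser session using the same User-Agent:

wget -U "Mozilla/5.0 (X11; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0" --header="Accept: text/html" --header="Cookie: cf_clearance=PASTE_FROM_BROWSER" --mirror --convert-links --adjust-extension --page-requisites --no-parent -w 1m www.domain.tld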

So I thought I'd return to my trusty friend httrack, but it fails too (even when using exported cookies). Example of a non-working httrack command:

httrack -F "Mozilla/5.0 (X11; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0" --mirror -b1 -s0 -%c1 -c1 --referer "https://www.domain.tld/" "https://www.domain.tld/"

I do not want to pound the website, so limiting connections and waiting is quite OK. I'd rather have it run longer/slower and be a good netizen along the way.

Currently I'm confronted with either 301 (Moved Permanently) or 403 (Forbidden) responses, and I'm assuming this is due to Cloudflare. The site is heavy on JavaScript :-( Does anyone have tips, advice, or a solution for getting such a website archived?

There is 1 answer below.

fireindark707 answered:

I think you should try using Selenium.
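
A minimal sketch of what that could look like, assuming Python with Firefox/geckodriver; the start URL, the one-minute delay, and the file-naming scheme are placeholders, not details from the original question:

import time
from urllib.parse import urlparse
from selenium import webdriver
from selenium.webdriver.common.by import By

# A real (non-headless) browser is more likely to pass Cloudflare's
# JavaScript challenge than a bare HTTP client like wget.
driver = webdriver.Firefox()

start_url = "https://www.domain.tld/"  # placeholder domain
to_visit = [start_url]
seen = set()

while to_visit:
    url = to_visit.pop()
    if url in seen:
        continue
    seen.add(url)
    driver.get(url)
    time.sleep(60)  # throttle to about one page per minute, like wget -w 1m

    # Save the rendered DOM (after JavaScript has run), not the raw response.
    name = urlparse(url).path.strip("/").replace("/", "_") or "index"
    with open(name + ".html", "w", encoding="utf-8") as f:
        f.write(driver.page_source)

    # Queue same-site links for later rounds.
    for a in driver.find_elements(By.CSS_SELECTOR, "a[href]"):
        href = a.get_attribute("href")
        if href and href.startswith(start_url):
            to_visit.append(href)

driver.quit()

Note that this only captures the rendered HTML; page requisites (CSS, images, scripts) would still need to be fetched separately.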