How to decompress a warc.zst file?


I am trying to decompress a WARC ZST file that I downloaded from here: https://archive.org/details/archiveteam_yahooanswers_20210422220546_c4fac540

I tried the command

zstd -d yahooanswers_20210422220546_c4fac540.1619026173.megawarc.warc.zst

but I got this error:

73.megawarc.warc.zst : 0 MB... 73.megawarc.warc.zst : Decoding error (36) : Dictionary mismatch

How can I find the said dictionary, or are there any alternatives to this?

1

There is 1 best solution below

3
Jimmy On

The dictionary can be found inside the first skippable frame of the warc.
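To illustrate what "first skippable frame" means, here is a minimal Python sketch. It is not the linked script. It assumes the layout from the Zstandard format: a 4-byte little-endian magic number in the range 0x184D2A50 to 0x184D2A5F, then a 4-byte little-endian length, then the payload. It also assumes, as I understand the warc.zst convention, that the payload may itself be zstd-compressed. The function names are made up for this example.

```python
import struct

# Skippable frame magic numbers span this range in the Zstandard format.
SKIPPABLE_MAGIC_MIN = 0x184D2A50
SKIPPABLE_MAGIC_MAX = 0x184D2A5F
# Little-endian byte form of the regular zstd frame magic 0xFD2FB528.
ZSTD_FRAME_MAGIC = b"\x28\xb5\x2f\xfd"


def read_first_skippable_frame(data: bytes) -> bytes:
    """Return the payload of a skippable frame at the start of data."""
    if len(data) < 8:
        raise ValueError("input too short to hold a frame header")
    magic, size = struct.unpack("<II", data[:8])
    if not SKIPPABLE_MAGIC_MIN <= magic <= SKIPPABLE_MAGIC_MAX:
        raise ValueError("input does not start with a skippable frame")
    payload = data[8:8 + size]
    if len(payload) != size:
        raise ValueError("skippable frame is truncated")
    return payload


def extract_dictionary(data: bytes) -> bytes:
    """Return the dictionary bytes, decompressing the payload if needed."""
    payload = read_first_skippable_frame(data)
    if payload[:4] == ZSTD_FRAME_MAGIC:
        # Assumption: a payload starting with the zstd frame magic is a
        # compressed dictionary. Needs the third-party zstandard package.
        import zstandard
        return zstandard.ZstdDecompressor().decompressobj().decompress(payload)
    return payload
```

To try it on a real file, read the file's bytes (the start of the file is enough as long as it covers the whole frame), pass them to extract_dictionary, and write the result to a file named dict.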

OrIdow6 wrote this script to extract the dictionary: https://transfer.notkiska.pw/inline/TXlRo/xtract.py

You'll need python3, the zstd command-line tool, and the zstandard Python package. Then run:

python ./xtract.py /path/to/megawarc.warc.zst > dict

Then you can decompress the archive, passing the extracted dictionary with -D:

zstd -d /path/to/megawarc.warc.zst -D dict

You should then be able to view the megawarc with your standard WARC viewing tools.