Best answer, by Sebastian Nagel

Because a new WARC file is added to the news dataset every few hours, a static file list does not make sense. Instead, you can list the files with the AWS CLI, restricted to any year or month, e.g.:

aws --no-sign-request s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/2017/09/
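
If you want a plain list of WARC locations rather than the raw listing, something like the following sketch may help. It assumes the usual `aws s3 ls --recursive` output layout (date, time, size, key, separated by whitespace) and that the WARC keys end in `.warc.gz`; the function names `to_s3_uris` and `list_news_warcs` are made up for this example.

```shell
# Turn `aws s3 ls --recursive` output (read from stdin) into full S3 URIs.
# Assumption: each line is "date time size key", so the key is field 4.
to_s3_uris() {
  # Keep only keys ending in .warc.gz and prefix them with the bucket name.
  awk '$4 ~ /\.warc\.gz$/ { print "s3://commoncrawl/" $4 }'
}

# List the WARC files for one year/month subset, e.g. 2017/09.
list_news_warcs() {
  aws --no-sign-request s3 ls --recursive \
    "s3://commoncrawl/crawl-data/CC-NEWS/$1/" | to_s3_uris
}
```

For example, `list_news_warcs 2017/09 > cc-news-2017-09.txt` would write the current list for September 2017 to a file (the file name is arbitrary). Because new files keep arriving, rerun it whenever you need an up-to-date list.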

See also the news data release announcement.