I have a command:
s5cmd --endpoint-url http://192.168.1.40:9000 ls "s3://ccdata/minhash/20*/**" | awk '{print $NF}' > minhash_listings.txt
I have a local Minio deployment that holds 16-17 million files so far. The minhash prefix has 84 subfolders, each with 5000 subfolders underneath it, and each of those holds up to 15 files. An example would be /minhash/2014-01/0001/filename.json.gz
The underlying hardware is a Dell R370 with 40 x 16TB drives and 3 NVMe drives holding metadata, on a ZFS pool.
It takes about 3 days for this command to complete. I do not see any real load on network, CPU, RAM, or disk I/O while it is running, which makes me think the entire command is single-threaded.
My question is: given what I am trying to create, is there a better/faster way? I need to regenerate this file roughly once a day.
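1. Parallelize the Listing:
s5cmd concurrency flags: In s5cmd the -c (--concurrency) flag belongs to cp and controls how many parts of a transfer run at once, and the global --numworkers option sets the size of the worker pool. Neither is likely to speed up a single ls, because one wildcard listing appears to be served as a single paginated request stream over the common prefix (here minhash/20). Check s5cmd --help and s5cmd ls --help for your version. A more reliable way to get parallelism is to run one ls per subfolder (for example, one per s3://ccdata/minhash/2014-01/ style prefix) at the same time and concatenate the results.
Python Scripting: Write a Python script using a library like boto3 (which talks to Minio through its S3-compatible API) together with multiprocessing or a thread pool, so that different subfolders are listed concurrently. Listing is bound by request latency rather than CPU, so threads are usually sufficient.
2. Leverage Minio Server-Side Listing: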
Minio CLI stat command: The Minio CLI (mc) offers a stat command that reports metadata for a bucket or object; depending on the mc version, this can include an object count for the bucket. That gives you a count without listing each file, which is useful as a sanity check, but it does not produce the list of keys you need for the manifest.
Minio Python SDK: The Minio Python SDK provides list_objects, which accepts a prefix and a recursive option. S3 prefixes are literal strings rather than globs, so the 20* part of "s3://ccdata/minhash/20*" becomes the prefix minhash/20. Listing one narrower prefix per call (for example minhash/2014-01/) is what makes it possible to split the work across parallel workers.
3. Optimize Command Structure:
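Reduce awk usage: awk's overhead is negligible next to the listing itself, so removing it will not noticeably change a three-day runtime. Note, though, that print $NF returns only the last whitespace-separated field, so any key containing a space would be truncated. If that matters, s5cmd has a global --json option that emits one JSON object per line, and the key can be parsed from that instead (check s5cmd --help for your version).
4. Minio Server-Side Filtering (if supported):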
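Minio Lifecycle Rules: Lifecycle rules manage object expiration and tiering; they do not generate manifest files. AWS S3 has an Inventory feature for that purpose, but Minio does not appear to offer a built-in equivalent. If a full daily listing stays too slow, the alternative is to maintain the manifest incrementally, for example by appending keys as they are written (from the job that creates the minhash files, or from bucket event notifications if you have them configured), so the full listing only has to run occasionally as a consistency check.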
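As a rough illustration of the Python Scripting suggestion above, here is a minimal sketch that discovers the subfolders under minhash/ and lists each one in its own thread. It assumes boto3 is installed and that credentials come from the environment; the function names (child_prefixes, list_keys, list_parallel, write_manifest) and the worker count of 32 are made up for this example, and the endpoint and bucket are taken from the command in the question. It has not been tuned against a deployment of this size.

```python
from concurrent.futures import ThreadPoolExecutor


def child_prefixes(client, bucket, prefix):
    """Return the immediate "subfolders" under prefix, using Delimiter='/'."""
    paginator = client.get_paginator("list_objects_v2")
    found = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        for entry in page.get("CommonPrefixes", []):
            found.append(entry["Prefix"])
    return found


def list_keys(client, bucket, prefix):
    """Return every object key under prefix (recursive, paginated)."""
    paginator = client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


def list_parallel(client, bucket, prefixes, workers=32):
    """List several prefixes concurrently; output follows the order of prefixes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_prefix = pool.map(lambda p: list_keys(client, bucket, p), prefixes)
        return [key for keys in per_prefix for key in keys]


def write_manifest(endpoint_url, bucket, root, out_path, workers=32):
    """Write one object key per line for every subfolder of root starting with '20'."""
    import boto3  # imported lazily so the helpers above have no hard dependency

    # Credentials are resolved by boto3 from the environment or ~/.aws.
    client = boto3.client("s3", endpoint_url=endpoint_url)
    months = [p for p in child_prefixes(client, bucket, root)
              if p.startswith(root + "20")]
    with open(out_path, "w") as out:
        for key in list_parallel(client, bucket, months, workers):
            out.write(key + "\n")
```

To try it, call write_manifest("http://192.168.1.40:9000", "ccdata", "minhash/", "minhash_listings.txt"). This splits the work into 84 independent listings; if that is still too coarse, child_prefixes can be applied a second time to fan out over the 5000 subfolders of each one. All keys are held in memory before being written, which should be acceptable for a few tens of millions of short keys but is worth watching.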