I have written this script to count the number of records in the files at an AWS S3 location:
#!/bin/bash
# Usage: script.sh <bucket> <prefix> <threads>
export bucket="$1"
export prefix="$2"
export threads="$3"

# Download one object to a temp file, print its line count, then clean up.
copy_and_count(){
    temp_file=$(mktemp)
    aws s3 cp "s3://$bucket/$1" "$temp_file"
    line_count=$(sed -n '$=' "$temp_file")
    echo "$line_count"
    rm "$temp_file"
}

main(){
    # List all object keys under the prefix.
    objects=$(aws s3 ls "s3://$bucket/$prefix" --recursive | awk '{print $4}')
    # Count each object's lines in parallel and sum the results.
    echo "$objects" | xargs -n 1 -P "$threads" -I {} bash -c 'copy_and_count "{}"' | awk '{s+=$1} END {print s}' > total_count.txt
}

# Export the function so the xargs-spawned bash subshells can see it.
export -f copy_and_count
main
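One detail I was unsure about is interpolating {} directly into the bash -c string in main, since keys containing spaces or shell metacharacters could misbehave there. A variant that passes the key as a positional argument instead (just a sketch, otherwise identical) would be:

echo "$objects" | xargs -n 1 -P "$threads" -I {} bash -c 'copy_and_count "$1"' _ {}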
This script works for smaller data sets, e.g. 4 or 5 files with 10,000 records each, but when I run the same script against a larger data set of around 5,000 files with the same number of records per file, the total count is not correct. Since each file has 10,000 records, I would expect roughly 5,000 × 10,000 = 50,000,000 in total_count.txt.
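To check whether some of the aws s3 cp calls fail silently at that scale (throttling, transient errors) and their files simply never get counted, I am considering adding basic error reporting to copy_and_count, along these lines (a sketch; failures would be printed to stderr instead of being summed):

copy_and_count(){
    temp_file=$(mktemp)
    # discard the "download: ..." completion line so only the count reaches stdout
    if aws s3 cp "s3://$bucket/$1" "$temp_file" > /dev/null; then
        sed -n '$=' "$temp_file"
    else
        # report the failing key on stderr so it is visible but not summed
        echo "FAILED: $1" >&2
        echo 0
    fi
    rm "$temp_file"
}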
Here is the command that I am using to run the script:
bash script.sh bucket-name object-prefix 10
Here bucket-name and object-prefix are dummy values, and 10 is the number of parallel threads being invoked.
I am not sure why this is happening, whether it is caused by the parallel threads or by something else.
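For reference, I also sketched a simpler variant of copy_and_count that streams each object straight to wc -l instead of copying it to a temp file (assuming the objects are plain text; this is only a sketch, not what the script currently does):

copy_and_count(){
    # stream the object to stdout and count its lines; no temp file involved
    aws s3 cp "s3://$bucket/$1" - | wc -l
}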