Counting the number of records in files using a Bash script


I have written this script to count the number of records in the files at an AWS S3 location.

#!/bin/bash

export bucket="$1"
export prefix="$2"
export threads="$3"

copy_and_count(){
  local temp_file line_count
  temp_file=$(mktemp)
  # Only count the file if the download succeeded, so a failed or
  # partial copy does not silently contribute a wrong count
  if aws s3 cp "s3://$bucket/$1" "$temp_file" --quiet; then
    line_count=$(sed -n '$=' "$temp_file")
    # sed prints nothing for an empty file, so default to 0
    echo "${line_count:-0}"
  fi
  rm -f "$temp_file"
}

main(){
  # NOTE: awk '{print $4}' will truncate object keys that contain spaces
  objects=$(aws s3 ls "s3://$bucket/$prefix" --recursive | awk '{print $4}')
  # Pass each key as a positional argument ("$1") instead of splicing it
  # into the command string, so special characters in a key cannot break the call
  echo "$objects" \
    | xargs -n 1 -P "$threads" -I {} bash -c 'copy_and_count "$1"' _ {} \
    | awk '{s+=$1} END {print s}' > total_count.txt
}

export -f copy_and_count

main

This script works for smaller data sets, such as 4 or 5 files with about 10,000 records each. But when I run the same script against a larger data set of around 5,000 files, with the same number of records per file, I do not get the correct total count.

Here is the command that I am using to run the script:

bash script.sh bucket-name object-prefix 10

Here I am using dummy values for bucket-name and object-prefix, and 10 is the number of parallel workers to invoke.

So I am not sure why this is happening, whether it is caused by the parallel workers or by something else.
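For reference, one way to sidestep the temp-file step entirely is to stream each object to stdout and count lines directly; this is a minimal sketch, assuming the AWS CLI's `aws s3 cp ... -` stream-to-stdout form, with placeholder bucket/prefix values rather than real ones:

```shell
#!/bin/bash
# Sketch: count records across S3 objects by streaming each object to
# stdout instead of copying it to a temp file first.
# "your-bucket" and "your/prefix" are hypothetical placeholder values.
bucket="your-bucket"
prefix="your/prefix"
threads=10

count_stream(){
  # "-" as the destination makes `aws s3 cp` write the object to stdout;
  # a failed download contributes 0 via wc -l on empty input
  aws s3 cp "s3://$bucket/$1" - 2>/dev/null | wc -l
}
export -f count_stream
export bucket

total(){
  aws s3 ls "s3://$bucket/$prefix" --recursive \
    | awk '{print $4}' \
    | xargs -n 1 -P "$threads" -I {} bash -c 'count_stream "$1"' _ {} \
    | awk '{s+=$1} END {print s}'
}

# total   # uncomment to run against a real bucket
```

Note that `wc -l`, unlike `sed -n '$='`, does not count a final line that lacks a trailing newline, so pick whichever matches how your records are terminated.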
