How to count the number of files in an AWS S3 bucket using a Python script and the boto3 library


I am trying to write a Python script that goes to an AWS S3 bucket path and counts how many files are there. There are a lot of files, but I only want to count the ones whose names start with file_. These files are numbered incrementally. On top of that, I also have to traverse the folders, because the files are chunks of a video, so I want to go through each quality folder and check its chunk count too.

Path would be like: s3://url/144p/

In this path there will be chunks that need to be counted.

I used the boto3 library. Here is my code:

import csv
import boto3

# Base S3 URL
base_s3_url = 's3://coursevideotesting/'  # Replace this with your base S3 URL

# Input and output CSV file names
input_csv_file = 'ldt_ffw_course_videos_temp.csv'  # Replace with your input CSV file name
output_csv_file = 'file_count_result.csv'  # Replace with your output CSV file name

# Function to count 'file_000.ts' objects in a specific S3 folder
def count_file_objects(s3_bucket, s3_folder):
    s3 = boto3.client('s3')
    response = s3.list_objects_v2(Bucket=s3_bucket, Prefix=s3_folder)

    # Count 'file_000.ts' objects in the folder
    count = sum(1 for obj in response.get('Contents', []) if obj['Key'].startswith('file_000.ts'))
    return count

# Read URLs from input CSV and check file counts
with open(input_csv_file, mode='r') as infile, open(output_csv_file, mode='w', newline='') as outfile:
    reader = csv.DictReader(infile)
    fieldnames = ['URL', 'Actual Files', 'Expected Files']
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()

    for row in reader:
        s3_url = base_s3_url + row['course_video_s3_url']  # Replace 'URL_Column_Name' with your column name
        expected_files = int(row['course_video_ts_file_cnt'])  # Replace 'Expected_Files_Column_Name' with your column name
        actual_files = count_file_objects('coursevideotesting', s3_url)  # Replace 'your-s3-bucket-name' with your bucket name
        
        writer.writerow({'URL': s3_url, 'Actual Files': actual_files, 'Expected Files': expected_files})

What I received:

URL,Actual Files,Expected Files
s3://coursevideotesting/.../144p/,0,28
s3://coursevideotesting/.../144p/,0,34
s3://coursevideotesting/.../144p/,0,54
s3://coursevideotesting/.../144p/,0,57

What I expected:

URL,Actual Files,Expected Files
s3://coursevideotesting/.../144p/,28,28
s3://coursevideotesting/.../144p/,34,34
s3://coursevideotesting/.../144p/,52,54
s3://coursevideotesting/.../144p/,57,57

The actual count being less than the expected count indicates a missing or corrupt file, so I can work directly on those chunks instead of manually checking every one of them. I have 5 quality types and several chunks in each folder.

Output for print("Response: ",response.get('Contents')) Checking the Folder in this is just checking weather path is right or wrong. In this path, there are file_000.ts, file_001.ts and so on that I want to count.

Checking folder: s3://coursevideotesting/Financial_Freedom_Course_Kannada/00_Course_Trailer_New_update/360p/
Response:  None
Actual chunks found: 0

As for print(response['Contents']), it throws an error.
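
For reference, the check described above corresponds roughly to the snippet below; the bucket name and folder value are the ones from the debug output, and the folder is the full s3:// URL that the script builds:

import boto3

s3 = boto3.client('s3')

# Folder value exactly as the script builds it: base_s3_url + the CSV path column
s3_folder = 's3://coursevideotesting/Financial_Freedom_Course_Kannada/00_Course_Trailer_New_update/360p/'

print("Checking folder:", s3_folder)
response = s3.list_objects_v2(Bucket='coursevideotesting', Prefix=s3_folder)

# response.get('Contents') is None when the listing returns no objects for this Prefix;
# response['Contents'] raises a KeyError in that same case
print("Response: ", response.get('Contents'))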

1 Answer

You might find it easier to use resource methods rather than client methods. They will do pagination for you (to handle more than 1000 objects) and the functions are more Pythonic.

For example, you can count the files like this:

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('BUCKET-NAME')

prefix = 'foo/'

count = 0

for obj in bucket.objects.filter(Prefix=prefix):
    if obj.key.endswith('/file_000.ts'):
        count += 1

print(count)

Note that this example is using endswith('/file_000.ts') because the key of an object might look like this:

144p/something/file_000.ts
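
To apply this to the CSV-driven check in the question, one possible sketch (assuming the same file names and column names as in the question, and noting that the Prefix passed to filter() must be the key path inside the bucket, not the full s3:// URL) could look like this:

import csv
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('coursevideotesting')
base_s3_url = 's3://coursevideotesting/'

def count_chunks(prefix):
    # Count every object under the prefix whose file name starts with 'file_'
    return sum(
        1
        for obj in bucket.objects.filter(Prefix=prefix)
        if obj.key.rsplit('/', 1)[-1].startswith('file_')
    )

with open('ldt_ffw_course_videos_temp.csv') as infile, \
        open('file_count_result.csv', 'w', newline='') as outfile:
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=['URL', 'Actual Files', 'Expected Files'])
    writer.writeheader()

    for row in reader:
        # The key prefix is the part of the path after 's3://bucket/'
        prefix = row['course_video_s3_url']
        writer.writerow({
            'URL': base_s3_url + prefix,
            'Actual Files': count_chunks(prefix),
            'Expected Files': int(row['course_video_ts_file_cnt']),
        })

Because bucket.objects.filter() paginates automatically, this also keeps working when a folder contains more than 1000 objects.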