Listing relative file paths of large nested [google fuse] folders through Colab


I have mounted a cloud storage bucket containing a dataset to my Google Colab environment. The bucket holds nested folders in the format /snapshotserengeti-unzipped/S1/B04/B04_R1/S1_B04_R1_PICT0003.JPG. I also have access to two .csv files: one lists every image in the dataset along with a unique identifier, and the other maps those identifiers to their annotations. I'm trying to write a PyTorch Dataset class to load batches of images. My class so far is here:

import os

import pandas as pd
from PIL import Image
from torch.utils import data


class SerengetiDataset(data.Dataset):
    def __init__(self, root, csv_file1, csv_file2, transform=None):
        self.root = root
        self.transform = transform

        self.image_info = pd.read_csv(csv_file1)

        # Flag rows whose image file is actually present under root.
        self.image_info['file_exists'] = self.image_info['image_path_rel'].apply(
            lambda x: os.path.exists(os.path.join(root, x)))

        self.labels_info = pd.read_csv(csv_file2)

        # Join the image listing to its annotations on the shared capture id.
        self.annotations = pd.merge(self.image_info, self.labels_info, on='capture_id')

        self.filenames = self.annotations['image_path_rel'].tolist()

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, index):
        filename = self.filenames[index]
        path = os.path.join(self.root, filename)
        image = Image.open(path).convert('RGB')

        label = self.annotations.loc[index, 'question__species']

        if self.transform is not None:
            image = self.transform(image)

        return image, label
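
For context, this is roughly how I intend to use the class; the transform, CSV filenames, and batch size below are placeholders rather than my real settings:

import torchvision.transforms as T
from torch.utils.data import DataLoader

transform = T.Compose([T.Resize((224, 224)), T.ToTensor()])
dataset = SerengetiDataset('/content/datasets/snapshotserengeti-unzipped',
                           'images.csv', 'annotations.csv', transform=transform)
loader = DataLoader(dataset, batch_size=32, shuffle=True)
images, labels = next(iter(loader))  # labels is a list of species strings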

The problem is that the mounted folder is missing some of the entries listed in the csv file. When one of those files is requested from the dataset, which indexes images by the filenames in the images csv, the request raises an error because the file does not actually exist in the folder.

My solution was to add an extra boolean column to the .csv recording whether the corresponding image is present in the folder; I can then filter out all entries where it is false, as in the sketch below. My issue is that the dataset contains over 7 million images spread across nested folders.
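
Once that column exists, the filtering step itself is cheap; something like this (the filenames here are placeholders):

import pandas as pd

images_df = pd.read_csv('images_updated.csv')
filtered_df = images_df[images_df['file_exists']]  # keep only rows whose file is present
filtered_df.to_csv('filtered_images.csv', index=False)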

I tried implementing the solution below, but the code takes over an hour to complete on a high-RAM Google Colab environment. I believe this is because the files are mounted with FUSE, so each per-file existence check is a bottleneck. If anyone has an alternative solution, or can see a way to speed up my existing code, I'd really appreciate it.

import os

import pandas as pd

images_df = pd.read_csv('images.csv')

def file_exists(row):
    # One os.path.exists call per row: each one is a stat through the FUSE mount.
    filename = os.path.join('/content/datasets/snapshotserengeti-unzipped', row['image_path_rel'])
    return os.path.exists(filename)

images_df['file_exists'] = images_df.apply(file_exists, axis=1)

images_df.to_csv('images_updated.csv', index=False)
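
Edit: one untested idea is to walk the mounted tree once with os.walk and build an in-memory set of relative paths, so the per-row check becomes a set lookup instead of a FUSE stat. This still pays for one directory listing per folder, so I don't know how much it saves in practice (reusing images_df from above):

import os

ROOT = '/content/datasets/snapshotserengeti-unzipped'

# Collect every relative file path in a single traversal of the mount.
existing = set()
for dirpath, _dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        existing.add(os.path.relpath(os.path.join(dirpath, name), ROOT))

# Membership tests are now in-memory: one vectorised pass over the frame.
images_df['file_exists'] = images_df['image_path_rel'].isin(existing)
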
1 Answer

Answered by Rufus:

In case anyone needs it, this is a working solution I used to create the filtered images.csv file. I used the code provided in [Walking a directory tree inside a Google Cloud Platform bucket in Python][1] to list the GCS objects by prefix and build a set of the relative paths, then used that set to filter out the entries in images.csv that don't exist in the file structure.

import pandas as pd
from google.cloud import storage

images_df = pd.read_csv('images.csv')

bucket = storage.Client().get_bucket('public-datasets-lila')

# List every object under the prefix and record its path relative to it.
relative_dirs = set()
blobs = bucket.list_blobs(prefix="snapshotserengeti-unzipped/")
for blob in blobs:
    relative_dirs.add(blob.name.replace('snapshotserengeti-unzipped/', ''))

# Keep only the rows whose relative path actually exists in the bucket.
filtered_images_df = images_df[images_df['image_path_rel'].isin(relative_dirs)]
filtered_images_df.to_csv('filtered_images.csv', index=False)
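
The reason this is so much faster than per-file checks: bucket.list_blobs() pages through the bucket's object metadata directly over the GCS API (on the order of a thousand names per request), whereas each os.path.exists() through the gcsfuse mount triggers its own metadata lookup, so 7 million files mean millions of round trips.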