Impact of growing manifest file in gcloud storage cp


I am copying a large number of files from a source bucket to a destination bucket, where the source bucket is encrypted with AES256.

gcloud storage cp is the fastest option to achieve this, and it lets us pass the encryption keys.

However, I want to skip files that have already been copied. There is a way to pass a manifest file so that already-copied files are skipped; a sketch is shown below.
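Something like the following, a minimal sketch with hypothetical bucket names and a base64-encoded AES256 key in $CSEK_KEY (check gcloud storage cp --help on your version for the exact --decryption-keys and --manifest-path behavior):

    # Copy recursively, decrypting source objects with the customer-supplied
    # key and logging every object to a manifest CSV.
    gcloud storage cp -r "gs://source-bucket/*" gs://dest-bucket/ \
      --decryption-keys="$CSEK_KEY" \
      --manifest-path=transfer-manifest.csv

    # Re-running the same command skips sources that the manifest already
    # records as successfully copied.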

My concern is what happens when this manifest file grows bigger.

For example, transferring 3.5 GiB of data spread across 837,136 files produced a manifest file of ~278 MB, which works out to roughly 330 bytes per object; the manifest grows linearly with the number of files, not with the data size.
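Since the manifest is a plain CSV with one row per object, it can at least be inspected as a stream without loading it into memory (file name hypothetical, numbers from the transfer above):

    wc -c transfer-manifest.csv   # total bytes, ~278 MB here
    wc -l transfer-manifest.csv   # rows, ~837,136 -> a few hundred bytes per row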

Currently, the Storage Transfer Service doesn't support transfers where the source bucket is encrypted with AES256.

Question

So for transfers of terabytes of data, this file will become even bigger. How does gcloud storage cp handle and read this file? Will the size of the manifest file become a bottleneck and cause memory pressure or throttling issues? Is there any documentation on how gcloud storage handles this?

There is 1 answer below.

Robert G

Based on this Google blog on Faster Cloud Storage transfers using the gcloud command-line:

When transferring a single large file, the difference is even more pronounced. With a 10GB file, gcloud storage was 94% faster than gsutil on download and 57% faster on upload. This performance improvement comes without the need for extensive testing and tweaking, making it easy to see much faster transfer times.

Also, gcloud storage cp takes advantage of parallel composite uploads, wherein a file is divided into up to 32 chunks and uploaded in parallel to temporary objects; the final object is then composed from those temporary objects, and the temporary objects are deleted. A configuration sketch follows.
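In recent gcloud releases this behavior is controlled through configuration properties (property names as documented for gcloud config; the threshold value below is illustrative, not a recommendation):

    # Enable parallel composite uploads and set the file size above which
    # they are used.
    gcloud config set storage/parallel_composite_upload_enabled True
    gcloud config set storage/parallel_composite_upload_threshold 150M
    gcloud storage cp ./large-file.bin gs://dest-bucket/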

With regard to bottlenecks, it is suggested to avoid the sequential naming bottleneck, which can cause upload speed issues: when filenames are very similar, the majority of your connections are directed to the same index shard. A simple solution is to rename your folder or file structure so that the names are no longer sequential, as in the sketch below.
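For example, prepending a few characters of a hash spreads objects across shards (a minimal sketch; the bucket name and naming scheme are hypothetical):

    # Instead of uploading img-000001.jpg, img-000002.jpg, ... as-is,
    # prefix each key with part of its hash so keys are no longer sequential.
    for f in img-*.jpg; do
      prefix=$(printf '%s' "$f" | md5sum | cut -c1-4)
      gcloud storage cp "$f" "gs://dest-bucket/$prefix/$f"
    done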

There is also further documentation that you may find useful and can test in your projects.

It is also recommended to perform resumable uploads, since this is very important in case there's a network or connection interruption and you don't want to start uploading chunks of data all over again; see the sketch below.
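As a sketch, gcloud storage uses resumable uploads automatically for files above an internal size threshold and keeps tracker state on disk, so re-running an interrupted command continues instead of restarting (this tracker-file behavior is assumed to match gsutil's; verify on your version):

    gcloud storage cp ./big-file.iso gs://dest-bucket/   # interrupted mid-transfer
    gcloud storage cp ./big-file.iso gs://dest-bucket/   # re-run: resumes the upload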