Apache Flink S3 ListBucket API calls

37 Views Asked by Divyanshu Jaiswal At 20 February 2024 at 05:56

We are using AWS S3 to store Flink savepoints. We expected Flink to use mainly the GetObject and PutObject operations, but in our case ListBucket is the most frequently called API on the S3 bucket used by Flink. Why does Flink use the ListBucket API, and is there any way to reduce these calls?

The goal is to reduce these calls, and with them the associated AWS S3 cost.



David Anderson On 20 February 2024 at 19:21

Flink uses a filesystem abstraction for its checkpoints and savepoints. For S3, two implementations of this interface are available: one based on Hadoop and one based on Presto.

The Hadoop S3 filesystem imitates a filesystem on top of S3:

- Before writing, it checks whether the "parent directory" exists.
- It creates empty marker files to mark the existence of such a parent directory.
- These existence requests are expensive and can violate read-after-create consistency. For example, a restore operation can fail because a state file appears to be missing (due to caching in an S3 load balancer); the restore succeeds only once the file eventually becomes visible.
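To make the difference in request patterns concrete, here is a toy model in Python. It is not the actual Hadoop or Presto code: `CountingStore`, `write_directory_style`, and `write_plain` are hypothetical names, and the sketch assumes the "parent directory" check is issued as a prefix listing, which is the kind of request that would show up as ListBucket.

```python
from collections import Counter


class CountingStore:
    """In-memory stand-in for an S3 bucket that counts request types."""

    def __init__(self):
        self.objects = {}
        self.calls = Counter()

    def put(self, key, data=b""):
        self.calls["PUT"] += 1
        self.objects[key] = data

    def list_prefix(self, prefix):
        self.calls["LIST"] += 1
        return [k for k in self.objects if k.startswith(prefix)]


def write_directory_style(store, key, data):
    """Imitate a filesystem: check the parent 'directory' before each write."""
    parent = key.rsplit("/", 1)[0] + "/"
    # Assumption: the existence check is a listing of the parent prefix.
    if not store.list_prefix(parent):
        store.put(parent)  # empty marker object for the "directory"
    store.put(key, data)


def write_plain(store, key, data):
    """Write the object directly, with no existence checks."""
    store.put(key, data)


def demo():
    """Write the same three objects both ways and return the call counters."""
    keys = [f"savepoints/sp-1/state-{i}" for i in range(3)]
    directory_style, plain = CountingStore(), CountingStore()
    for key in keys:
        write_directory_style(directory_style, key, b"x")
        write_plain(plain, key, b"x")
    return directory_style.calls, plain.calls
```

In this model the directory-style writer issues one LIST per object written, plus one extra PUT for the marker, while the plain writer issues only the three PUTs. Real request counts depend on the filesystem version and configuration.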

Presto S3 doesn't try to do that magic; it simply does PUT/GET operations. This is why the Presto S3 implementation is the recommended file system for checkpointing to S3. See https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/filesystems/s3/#hadooppresto-s3-file-systems-plugins for more info.
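According to the linked Flink documentation, the Presto plugin (`flink-s3-fs-presto`) is registered under the `s3://` and `s3p://` schemes and the Hadoop plugin (`flink-s3-fs-hadoop`) under `s3://` and `s3a://`, so `s3p://` selects the Presto implementation explicitly when both plugins are loaded. A minimal sketch of rewriting a target URI accordingly, where `to_presto_scheme` and the bucket name are hypothetical:

```python
from urllib.parse import urlsplit, urlunsplit


def to_presto_scheme(uri: str) -> str:
    """Return the same S3 location, addressed through the s3p:// scheme."""
    parts = urlsplit(uri)
    if parts.scheme not in {"s3", "s3a", "s3p"}:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # Only the scheme changes; bucket and key prefix stay as they are.
    return urlunsplit(("s3p", parts.netloc, parts.path, parts.query, parts.fragment))


# Hypothetical bucket and prefix, for illustration only.
target = to_presto_scheme("s3://my-bucket/savepoints")
```

This only changes how a location is addressed. Whether existing savepoints can be restored through a different scheme should be verified for your setup before switching.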

Note that the Presto version isn't perfect either; it has its own issues, which may or may not affect your use case(s).
