Apache Flink S3 ListBucket API calls


We are using AWS S3 to store Flink savepoints. We expected Flink to mainly use the GetObject and PutObject operations, but in our case ListBucket is the most frequently called API for the S3 bucket associated with Flink. Why would Flink be using the ListBucket API, and is there any way to reduce these calls?

The goal is to reduce these calls and, with them, the associated S3 cost.

David Anderson On

Flink uses a filesystem abstraction for its checkpoints and savepoints. In the case of S3, two different implementations of this interface are available, one from Hadoop, and one from Presto.

The Hadoop S3 filesystem imitates a filesystem on top of S3:

  • before writing, it checks whether the "parent directory" exists
  • it creates empty marker files to mark the existence of such a parent directory
  • these existence checks are expensive and can violate read-after-create consistency: a restore operation can fail because a state file appears to be missing (due to caching in an S3 load balancer), and it succeeds only once the file eventually becomes visible

These checks are most likely where your ListBucket calls come from: S3 has no real directories, so probing for one means listing keys under a prefix.

The Presto S3 implementation doesn't try to do that magic; it simply does PUT/GET operations. This is why it is the recommended file system for checkpointing to S3. See https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/filesystems/s3/#hadooppresto-s3-file-systems-plugins for more info.
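As a rough sketch of what switching could look like, assuming the flink-s3-fs-presto plugin is installed: the linked docs describe copying the jar from opt/ into its own folder under plugins/ and addressing the Presto implementation through the s3p:// scheme. The bucket name and paths below are placeholders, and the exact configuration keys may differ between Flink versions, so check them against the docs for your release.

```yaml
# Flink configuration file (sketch, not a drop-in config).
#
# Assumed prerequisite, per the Flink S3 docs:
#   mkdir ./plugins/s3-fs-presto
#   cp ./opt/flink-s3-fs-presto-<version>.jar ./plugins/s3-fs-presto/
#
# "my-flink-bucket" and the paths are hypothetical placeholders.
# The s3p:// scheme explicitly selects the Presto-based filesystem,
# which matters when the Hadoop plugin (s3a://) is installed as well.
state.checkpoints.dir: s3p://my-flink-bucket/checkpoints
state.savepoints.dir: s3p://my-flink-bucket/savepoints
```

If only one of the two plugins is installed, the plain s3:// scheme resolves to it, so it is worth checking which jar is actually in your plugins directory before changing any paths.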

Note that the Presto version isn't perfect either; it has its own issues, which may or may not affect your use case(s).