DefaultPartitioner vs TimeBasedPartitioner S3 upload performance difference with 100 partitions and 50K flush size

I'm using a 100-partition topic with 3 replicas and 2 ISRs in an MSK Serverless cluster.

My EC2 instance running the Confluent S3 sink connector ingests 56 GB of data from my MSK cluster in 15 minutes but uploads only 37 GB to S3 in the same time frame. The instance's resources are underutilized and I'm using an S3 endpoint, which makes me think this gap between ingest and upload is caused by my flush size and partitioning scheme.

My S3 sink connector config:

tasks.max=50
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
flush.size=50000
rotate.interval.ms=-1
rotate.schedule.interval.ms=-1

As I understand it, the current config waits for 50,000 messages to accumulate for each partition before uploading a file to S3. So, if I use a time-based hourly partitioner, would this 50k message limit be reached much more quickly, since there is only 1 partition for the 15-minute time frame instead of 100?

Thanks in advance.

OneCricketeer On

Each task has its own flush buffer. The hourly partitioner will either buffer the whole hour or write out each set of 50,000 records within the hour partition, whichever occurs first.
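
The "whichever occurs first" rule can be sketched as a toy model. This is not the connector's code: `simulate_flushes` is a hypothetical helper, and it assumes that records are counted separately for each Kafka partition and that a file is closed either when `flush_size` records have accumulated or when a record for a new hour arrives. The small numbers (4 Kafka partitions, a flush size of 5) are made up for illustration.

```python
from collections import defaultdict


def simulate_flushes(records, flush_size):
    """Toy model of flush-on-size or flush-on-hour-change.

    records: iterable of (kafka_partition, hour) tuples in arrival order.
    Returns a list of (kafka_partition, hour, record_count), one per file written.
    Assumption: counters are kept per Kafka partition (hypothetical model).
    """
    open_hour = {}             # kafka_partition -> hour of the file currently open
    counts = defaultdict(int)  # kafka_partition -> records buffered in that file
    files = []
    for kp, hour in records:
        # A record for a new hour closes whatever is buffered for the old hour.
        if kp in open_hour and open_hour[kp] != hour and counts[kp] > 0:
            files.append((kp, open_hour[kp], counts[kp]))
            counts[kp] = 0
        open_hour[kp] = hour
        counts[kp] += 1
        # Reaching flush_size closes the file immediately.
        if counts[kp] >= flush_size:
            files.append((kp, hour, counts[kp]))
            counts[kp] = 0
    return files


# 4 Kafka partitions, each receiving 12 records in hour 0, then 3 in hour 1.
records = [(kp, 0) for _ in range(12) for kp in range(4)]
records += [(kp, 1) for _ in range(3) for kp in range(4)]

files = simulate_flushes(records, flush_size=5)
for f in files:
    print(f)
```

Under these assumptions, every Kafka partition still has to reach the flush size on its own, so switching to an hourly partitioner changes the output path but does not pool the 100 partitions into one shared 50,000-record counter. Records for the newest hour also stay buffered until one of the two conditions is met.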