Is there a way to sync application instances using Kafka Streams to avoid duplicate message processing?


In my Spring Boot application I use Kafka Streams. The topology groups messages from a topic by key, windows them by a fixed time interval, uses reduce to keep only the latest message for every key, and at the end of each time window processes that latest message per key and sends it to another queue.

stream.groupByKey()
        .windowedBy(window)
        .reduce((oldValue, newValue) -> newValue, materialized)  // keep only the latest value per key and window
        .toStream()
        .process(() -> /* process and send the latest message to a queue */);

It works fine with a single instance, but when the number of instances increases, every instance processes the same messages, and in the downstream queue I see multiple messages for the same key in the same time window.

I want only one message per key per time window. In other words, if one instance is working on messages with a given key, the other instances should not process messages for that key. Is there any way to achieve this without a custom implementation in the message-processing logic?


1 Answer

Matthias J. Sax:

There are two things you need to consider:

  1. Kafka Streams assumes that input data is partitioned by key. If that's not the case, you would need to replace groupByKey() with groupBy((k, v) -> k) or repartition().groupByKey(). If data is not partitioned by key, you might get multiple windows for a single key (both adjustments are shown in the sketch after this list).

  2. By default, Kafka Streams does not emit a single "final" result for a windowed aggregation; it continuously refines the result as long as the window is open and emits intermediate results. If you want a single result per window, you can use .reduce(...).suppress(...) or .windowedBy(...).emitStrategy(...) (check out the docs for details about both options).
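
For illustration, here is a minimal sketch that combines both points. The topic name "input-topic", the String key/value types, and the 1-minute window size are assumptions, not taken from the question. repartition() routes the data through an internal topic keyed by the record key, so all records for one key land on the same instance, and suppress(untilWindowCloses(...)) holds back intermediate updates so exactly one record per key and window is forwarded downstream.

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Suppressed;
import org.apache.kafka.streams.kstream.Suppressed.BufferConfig;
import org.apache.kafka.streams.kstream.TimeWindows;

public class LatestPerKeyTopology {

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // assumed topic name and window size, for illustration only
        KStream<String, String> stream = builder.stream("input-topic");
        TimeWindows window = TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1));

        stream
            // (1) ensure records are partitioned by key; repartition() writes through an
            // internal topic so all records with the same key go to the same instance
            .repartition()
            .groupByKey()
            .windowedBy(window)
            // keep only the latest value per key and window
            .reduce((oldValue, newValue) -> newValue)
            // (2) hold back intermediate updates and emit one final result per closed window
            .suppress(Suppressed.untilWindowCloses(BufferConfig.unbounded()))
            .toStream()
            .foreach((windowedKey, value) -> {
                // send the single final message per key/window to the downstream queue here
            });

        return builder.build();
    }
}

On newer Kafka Streams versions, the same "final result only" behavior can instead be requested with .windowedBy(window).emitStrategy(EmitStrategy.onWindowClose()) before the reduce, which avoids the suppress buffer.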