One consumer to multiple tables or many consumers per table

488 Views Asked by friartuck At 05 October 2022 at 03:35

I have a Kafka topic with millions of sale events. I have a consumer which, on every message, inserts the data into 4 tables:

- 1 for the raw sales
- 1 for the sales sum by date by product category (date, product_category, sale_sum)
- 1 for the sales sum by date for customer (date, customer_id, sale_sum)
- 1 for the sales sum by date for location (date, location_id, sale_sum)

I use a SQL database for storing my data, so the operations above are insert or update operations.
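
For illustration, the insert-or-update step for one sale event could look roughly like the following. This is a minimal sketch, assuming SQLite (3.24 or newer, for `ON CONFLICT ... DO UPDATE`) through Python's `sqlite3` module; the actual database is not named in the question, and every table and column name here is hypothetical.

```python
import sqlite3

# Hypothetical schema: all table and column names are assumptions.
SCHEMA = """
CREATE TABLE raw_sales (sale_id INTEGER PRIMARY KEY, date TEXT,
    product_category TEXT, customer_id INTEGER, location_id INTEGER, amount REAL);
CREATE TABLE sales_by_category (date TEXT, product_category TEXT, sale_sum REAL,
    PRIMARY KEY (date, product_category));
CREATE TABLE sales_by_customer (date TEXT, customer_id INTEGER, sale_sum REAL,
    PRIMARY KEY (date, customer_id));
CREATE TABLE sales_by_location (date TEXT, location_id INTEGER, sale_sum REAL,
    PRIMARY KEY (date, location_id));
"""

# (table, key column) pairs for the three aggregate tables.
AGGREGATES = [
    ("sales_by_category", "product_category"),
    ("sales_by_customer", "customer_id"),
    ("sales_by_location", "location_id"),
]

def store_sale(conn, sale):
    # One transaction per event, so the four tables cannot drift apart.
    with conn:
        conn.execute(
            "INSERT INTO raw_sales VALUES (:sale_id, :date, :product_category,"
            " :customer_id, :location_id, :amount)",
            sale,
        )
        for table, key_col in AGGREGATES:
            # Names come from the fixed list above, never from message data.
            conn.execute(
                f"INSERT INTO {table} (date, {key_col}, sale_sum) VALUES (?, ?, ?)"
                f" ON CONFLICT (date, {key_col})"
                f" DO UPDATE SET sale_sum = sale_sum + excluded.sale_sum",
                (sale["date"], sale[key_col], sale["amount"]),
            )

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
store_sale(conn, {"sale_id": 1, "date": "2022-10-05", "product_category": "toys",
                  "customer_id": 7, "location_id": 3, "amount": 10.0})
store_sale(conn, {"sale_id": 2, "date": "2022-10-05", "product_category": "toys",
                  "customer_id": 8, "location_id": 3, "amount": 5.0})
```

After the two calls, the category and location tables each hold one row with the summed amount, while the customer table holds two rows.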

I am wondering, would it be better to have (i) 1 consumer that inserts into all 4 tables or (ii) 4 consumers, each responsible for inserting into one table?

What is best practice here?

Thanks

**aran** · Answer 1 · 2022-10-05T03:42:57.910000

From my point of view, you have three alternatives. To be honest, I'd personally choose the third one.

1 - One [consumer-producer] thread

In this scenario, you just have one thread that is responsible for:

1-Reading from Kafka
2-Process/Store in I
3-Process/Store in II
4-Process/Store in III
5-Process/Store in IV

All of that happens in sequential order, as you just have one thread that both consumes and processes the messages.

  kafka-->(read)-->(process 1)-->(process 2)-->(process 3)-->(process 4)

In this case, if any of steps 2 to 5 runs into trouble and its processing speed drops at some point, your entire pipeline slows down. The Kafka topic's lag will then keep growing for as long as the thread cannot finish the 5th step before a new message arrives at Kafka.

For me, this is a no-no regarding performance and fault tolerance.
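
The coupling described above can be sketched as follows. This is only an illustration, assuming a plain list as a stand-in for the Kafka topic and in-memory lists as stand-ins for the four tables; `run_single_consumer` is a hypothetical name.

```python
def run_single_consumer(messages, steps):
    """Read each message and run every storing step in order on one thread."""
    for msg in messages:        # stand-in for polling Kafka
        for step in steps:      # steps 2 to 5: store in tables I to IV
            step(msg)           # a slow or failing step holds up everything after it

# Lists stand in for the four SQL tables.
tables = {name: [] for name in ("I", "II", "III", "IV")}
steps = [tables[name].append for name in tables]

run_single_consumer([{"sale_id": 1}, {"sale_id": 2}], steps)
```

Because the next message is not read until the last step returns, the time per message is the sum of all four steps.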

2 - Four [consumer-producer]s

This uses the same paradigm as the first scenario: the thread that reads is also responsible for the processing.

But, thanks to consumer groups, you can parallelize the whole process. Create 4 different consumer groups, each with its own consumer.

For simplicity, let's just create one thread per consumer group.

In this scenario, you have something like:

CONSUMER CG1
1-Reading from Kafka
2-Process/Store in I

CONSUMER CG2
1-Reading from Kafka
2-Process/Store in II

CONSUMER CG3
1-Reading from Kafka
2-Process/Store in III

CONSUMER CG4
1-Reading from Kafka
2-Process/Store in IV

       |-->consumer 1-->(process1)-->T1
  kafka|-->consumer 2-->(process2)-->T2
       |-->consumer 3-->(process3)-->T3
       |-->consumer 4-->(process4)-->T4

Advantages: each thread is responsible for a limited number of tasks. This will help with the lag of each consumer group.

Furthermore, if one of the storing tasks fails or slows down, that won't affect the other three threads: they will continue reading from Kafka and processing on their own.
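
The independence of the four groups can be sketched like this. It is a simulation only, assuming a shared list as a stand-in for the topic; with a real Kafka client, each consumer would instead be created with its own `group.id`, which is what makes every group receive all messages. The `consume` function and table names are hypothetical.

```python
import threading

# Stand-in for the Kafka topic: an append-only list of events.
topic = [{"sale_id": i, "amount": 10.0} for i in range(5)]
tables = {"T1": [], "T2": [], "T3": [], "T4": []}

def consume(group_id, table):
    # Every consumer group receives all messages and tracks its own offset,
    # so each thread reads the full topic independently of the others.
    for offset, msg in enumerate(topic):
        table.append((group_id, offset, msg["sale_id"]))

threads = [
    threading.Thread(target=consume, args=(f"CG{i}", tables[f"T{i}"]))
    for i in range(1, 5)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each table ends up with all five events, written by its own thread at its own pace.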

3 - Decouple consuming and processing

This is, in my opinion, by far the best solution.

You separate the task of reading from the task of processing. This way you can launch, for example:

One consumer thread

This just reads the messages from Kafka and stores them in in-memory queues, or similar structures that are accessible from the worker threads, and that's all. It just keeps reading and putting messages in the queues.

X worker threads (in this case, 4)

These threads are responsible for getting the messages that the consumer put in the queue (or queues, depending on how you want to code it), and processing/storing the messages in each table.

Something like:

                            |--> queue1 -----> worker 1 --> T1
  kafka--->consumer--(msg)--|--> queue2 -----> worker 2 --> T2
                            |--> queue3 -----> worker 3 --> T3
                            |--> queue4 -----> worker 4 --> T4

What you get here is parallelization, and decoupling of processing from consuming. Here Kafka's lag will be 0 about 99% of the time.

In this approach, the queues are the ones that act as buffers if some of the workers get stuck. The rest of the system (mainly Kafka) will not be affected by the processing logic.
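
A minimal sketch of this layout, assuming Python's standard `queue` and `threading` modules, a list as a stand-in for the Kafka topic, and lists as stand-ins for the tables. The `consumer`, `worker`, and `STOP` names are hypothetical.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker to shut down

def consumer(messages, queues):
    # Only reads and fans out: each message goes to every table's queue.
    for msg in messages:            # stand-in for polling Kafka
        for q in queues:
            q.put(msg)
    for q in queues:
        q.put(STOP)

def worker(q, table):
    # Drains one queue and "stores" each message in its own table.
    while True:
        msg = q.get()
        if msg is STOP:
            break
        table.append(msg)           # stand-in for the SQL insert/update

messages = [{"sale_id": i} for i in range(100)]
queues = [queue.Queue() for _ in range(4)]
tables = [[] for _ in range(4)]

workers = [
    threading.Thread(target=worker, args=(q, t)) for q, t in zip(queues, tables)
]
for w in workers:
    w.start()
consumer(messages, queues)
for w in workers:
    w.join()
```

One thing this sketch leaves out is offset handling: once reading and processing are separated, committing offsets safely requires knowing that the workers have actually finished with a message.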

Note that even though Kafka won't start lagging (and possibly losing messages due to retention), the internal queues must be monitored, or configured properly to send lagging messages to a dead-letter queue, so that the consumer does not get stuck.
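
One possible policy, sketched under assumptions: the internal queue is bounded, and when it stays full the consumer diverts the new message to a dead-letter list rather than blocking forever. The `offer` helper, the timeout, and the list used as a dead-letter queue are all hypothetical.

```python
import queue

def offer(q, msg, dead_letters, timeout=0.01):
    """Try to enqueue msg; if the queue stays full, divert it instead of blocking."""
    try:
        q.put(msg, timeout=timeout)
        return True
    except queue.Full:
        dead_letters.append(msg)   # stand-in for a real dead-letter queue or topic
        return False

work_queue = queue.Queue(maxsize=2)   # bounded, so a stuck worker cannot exhaust memory
dead_letters = []

# No worker is draining the queue here, which simulates a stuck worker.
results = [offer(work_queue, {"sale_id": i}, dead_letters) for i in range(4)]
```

The first two messages fit in the queue and the next two are diverted. For monitoring, `work_queue.qsize()` gives an approximate depth that can be exported as a metric.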

The KafkaConsumer javadoc explains the pros and cons of each paradigm in more detail.

To summarize the advantages of the third scenario:

The consumer thread just consumes. This avoids Kafka lag, delays in the data that must be processed (remember, this should be near real-time), and loss of messages because of retention kicking in.

The other X workers are responsible for the actual processing logic. If something fails in one of them, no other consumer or worker thread is affected.
