One consumer to multiple tables or many consumers per table


I have a Kafka topic with millions of sale events. I have a consumer which, on every message, inserts the data into 4 tables:

  • 1 for the raw sales
  • 1 for the sales sum by date by product category (date, product_category, sale_sum)
  • 1 for the sales sum by date by customer (date, customer_id, sale_sum)
  • 1 for the sales sum by date by location (date, location_id, sale_sum)

I use a SQL database for storing my data, so the operations above are insert or update operations.

I am wondering, would it be better to have (i) 1 consumer inserting into all 4 tables or (ii) 4 consumers, each responsible for inserting into one table?

What is best practice here?

Thanks

aran On

From my point of view, you have three different alternatives. Anyway, to be honest, I'd personally choose the third one.



1 - One [consumer-producer] thread

In this scenario, you have just one thread that is responsible for:

1-Reading from Kafka
2-Process/Store in I
3-Process/Store in II
4-Process/Store in III
5-Process/Store in IV

All of that happens in sequential order, as you have just one thread that both consumes and processes the messages.

  kafka-->(read)-->(process 1)-->(process 2)-->(process 3)-->(process 4)

In this case, if any of steps 2 to 5 degrades and processing slows down at some point, your entire process slows down with it. The Kafka topic's lag will then keep growing for as long as the thread cannot finish step 5 before a new message arrives.

For me, this is a no-no regarding performance and fault-tolerance
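To make the sequential flow concrete, here is a minimal sketch of scenario 1. It is only an illustration under assumptions: the events are plain dicts standing in for messages polled from Kafka, the table names (raw_sales, sales_by_category, sales_by_customer, sales_by_location) and the "amount" field are hypothetical, and SQLite (3.24 or newer, for the upsert syntax) stands in for whatever SQL database you use.

```python
import sqlite3

# Hypothetical aggregate tables and the event field each one groups by.
AGGREGATES = [
    ("sales_by_category", "product_category"),
    ("sales_by_customer", "customer_id"),
    ("sales_by_location", "location_id"),
]


def create_tables(conn):
    conn.execute(
        "CREATE TABLE raw_sales "
        "(date, product_category, customer_id, location_id, amount)"
    )
    for table, key in AGGREGATES:
        conn.execute(
            f"CREATE TABLE {table} "
            f"(date, {key}, sale_sum REAL, PRIMARY KEY (date, {key}))"
        )


def upsert_sum(conn, table, key, event):
    # Insert a new (date, key) row, or add the amount to the existing sum.
    conn.execute(
        f"INSERT INTO {table} (date, {key}, sale_sum) VALUES (?, ?, ?) "
        f"ON CONFLICT (date, {key}) "
        "DO UPDATE SET sale_sum = sale_sum + excluded.sale_sum",
        (event["date"], event[key], event["amount"]),
    )


def process_sequentially(conn, events):
    # `events` stands in for the messages read from Kafka (step 1).
    for event in events:
        conn.execute(  # step 2: raw sales
            "INSERT INTO raw_sales VALUES (?, ?, ?, ?, ?)",
            (event["date"], event["product_category"], event["customer_id"],
             event["location_id"], event["amount"]),
        )
        for table, key in AGGREGATES:  # steps 3 to 5: the three sums
            upsert_sum(conn, table, key, event)
    conn.commit()
```

Every message waits for all four writes before the next one is read, so a single slow table delays everything behind it.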



2 - Four [consumer-producer]s

This uses the same paradigm as the first scenario: the thread that reads is also responsible for the processing.

But, thanks to consumer groups, you can parallelize the whole process. Create 4 different groups and assign each one to a consumer.

For simplicity, let's just create one thread per consumer group.

In this scenario, you have something like:

CONSUMER CG1
1-Reading from Kafka
2-Process/Store in I

CONSUMER CG2
1-Reading from Kafka
2-Process/Store in II

CONSUMER CG3
1-Reading from Kafka
2-Process/Store in III

CONSUMER CG4
1-Reading from Kafka
2-Process/Store in IV

       |-->consumer 1-->(process1)-->T1
  kafka|-->consumer 2-->(process2)-->T2
       |-->consumer 3-->(process3)-->T3
       |-->consumer 4-->(process4)-->T4

Advantages: each thread is responsible for a limited number of tasks. This will help with the lag of each consumer group.

Furthermore, if one of the storing tasks fails or slows down, that won't affect the other three threads: they will continue reading and processing from Kafka on their own.
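A rough simulation of this scenario, with no real Kafka involved: a Python list stands in for the topic, and each thread keeps its own offset, which mimics how each consumer group receives every message independently. The group names (cg-raw, cg-category, ...), the "amount" field, and the dict/list "tables" are assumptions made for the example.

```python
import threading


def store_raw(event, table):
    # "Table" for the raw sales: just keep every event.
    table.append(event)


def sum_by(key):
    # Build a handler that keeps a running sale_sum per (date, key).
    def handle(event, table):
        group = (event["date"], event[key])
        table[group] = table.get(group, 0.0) + event["amount"]
    return handle


def consume(log, handler, table):
    # Each consumer group tracks its own position in the topic.
    offset = 0
    while offset < len(log):
        handler(log[offset], table)
        offset += 1


def run_four_groups(log):
    groups = {
        "cg-raw": (store_raw, []),
        "cg-category": (sum_by("product_category"), {}),
        "cg-customer": (sum_by("customer_id"), {}),
        "cg-location": (sum_by("location_id"), {}),
    }
    threads = [
        threading.Thread(target=consume, args=(log, handler, table))
        for handler, table in groups.values()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return {name: table for name, (_, table) in groups.items()}
```

Since every thread only touches its own table and its own offset, a slow handler in one group leaves the other three untouched.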



3 - Decouple consuming and processing

This is, in my opinion, by far the best possible solution.

You divide the tasks of reading and the tasks of processing. This way, you can for example launch:

  • One consumer thread

    This just reads the messages from Kafka and stores them in in-memory queues, or similar structures that are accessible from the worker threads, and that's all. It just keeps reading and putting the messages in the queues.

  • X worker threads (in this case, 4)

    These threads are responsible for getting the messages that the consumer put in the queue (or queues, depending on how you want to code it), and processing/storing the messages in each table.

Something like:

                            |--> queue1 -----> worker 1 --> T1
  kafka--->consumer--(msg)--|--> queue2 -----> worker 2 --> T2
                            |--> queue3 -----> worker 3 --> T3
                            |--> queue4 -----> worker 4 --> T4

What you get here is parallelization and the decoupling of processing from consuming. Here, Kafka's lag will be 0 about 99% of the time.

In this approach, the queues are the ones that act as buffers if some of the workers get stuck. The rest of the system (mainly Kafka) will not be affected by the processing logic.
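A minimal sketch of this layout, again with a list standing in for the Kafka topic and dicts/lists standing in for the four tables (all names here, including the "amount" field and the STOP sentinel, are hypothetical). Offset commits and what happens to queued messages if the process crashes are deliberately left out.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker that no more messages will come


def store_raw(event, table):
    table.append(event)


def sum_by(key):
    # Build a handler that keeps a running sale_sum per (date, key).
    def handle(event, table):
        group = (event["date"], event[key])
        table[group] = table.get(group, 0.0) + event["amount"]
    return handle


def consumer(events, queues):
    # Only reads and hands every message to each queue; no processing here.
    for event in events:
        for q in queues:
            q.put(event)  # blocks while a bounded queue is full
    for q in queues:
        q.put(STOP)


def worker(q, handler, table):
    while True:
        event = q.get()
        if event is STOP:
            return
        handler(event, table)


def run_decoupled(events, maxsize=1000):
    handlers = {
        "raw": (store_raw, []),
        "category": (sum_by("product_category"), {}),
        "customer": (sum_by("customer_id"), {}),
        "location": (sum_by("location_id"), {}),
    }
    queues = {name: queue.Queue(maxsize=maxsize) for name in handlers}
    threads = [
        threading.Thread(target=worker, args=(queues[name], handler, table))
        for name, (handler, table) in handlers.items()
    ]
    threads.append(
        threading.Thread(target=consumer, args=(events, list(queues.values())))
    )
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return {name: table for name, (_, table) in handlers.items()}
```

Note that with a bounded queue the consumer blocks on put() once a queue fills up, so the buffer only absorbs a stuck worker up to maxsize messages.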

Note that even though Kafka won't start lagging and possibly losing messages due to retention, the internal queues must be monitored, or configured properly to send the lagged messages inside the queue to a dead-letter queue, in order to keep the consumer from getting stuck.
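One possible way to do that, sketched with Python's standard queue module: the consumer offers each message with a timeout, and if the worker's queue is still full it diverts the message to a dead-letter list instead of blocking forever. The offer helper, the dead_letters list, and the timeout value are assumptions for the example; diverting the newest message is just one policy among several.

```python
import queue


def offer(q, dead_letters, event, timeout=0.05):
    # Try to enqueue; if the worker is too far behind, park the event
    # in the dead-letter list so the consumer can keep reading.
    try:
        q.put(event, timeout=timeout)
        return True
    except queue.Full:
        dead_letters.append(event)
        return False


def depth(q):
    # Approximate backlog of a worker queue, useful as a monitoring metric.
    return q.qsize()
```

For example, with queue.Queue(maxsize=2) and no worker draining it, the third call to offer returns False and the event ends up in dead_letters.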




This is from the KafkaConsumer javadoc, which better explains the pros and cons of each paradigm:

[Two images: excerpts from the KafkaConsumer javadoc]


A simple diagram showing the advantages of the third scenario:

[Image: diagram of the third scenario]

The consumer thread just consumes. This avoids Kafka lag, delays in the data that must be processed (remember, this should be near real-time), and loss of messages because of retention kicking in.

The other X workers are responsible for the actual processing logic. If something fails in one of them, no other consumer or worker thread is affected.