Event-based architecture with Kafka and Airflow


I'm looking for options to transform and migrate data from a source Postgres-backed app to a Postgres table in a different database.

  1. We first need to do an initial one-time load
  2. Then continuously ingest incremental changes

The producer of the data will push the changes to a Kafka topic. I'm looking for ways to consume this topic and upsert the data into the target table with the minimum latency possible (ideally < 5 minutes).
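For reference, the per-record upsert I have in mind is just a Postgres `INSERT ... ON CONFLICT`. Here is a rough sketch; the table, columns and connection details are placeholders, not my real schema:

```python
import psycopg2

# Hypothetical target table "target_schema.orders" with primary key "id".
UPSERT_SQL = """
INSERT INTO target_schema.orders AS t (id, status, amount, updated_at)
VALUES (%(id)s, %(status)s, %(amount)s, %(updated_at)s)
ON CONFLICT (id) DO UPDATE
SET status = EXCLUDED.status,
    amount = EXCLUDED.amount,
    updated_at = EXCLUDED.updated_at
WHERE EXCLUDED.updated_at > t.updated_at;  -- skip stale/out-of-order events
"""

def upsert_record(conn, record: dict) -> None:
    """Apply one transformed change event to the target table."""
    with conn.cursor() as cur:
        cur.execute(UPSERT_SQL, record)

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=target user=etl")  # placeholder DSN
    upsert_record(conn, {"id": 1, "status": "shipped", "amount": 42.0,
                         "updated_at": "2024-01-01T00:00:00Z"})
    conn.commit()
```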

One possible architecture that came to mind is a consumer that continuously polls the Kafka topic, applies the transformations, and upserts into the target table.
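A minimal sketch of what I mean, using confluent-kafka (the broker, topic name, group id and the transformation itself are placeholders):

```python
import json

from confluent_kafka import Consumer

conf = {
    "bootstrap.servers": "broker:9092",   # placeholder
    "group.id": "pg-upsert-consumer",     # placeholder
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,          # only commit after a successful upsert
}

consumer = Consumer(conf)
consumer.subscribe(["source.changes"])    # placeholder topic name

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            raise RuntimeError(msg.error())
        record = json.loads(msg.value())  # apply the transformation here
        # ... upsert `record` into the target table (see the upsert sketch above)
        consumer.commit(message=msg, asynchronous=False)
finally:
    consumer.close()
```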

I'm also interested in whether I can have a sensor in Airflow that listens to a Kafka topic, detects any new messages after the last processed offset, and then triggers a task that consumes up to the end of the topic. Once the polled offsets have been transformed and upserted into the target database, the task would commit the last processed offset back to Kafka.
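If I understand the Kafka provider for Airflow (apache-airflow-providers-apache-kafka) correctly, the pattern would look roughly like the sketch below. The Kafka connection id, topic name and the `transforms` module are placeholders, I'm assuming Airflow 2.6+ for `@continuous`, and I'm not certain the parameter names are exact:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.kafka.sensors.kafka import AwaitMessageSensor
from airflow.providers.apache.kafka.operators.consume import ConsumeFromTopicOperator

with DAG(
    dag_id="kafka_to_postgres_upsert",
    start_date=datetime(2024, 1, 1),
    schedule="@continuous",   # re-run as soon as the previous run finishes
    max_active_runs=1,
    catchup=False,
) as dag:

    # Defer until at least one message past the committed offset appears.
    wait_for_message = AwaitMessageSensor(
        task_id="wait_for_message",
        kafka_config_id="kafka_default",          # placeholder connection
        topics=["source.changes"],                # placeholder topic
        apply_function="transforms.any_message",  # hypothetical: truthy for any message
    )

    # Read from the last committed offset, transform + upsert, commit per batch.
    consume_and_upsert = ConsumeFromTopicOperator(
        task_id="consume_and_upsert",
        kafka_config_id="kafka_default",
        topics=["source.changes"],
        apply_function="transforms.upsert_message",  # hypothetical transform + upsert
        commit_cadence="end_of_batch",
        max_messages=10_000,
        poll_timeout=60,
    )

    wait_for_message >> consume_and_upsert
```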

Some questions I have about the above pattern:

  1. My understanding is that the Kafka sensors in Airflow fetch messages one by one, apply the transformation function, and then commit the offset back. If the velocity of messages in the topic increases, will this cause issues in Airflow? Will Airflow trigger a sensor action for each message?

Ideally, I would want one task that senses the presence of messages in the topic past the last processed offset, and then another task that reads the topic continuously, processes messages towards the end of the topic, and stops the consumer once it hasn't seen any messages for, say, x seconds or minutes. In other words, the sensor acts as a notification for starting another job/task.
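Roughly, the "read until idle, then stop" part would look like this (the idle window and the `process` callback are placeholders):

```python
import time

from confluent_kafka import Consumer

def drain_topic(consumer: Consumer, process, idle_seconds: float = 60.0) -> int:
    """Consume until no message has arrived for `idle_seconds`, then return."""
    handled = 0
    last_seen = time.monotonic()
    while time.monotonic() - last_seen < idle_seconds:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue                      # nothing this second; keep waiting
        if msg.error():
            raise RuntimeError(msg.error())
        process(msg)                      # transform + upsert (not shown here)
        consumer.commit(message=msg, asynchronous=False)
        handled += 1
        last_seen = time.monotonic()
    return handled
```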

What I'm trying to avoid in Airflow is a long list of tasks, one per message received. I don't expect messages in the topic outside of work hours, but during work hours they can arrive at unpredictable velocities. Hence I would like to avoid a streaming solution like Spark or Flink that runs continuously. I could use Spark's trigger-once option, but I need to understand how to make that event-triggered based on Airflow and Kafka.
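For completeness, this is the kind of Spark trigger-once / available-now job I would otherwise launch from Airflow (e.g. downstream of the Kafka sensor). Broker, topic and checkpoint locations are placeholders, and the actual upsert logic is omitted:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-postgres-once").getOrCreate()

df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
    .option("subscribe", "source.changes")              # placeholder
    .load()
)

def upsert_batch(batch_df, batch_id):
    # Transform and upsert the micro-batch into Postgres
    # (e.g. JDBC staging table + ON CONFLICT); details omitted.
    pass

query = (
    df.writeStream.foreachBatch(upsert_batch)
    .option("checkpointLocation", "/tmp/checkpoints/source.changes")
    .trigger(availableNow=True)   # Spark 3.3+; trigger(once=True) on older versions
    .start()
)
query.awaitTermination()          # stops once the backlog is processed
```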

Another option I see is to have Kafka Connect continuously push the data to a Postgres landing table in the target database, and then have a scheduled job in Airflow that runs every minute, identifies incremental rows via a metadata table that tracks the last processed event_time, applies the transformation, and upserts into the target table.
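A rough sketch of that second option, assuming Kafka Connect lands raw rows in a landing table and the Airflow DAG only does the watermark-based upsert (all table, column and connection names are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

# Move everything newer than the stored watermark, then advance the watermark.
INCREMENTAL_UPSERT = """
WITH wm AS (
    SELECT last_event_time FROM etl.watermark WHERE pipeline = 'changes'
), moved AS (
    INSERT INTO target.orders AS t (id, status, amount, event_time)
    SELECT l.id, l.status, l.amount, l.event_time
    FROM landing.changes l, wm
    WHERE l.event_time > wm.last_event_time
    ON CONFLICT (id) DO UPDATE
        SET status = EXCLUDED.status,
            amount = EXCLUDED.amount,
            event_time = EXCLUDED.event_time
    RETURNING event_time
)
UPDATE etl.watermark
SET last_event_time = (SELECT max(event_time) FROM moved)
WHERE pipeline = 'changes'
  AND EXISTS (SELECT 1 FROM moved);
"""

with DAG(
    dag_id="incremental_upsert_every_minute",
    start_date=datetime(2024, 1, 1),
    schedule="*/1 * * * *",
    max_active_runs=1,
    catchup=False,
) as dag:
    SQLExecuteQueryOperator(
        task_id="apply_incremental_upsert",
        conn_id="target_postgres",   # placeholder connection
        sql=INCREMENTAL_UPSERT,
    )
```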

I would like to understand your opinion on both architectures.
