In Airflow with dbt, why are tasks consuming so many resources if Airflow just sends queries to Snowflake?


I have an Airflow deployment on AWS MWAA, and we have a dbt model which we call from a DAG. The dbt queries run against a Snowflake warehouse, which does the actual computation. However, we have a task that spawns 60 concurrent subtasks; it is basically a for loop that formats the same query with a different parameter each time, roughly as in the sketch below.
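For context, here is a minimal sketch of the kind of DAG described above, assuming Airflow 2.x and dbt invoked via `BashOperator`; the DAG id, model name, and `partition` variable are illustrative, not the original code:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_parallel_example",   # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # The for loop creates 60 parallel tasks, each running the same dbt model
    # with a different parameter. The SQL itself executes on Snowflake.
    for param in range(60):
        BashOperator(
            task_id=f"dbt_run_param_{param}",
            bash_command=(
                "dbt run --select my_model "
                f"--vars '{{\"partition\": {param}}}'"
            ),
        )
```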

Those 60 concurrent tasks force us to scale Airflow's resources up considerably. I understand that concurrency has resource implications, but in this case it requires several times more workers and RAM than the non-parallel version.

My question is: if the Snowflake warehouse does the actual processing and Airflow is just the orchestrator, why does Airflow demand such an increase in resources when it's only sending queries? Or am I wrong about something here? Why does Airflow need several times more RAM to manage 60 queries that it sent to Snowflake?



Answer by KingQQ:

My guess is that although Airflow only submits the queries, the results are returned from the data source and reused by the following tasks, which is why you see RAM usage increase.

It is similar to a multi-threaded download process; the difference is that a download writes the data to disk, whereas Airflow has no persistence step here (or may not), so all data that will be reused has to be kept in RAM.
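To illustrate the point about results being held in worker memory, here is a minimal sketch contrasting two task styles, assuming the apache-airflow-providers-snowflake package and a `snowflake_default` connection; the table names are made up:

```python
from airflow.decorators import task
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook


@task
def fetch_into_memory():
    # Expensive pattern for large results: the entire result set is pulled
    # back into the worker process, so 60 concurrent copies of this task
    # need roughly 60x the RAM.
    hook = SnowflakeHook(snowflake_conn_id="snowflake_default")
    rows = hook.get_records("SELECT * FROM big_table")
    return rows  # also serialized into XCom, adding further overhead


@task
def push_down_to_snowflake():
    # Cheaper pattern: the computation and the result stay in Snowflake;
    # the worker only holds a connection and a small status response.
    hook = SnowflakeHook(snowflake_conn_id="snowflake_default")
    hook.run(
        "CREATE OR REPLACE TABLE result_table AS "
        "SELECT category, SUM(amount) AS total FROM big_table GROUP BY category"
    )
```

If the DAG's tasks follow the second pattern (or simply shell out to dbt, which keeps the work in the warehouse), the per-task memory footprint should stay small, and most of the growth comes from the worker processes themselves rather than from the query results.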