In Airflow with dbt, why are tasks consuming so many resources if Airflow just sends queries to Snowflake?


I have an Airflow deployment on AWS MWAA, and we have a dbt model which we call from a DAG. The dbt queries run against a Snowflake warehouse, which does the actual computation. However, we have a task that spawns 60 concurrent subtasks; it is basically a for loop that formats the same query with a different parameter each time, roughly as in the sketch below.
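For context, here is a minimal sketch of the kind of DAG described above, assuming Airflow 2.x and dbt invoked via `BashOperator`; the DAG id, model name, and `partition` variable are illustrative, not the original code:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_parallel_example",   # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # The for loop creates 60 parallel tasks, each running the same dbt model
    # with a different parameter. The SQL itself executes on Snowflake.
    for param in range(60):
        BashOperator(
            task_id=f"dbt_run_param_{param}",
            bash_command=(
                "dbt run --select my_model "
                f"--vars '{{\"partition\": {param}}}'"
            ),
        )
```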

Those 60 concurrent tasks force us to scale Airflow's resources up considerably. I understand that concurrency has resource implications, but in this case it requires several times more workers and RAM than the non-parallel version.

My question is: if the Snowflake warehouse does the actual processing and Airflow is just the orchestrator, why does Airflow demand such an increase in resources when it's only sending queries? Or am I wrong about something here? Why does Airflow need several times more RAM to manage 60 queries that it sent to Snowflake?



Answer by KingQQ:

My guess is that although Airflow only submits the queries, the results are returned from the data source and reused by the following tasks, which is why you see RAM usage increase.

It is similar to a multi-threaded download process; the difference is that a download writes the data to disk, whereas Airflow has no persistence step here (or may not), so all data that will be reused has to be kept in RAM.
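To illustrate the point about results being held in worker memory, here is a minimal sketch contrasting two task styles, assuming the apache-airflow-providers-snowflake package and a `snowflake_default` connection; the table names are made up:

```python
from airflow.decorators import task
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook


@task
def fetch_into_memory():
    # Expensive pattern for large results: the entire result set is pulled
    # back into the worker process, so 60 concurrent copies of this task
    # need roughly 60x the RAM.
    hook = SnowflakeHook(snowflake_conn_id="snowflake_default")
    rows = hook.get_records("SELECT * FROM big_table")
    return rows  # also serialized into XCom, adding further overhead


@task
def push_down_to_snowflake():
    # Cheaper pattern: the computation and the result stay in Snowflake;
    # the worker only holds a connection and a small status response.
    hook = SnowflakeHook(snowflake_conn_id="snowflake_default")
    hook.run(
        "CREATE OR REPLACE TABLE result_table AS "
        "SELECT category, SUM(amount) AS total FROM big_table GROUP BY category"
    )
```

If the DAG's tasks follow the second pattern (or simply shell out to dbt, which keeps the work in the warehouse), the per-task memory footprint should stay small, and most of the growth comes from the worker processes themselves rather than from the query results.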