I am running an Airflow DAG that uses the KubernetesPodOperator to execute tasks in a Kubernetes cluster. However, several of my tasks are being terminated with a SIGTERM signal before they reach their defined execution_timeout.
Note that I don't recall seeing this issue before I updated Airflow. I honestly don't think it has to do with the Airflow version; I suspect a misconfiguration in my DAGs or in my Kubernetes/Helm setup.
My packages:
awswrangler==2.19.0
apache-airflow==2.7.1
apache-airflow-providers-cncf-kubernetes==7.5.1
apache-airflow-providers-amazon==8.6.0
boto3==1.28.39
gnupg==2.3.1
PyYAML==6.0.1
Here's the error I'm seeing in the logs:
[2023-10-09, 13:52:40 UTC] {local_task_job_runner.py:115} ERROR - Received SIGTERM. Terminating subprocesses
...
[2023-10-09, 13:52:40 UTC] {taskinstance.py:1630} ERROR - Received SIGTERM. Terminating subprocesses.
...
[2023-10-09, 13:52:40 UTC] {taskinstance.py:1935} ERROR - Task failed with exception
In my DAG, I've set the execution_timeout for each task to 24 hours:
execution_timeout=timedelta(hours=24)
However, the task is being terminated around the 11-hour mark or earlier.
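One way to confirm that the SIGTERM really arrives before the timeout is to compare the elapsed runtime with the configured execution_timeout. The following is only a minimal sketch: the SIGTERM timestamp is taken from the log above, while the start timestamp and the helper names (parse_log_ts, killed_before_timeout) are hypothetical and chosen purely for illustration.

```python
from datetime import datetime, timedelta

# Timestamp layout as it appears in the Airflow log lines above.
LOG_TS_FORMAT = "%Y-%m-%d, %H:%M:%S UTC"


def parse_log_ts(ts: str) -> datetime:
    """Parse a timestamp such as '2023-10-09, 13:52:40 UTC'."""
    return datetime.strptime(ts, LOG_TS_FORMAT)


def killed_before_timeout(start_ts: str, sigterm_ts: str, execution_timeout: timedelta):
    """Return (was_early, elapsed) for a task that received SIGTERM."""
    elapsed = parse_log_ts(sigterm_ts) - parse_log_ts(start_ts)
    return elapsed < execution_timeout, elapsed


# ASSUMPTION: the start time is a made-up value for illustration only.
start_ts = "2023-10-09, 02:52:40 UTC"
# Taken from the SIGTERM log line above.
sigterm_ts = "2023-10-09, 13:52:40 UTC"

early, elapsed = killed_before_timeout(start_ts, sigterm_ts, timedelta(hours=24))
print(early, elapsed)
```

If the first value is True, the kill did not come from execution_timeout, so the cause has to be looked for elsewhere.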
Here's a snippet from my DAG:
with DAG(
    ...
) as dag:
    my_process = KubernetesPodOperator(
        ...
        execution_timeout=timedelta(hours=24),
        container_resources=RESOURCES['medium'],
        ...
        **pod_args
    )
    ...
And from my config.py:
RESOURCES = {
    ...
    'medium': client.V1ResourceRequirements(
        requests={"cpu": "2000m", "memory": "2Gi"},
        limits={"cpu": "2000m", "memory": "2Gi"}
    ),
    ...
}
pod_args = {
    'namespace': "airflow",
    'service_account_name': "airflow",
    'image_pull_secrets': [k8s.V1LocalObjectReference("docker-registry")],
    'env_vars': {
        "EXECUTION_DATE": "{{ execution_date }}",
    },
    'in_cluster': True,
    'get_logs': True,
    'on_finish_action': 'delete_succeeded_pod',
}
I've ensured that the resources are appropriately allocated and there are no issues with the Kubernetes cluster itself.
Has anyone encountered a similar issue, or can anyone offer insight into why a task might receive a SIGTERM before its execution_timeout is reached? Any help or guidance would be greatly appreciated!