Airflow KubernetesPodOperator Task Terminated with SIGTERM Before Reaching Execution Timeout


I am running an Airflow DAG that uses the KubernetesPodOperator to execute tasks in a Kubernetes cluster. Several of my tasks are being terminated with a SIGTERM signal before they reach their defined execution_timeout.

Note that I don't recall seeing this issue before I updated Airflow. Still, I honestly don't think it has to do with the Airflow version; I suspect a misconfiguration in my DAGs or in my Kubernetes/Helm setup.

My packages

awswrangler==2.19.0
apache-airflow==2.7.1
apache-airflow-providers-cncf-kubernetes==7.5.1
apache-airflow-providers-amazon==8.6.0
boto3==1.28.39
gnupg==2.3.1
PyYAML==6.0.1

Here's the error I'm seeing in the logs:

[2023-10-09, 13:52:40 UTC] {local_task_job_runner.py:115} ERROR - Received SIGTERM. Terminating subprocesses
...
[2023-10-09, 13:52:40 UTC] {taskinstance.py:1630} ERROR - Received SIGTERM. Terminating subprocesses.
...
[2023-10-09, 13:52:40 UTC] {taskinstance.py:1935} ERROR - Task failed with exception

In my DAG, I've set the execution_timeout for each task to 24 hours:

execution_timeout=timedelta(hours=24)

However, the task is being terminated around the 11-hour mark or earlier.
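
To confirm that the SIGTERM really arrives before the timeout, the elapsed run time can be compared against execution_timeout. This is only a minimal sketch: the helper names are made up for illustration, and the start timestamp is hypothetical (the real one would come from the task's first log line). Only the SIGTERM timestamp is taken from the log above.

```python
from datetime import datetime, timedelta, timezone

# Same value as the execution_timeout set on the task.
EXECUTION_TIMEOUT = timedelta(hours=24)


def parse_log_ts(raw: str) -> datetime:
    """Parse a timestamp in the format shown in the log lines above."""
    parsed = datetime.strptime(raw, "%Y-%m-%d, %H:%M:%S UTC")
    return parsed.replace(tzinfo=timezone.utc)


def killed_before_timeout(
    started: str, sigterm: str, timeout: timedelta = EXECUTION_TIMEOUT
) -> bool:
    """Return True if the SIGTERM arrived before the timeout elapsed."""
    elapsed = parse_log_ts(sigterm) - parse_log_ts(started)
    return elapsed < timeout


# Hypothetical start time (not from my logs), 11 hours before the SIGTERM line.
print(killed_before_timeout("2023-10-09, 02:52:40 UTC", "2023-10-09, 13:52:40 UTC"))
```

If this returns True for a failed run, the kill did not come from execution_timeout itself, so something else must have sent the signal.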

Here's a snippet from my DAG:

with DAG(
    ...
) as dag:
    my_process = KubernetesPodOperator(
        ...
        execution_timeout=timedelta(hours=24),
        container_resources=RESOURCES['medium'],
        ...
        **pod_args
    )
    ...

And from my config.py:

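# Imports matching the `client` and `k8s` names used below
from kubernetes import client
from kubernetes.client import models as k8s

RESOURCES = {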
    ...
    'medium': client.V1ResourceRequirements(
        requests={"cpu": "2000m", "memory": "2Gi"},
        limits={"cpu": "2000m", "memory": "2Gi"}
    ),
    ...
}

pod_args = {
    'namespace': "airflow",
    'service_account_name': "airflow",
    'image_pull_secrets': [k8s.V1LocalObjectReference("docker-registry")],
    'env_vars': {
        "EXECUTION_DATE": "{{ execution_date }}",
    },
    'in_cluster': True,
    'get_logs': True,
    'on_finish_action': 'delete_succeeded_pod',
}

I've ensured that the resources are appropriately allocated and there are no issues with the Kubernetes cluster itself.

Has anyone encountered a similar issue or can provide insights into why the task might be receiving a SIGTERM before the execution_timeout is reached? Any help or guidance would be greatly appreciated!
