Pgpool pod in CrashLoopBackOff after running for a day

Setup:

  1. 1 replica pgpool with 1 replica postgresql
  2. Runs fine for a day and then fails with the error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: container init was OOM-killed (memory limit too low?): unknown. Memory usage is around 150 MB, the limit is 500 MB, and node memory utilization is around 20% (see the check below).
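
For reference, these numbers can be confirmed roughly as follows (a sketch; kubectl top assumes metrics-server is installed, and the pod and node names are taken from the describe output below):

    # actual memory usage of the pgpool pod
    kubectl top pod pgha-pgpool-cf54985bb-lbxns -n <namespace>

    # configured requests/limits on the container
    kubectl get pod pgha-pgpool-cf54985bb-lbxns -n <namespace> \
      -o jsonpath='{.spec.containers[0].resources}'

    # node-level memory utilization
    kubectl top node ip-172-31-22-150.us-west-1.compute.internal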

Deployment:

kubectl describe pod pgha-pgpool-cf54985bb-lbxns
Name:             pgha-pgpool-cf54985bb-lbxns
Namespace:        <namespace>
Priority:         0
Service Account:  default
Node:             ip-172-31-22-150.us-west-1.compute.internal/172.31.22.150
Start Time:       Thu, 15 Feb 2024 14:49:50 +0530
Labels:           app.kubernetes.io/component=pgpool
                  app.kubernetes.io/instance=postgresql-ha
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=postgresql-ha
                  helm.sh/chart=postgresql-ha-9.4.11
                  pod-template-hash=cf54985bb
                  service=pgpool
Annotations:      kubectl.kubernetes.io/restartedAt: 2024-02-08T22:40:52+05:30
Status:           Running
IP:               172.31.29.198
IPs:
  IP:           172.31.29.198
Controlled By:  ReplicaSet/pgha-pgpool-cf54985bb
Containers:
  pgpool:
    Container ID:   containerd://c60a94b7c6941da5c7386a8ba7394996c32ecda0cce34ff71fe87a0ebb4e4b74
    Image:          docker.io/bitnami/pgpool:4.5.0-debian-11-r8
    Image ID:       docker.io/bitnami/pgpool@sha256:23c5a3267561ec57af19759f5eb2d47affbd77f25d931bc4e777dcb73cd145ce
    Port:           5432/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: container init was OOM-killed (memory limit too low?): unknown
      Exit Code:    128
      Started:      Thu, 01 Jan 1970 05:30:00 +0530
      Finished:     Fri, 16 Feb 2024 10:09:32 +0530
    Ready:          False
    Restart Count:  190
    Limits:
      cpu:     1
      memory:  500Mi
    Requests:
      cpu:      1
      memory:   300Mi
    Liveness:   exec [/opt/bitnami/scripts/pgpool/healthcheck.sh] delay=30s timeout=10s period=20s #success=1 #failure=5
    Readiness:  exec [bash -ec PGPASSWORD=${PGPOOL_POSTGRES_PASSWORD} psql -U "admin" -d "keycloak" -h /opt/bitnami/pgpool/tmp -tA -c "SELECT 1" >/dev/null] delay=5s timeout=10s period=20s #success=1 #failure=5
    Environment:
      BITNAMI_DEBUG:                                       false
      PGPOOL_BACKEND_NODES:                                0:pgha-postgresql-0.pgha-postgresql-headless:5432,
      PGPOOL_SR_CHECK_USER:                                repmgr
      PGPOOL_SR_CHECK_PASSWORD:                            <set to the key 'repmgr-password' in secret 'pgha-postgresql'>  Optional: false
      PGPOOL_SR_CHECK_DATABASE:                            postgres
      PGPOOL_ENABLE_LDAP:                                  no
      PGPOOL_POSTGRES_USERNAME:                            admin
      PGPOOL_POSTGRES_PASSWORD:                            <set to the key 'postgresql-password' in secret 'pgha-postgresql'>  Optional: false
      PGPOOL_ADMIN_USERNAME:                               pgpool
      PGPOOL_ADMIN_PASSWORD:                               <set to the key 'admin-password' in secret 'pgha-pgpool'>  Optional: false
      PGPOOL_AUTHENTICATION_METHOD:                        scram-sha-256
      PGPOOL_ENABLE_LOAD_BALANCING:                        yes
      PGPOOL_DISABLE_LOAD_BALANCE_ON_WRITE:                transaction
      PGPOOL_ENABLE_LOG_CONNECTIONS:                       no
      PGPOOL_ENABLE_LOG_HOSTNAME:                          yes
      PGPOOL_ENABLE_LOG_PER_NODE_STATEMENT:                no
      PGPOOL_NUM_INIT_CHILDREN:                            3
      PGPOOL_MAX_POOL:                                     20
      PGPOOL_CHILD_MAX_CONNECTIONS:                        100
      PGPOOL_CHILD_LIFE_TIME:
      PGPOOL_ENABLE_TLS:                                   no
      NEW_RELIC_METADATA_KUBERNETES_CLUSTER_NAME:          k8s-cluster
      NEW_RELIC_METADATA_KUBERNETES_NODE_NAME:              (v1:spec.nodeName)
      NEW_RELIC_METADATA_KUBERNETES_NAMESPACE_NAME:        akridata (v1:metadata.namespace)
      NEW_RELIC_METADATA_KUBERNETES_POD_NAME:              pgha-pgpool-cf54985bb-lbxns (v1:metadata.name)
      NEW_RELIC_METADATA_KUBERNETES_CONTAINER_NAME:        pgpool
      NEW_RELIC_METADATA_KUBERNETES_CONTAINER_IMAGE_NAME:  docker.io/bitnami/pgpool:4.5.0-debian-11-r8
      NEW_RELIC_METADATA_KUBERNETES_DEPLOYMENT_NAME:       pgha-pgpool
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cxcdq (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-cxcdq:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Warning  BackOff  4m31s (x4441 over 15h)  kubelet  Back-off restarting failed container pgpool in pod pgha-pgpool-cf54985bb-lbxns_akridata(f62400e2-2152-43eb-a78a-941896cff390)

There is 1 answer below.

Sai Chandra Gadde answered:

As the error suggests, the container is being OOM-killed, yet the exit code is 128. Try the troubleshooting steps below to resolve your issue:

Exit 128:

As per this document by Nir Shtein

Exit Code 128 indicates that code within the container triggered an exit command but did not provide a valid exit code. The Linux exit command only allows integers between 0-255, so if the process was exited with, for example, exit code 3.5, the logs will report Exit Code 128.

To troubleshoot this error, check the container logs and pod events for more information using the following command:

kubectl logs <pod-name> -c <container-name>
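
For this particular pod, that would look something like the following (the --previous flag shows logs from the crashed attempt, if any were produced, and the field selector narrows events to this pod):

    kubectl logs pgha-pgpool-cf54985bb-lbxns -c pgpool -n <namespace> --previous

    kubectl get events -n <namespace> \
      --field-selector involvedObject.name=pgha-pgpool-cf54985bb-lbxns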

But the main issue here is the OOM kill: the container reports exit code 128, while the error message itself refers to an OOM kill.

The usual exit code for an OOM-killed process is 137, which is 128 + 9, where 9 corresponds to the SIGKILL signal, indicating that the process was forcefully terminated. In your case the OOM kill happened while the container was still being created (hence Reason: StartError in the describe output), which is why the generic exit code 128 is reported instead.
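
As a quick local illustration of the 128 + signal convention (plain bash, nothing Kubernetes-specific):

    # kill a background process with SIGKILL (signal 9) and read its status:
    # the shell reports 128 + 9 = 137
    sleep 60 &
    kill -9 $!
    wait $!
    echo $?   # prints 137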

An OOM kill can have many causes, but as per your logs the memory limit is too low and you need to increase it. You can do that by adjusting the limits and requests in the deployment YAML, for example with kubectl edit deployment/myapp-deployment:

 containers:
    - image: nginx
      imagePullPolicy: Always
      name: default-mem-demo-ctr
      resources:
        limits:
          memory: 3Gi  # <-- hard limit; the container is OOM-killed if it exceeds this
        requests:
          memory: 1Gi  # <-- guaranteed amount; actual usage may grow up to the 3Gi limit

Then redeploy the deployment using kubectl apply -f <file>, and let me know if this resolves your issue.
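
If you would rather not edit the YAML by hand, kubectl set resources achieves the same result in one command (deployment name taken from the labels above; the values here are only an example, sized above the ~150 MB observed usage):

    kubectl -n <namespace> set resources deployment pgha-pgpool \
      --requests=memory=500Mi --limits=memory=1Gi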

For diagnosis and resolution steps to avoid OOM-kill errors, follow this document and blog by Nir Shtein.
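
As a starting point for that diagnosis, checking the last recorded termination state and the kernel log on the node can confirm whether the kernel OOM killer was involved (a sketch; node access depends on your environment):

    # last termination state recorded for the container
    kubectl -n <namespace> get pod pgha-pgpool-cf54985bb-lbxns \
      -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'

    # on the node itself, look for kernel OOM-killer entries
    dmesg -T | grep -i -E 'oom|killed process'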