Envoy routing traffic to unhealthy servers


I have a gRPC server running in Python. The request flow is client => gce-ingress => envoy => grpc-server. The deployment YAML is listed below.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: envoy-deployment
  labels:
    app: envoy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: envoy
  template:
    metadata:
      labels:
        app: envoy
    spec:
      containers:
      - name: envoy
        image: envoyproxy/envoy:v1.22.5
        ports:
        - containerPort: 9901
        readinessProbe:
          httpGet:
            port: 9901
            httpHeaders:
            - name: x-envoy-livenessprobe
              value: healthz
            path: /healthz
            scheme: HTTPS
        livenessProbe:
          httpGet:
            port: 9901
            httpHeaders:
              - name: x-envoy-livenessprobe
                value: healthz
            path: /healthz
            scheme: HTTPS
        volumeMounts:
        - name: config
          mountPath: /etc/envoy
        - name: certs
          mountPath: /etc/ssl/envoy
      volumes:
      - name: config
        configMap:
          name: envoy-conf
      - name: certs
        secret:
          secretName: secret-tls
---
apiVersion: v1
kind: Service
metadata:
  name: envoy-deployment-service
  annotations:
    cloud.google.com/backend-config: '{"ports": {"443":"envoy-app-backend-config"}}'
    cloud.google.com/neg: '{"ingress": true}'
spec:
  ports:
  - protocol: TCP
    port: 443
    targetPort: 9901
  selector:
    app: envoy
  type: NodePort
  externalTrafficPolicy: Local
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-server
  labels:
    app: server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: server
  template:
    metadata:
      labels:
        app: server
    spec:
      containers:
      - name: server
        image: gcr.io/project/image:latest
        command: ["python3" , "/var/app/api/main.py"]
        imagePullPolicy: Always
        volumeMounts:
        - mountPath: /secrets/gcloud-auth
          name: gcloud-auth
          readOnly: true
        ports:
        - containerPort: 8000
      volumes:
      - name: gcloud-auth
        secret:
          secretName: gcloud
---
apiVersion: v1
kind: Service
metadata:
  name: app-server-headless
spec:
  type: ClusterIP
  clusterIP: None
  selector:
    app: server
  ports:
    - name: grpc
      port: 8000
      targetPort: 8000
      protocol: TCP
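
The envoy-app-backend-config referenced in the Service annotation is along these lines (a sketch; the actual health-check values in my object may differ, but it points the GCE health check at Envoy's /healthz over HTTPS):

apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: envoy-app-backend-config
spec:
  healthCheck:
    checkIntervalSec: 15
    timeoutSec: 5
    type: HTTPS
    requestPath: /healthz
    port: 9901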

My Envoy config looks like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-conf
data:
  envoy.yaml: |
    admin:
      access_log_path: /tmp/admin_access.log
      address:
        socket_address: { address: 127.0.0.1, port_value: 9902 }

    static_resources:
      listeners:
        - name: listener_0
          address:
            socket_address: { address:  0.0.0.0, port_value: 9901 }
          filter_chains:
            - filters:
              - name: envoy.filters.network.http_connection_manager
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                  codec_type: auto
                  access_log:
                  - name: envoy.access_loggers.file
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
                      path: "/dev/stdout"
                  stat_prefix: ingress_https
                  route_config:
                    name: local_route
                    virtual_hosts:
                      - name: envoy_service
                        domains: ["*"]
                        routes:
                        #- match:
                        #   prefix: "/healthz"
                        #  direct_response: { status: 200, body: { inline_string: "ok it is working now" } }
                        - match:
                           prefix: "/heal"
                          direct_response: { status: 200, body: { inline_string: "ok heal is working now" } }
                        
                        - match:
                            prefix: "/envoy/"
                          route: {
                            prefix_rewrite: "/",
                            cluster: envoy_service
                          }
    
                        - match:
                            prefix: "/"
                          route: {
                            prefix_rewrite: "/",
                            cluster: envoy_service
                          }
                        cors:
                          allow_origin_string_match:
                            - prefix: "*"
                          allow_methods: GET, PUT, DELETE, POST, OPTIONS
                          allow_headers: keep-alive,user-agent,cache-control,content-type,content-transfer-encoding,custom-header-1,x-accept-content-transfer-encoding,x-accept-response-streaming,x-user-agent,x-grpc-web,grpc-timeout
                          max_age: "1728000"
                          expose_headers: custom-header-1,grpc-status,grpc-message
                  http_filters:
                    - name: envoy.filters.http.cors
                      typed_config:
                        "@type": type.googleapis.com/envoy.extensions.filters.http.cors.v3.Cors
                    - name: envoy.filters.http.grpc_web
                      typed_config:
                        "@type": type.googleapis.com/envoy.extensions.filters.http.grpc_web.v3.GrpcWeb
                    - name: envoy.filters.http.health_check
                      typed_config:
                        "@type": type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck
                        pass_through_mode: false
                        headers:
                        - name: ":path"
                          exact_match: "/healthz"
                        - name: "x-envoy-livenessprobe"
                          exact_match: "healthz"
                    - name: envoy.filters.http.router
                      typed_config:
                        "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
                    
              transport_socket:
                  name: envoy.transport_sockets.tls
                  typed_config:
                    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
                    require_client_certificate: false
                    common_tls_context:
                      tls_certificates:
                      - certificate_chain:
                          filename: /etc/ssl/envoy/tls.crt
                        private_key:
                          filename: /etc/ssl/envoy/tls.key
                      alpn_protocols: [ "h2,http/1.1" ]
      clusters:
        - name: envoy_service
          connect_timeout: 0.50s
          type: strict_dns
          http2_protocol_options: {}
          lb_policy: round_robin
          common_lb_config:
              healthy_panic_threshold:
                value: 0.0
          load_assignment:
            cluster_name: envoy_service
            endpoints:
              - lb_endpoints:
                - endpoint:
                    address:
                      socket_address:
                        address: app-server-headless
                        port_value: 8000
          health_checks:
            - timeout: 1s
              interval: 5s
              unhealthy_threshold: 2
              healthy_threshold: 2
              grpc_health_check:
                service_name: envoy_service.Health
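
If I understand the grpc_health_check above correctly, Envoy periodically calls grpc.health.v1.Health/Check on each endpoint with the service field set to envoy_service.Health. Something like the following should be equivalent to that probe against a single pod (assuming a kubectl port-forward so the pod is reachable on localhost:8000, and ignoring the AuthInterceptor):

# rough stand-in for the probe Envoy sends; localhost:8000 assumes a
# port-forward to one app-server pod
import grpc
from grpc_health.v1 import health_pb2, health_pb2_grpc

channel = grpc.insecure_channel("localhost:8000")
stub = health_pb2_grpc.HealthStub(channel)
resp = stub.Check(health_pb2.HealthCheckRequest(service="envoy_service.Health"))
print(health_pb2.HealthCheckResponse.ServingStatus.Name(resp.status))  # SERVING while the pod is up

(My Check implementation below ignores request.service, so it answers SERVING regardless of the service name Envoy sends.)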

Everything seems to work fine except when a new version of the app-server deployment is rolled out. Requests start to fail during the window in which the new pod is becoming ready and the old pod is being terminated. The gRPC client fails with the error no healthy upstream UNAVAILABLE, (14, 'unavailable'). This only happens while the older revision of the pod is terminating. Shouldn't Envoy handle this automatically? What additional config is required?
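
For completeness, this is roughly how I observe the failures from a client while a rollout is in progress; INGRESS_HOST is a placeholder for my real ingress hostname, and I use the health stub here only because any unary call shows the same behaviour:

# hypothetical observation loop; INGRESS_HOST stands in for the real ingress
# hostname behind which gce-ingress forwards to Envoy
import time
import grpc
from grpc_health.v1 import health_pb2, health_pb2_grpc

channel = grpc.secure_channel("INGRESS_HOST:443", grpc.ssl_channel_credentials())
stub = health_pb2_grpc.HealthStub(channel)

while True:
    try:
        stub.Check(health_pb2.HealthCheckRequest())
        print("ok")
    except grpc.RpcError as e:
        # during the window where the old pod terminates this prints
        # StatusCode.UNAVAILABLE no healthy upstream
        print(e.code(), e.details())
    time.sleep(0.5)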

The API server exposes a gRPC health check and a handle_sigterm handler that gracefully shuts the server down when a shutdown signal is received. The API gRPC server and the health servicer run on the same port.

from concurrent import futures
from signal import SIGINT, SIGTERM, signal

import grpc
from grpc_health.v1 import health_pb2, health_pb2_grpc

# project-specific pieces (api_pb2_grpc, APIServiceServicer, AuthInterceptor,
# LOG, args, access_key) are defined elsewhere


class HealthServicer(health_pb2_grpc.HealthServicer):
    def __init__(self):
        self.status = health_pb2.HealthCheckResponse.SERVING

    def Check(self, request, context):
        response = health_pb2.HealthCheckResponse()
        response.status = self.status
        LOG.info(f"server is healthy with response {response}")
        return response


if __name__ == "__main__":
    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=40),
        interceptors=(AuthInterceptor(access_key),),
    )
    api_pb2_grpc.add_APIServiceServicer_to_server(
        APIServiceServicer(args.env), server
    )
    health_pb2_grpc.add_HealthServicer_to_server(HealthServicer(), server)
    server.add_insecure_port("0.0.0.0:8000")
    server.start()
    LOG.info("server started")

    def handle_sigterm(*_):
        LOG.error("Received shutdown signal")
        # do something
        all_rpc_done_event = server.stop(10)
        all_rpc_done_event.wait(10)
        LOG.error("Shut down gracefully")

    signal(SIGTERM, handle_sigterm)
    signal(SIGINT, handle_sigterm)
    server.wait_for_termination()

The thing is that the number of errors thrown at the client is correlated with the time passed to server.stop() and event.wait(). If I increase the time to 30s I see more errors than when it is 10s. Shouldn't Envoy's health check automatically detect the unhealthy pod and stop routing traffic to it? What am I missing here?
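
For clarity, this is the kind of drain sequence I assumed I would need if Envoy's active health check is supposed to take the pod out of rotation before it stops serving. The NOT_SERVING flip is not in my current code; it is only here to illustrate the question (it keeps a reference to the servicer and reuses server, LOG, health_pb2 and health_pb2_grpc from the snippet above):

import time

# hypothetical variant of handle_sigterm: mark the health service NOT_SERVING
# first, wait longer than interval (5s) * unhealthy_threshold (2) = 10s so the
# active health check can flag the endpoint, then stop the server
health_servicer = HealthServicer()
health_pb2_grpc.add_HealthServicer_to_server(health_servicer, server)

def handle_sigterm(*_):
    LOG.error("Received shutdown signal")
    health_servicer.status = health_pb2.HealthCheckResponse.NOT_SERVING
    time.sleep(15)  # give Envoy time to see the failing checks
    all_rpc_done_event = server.stop(10)
    all_rpc_done_event.wait(10)
    LOG.error("Shut down gracefully")

Is something along these lines (plus a matching terminationGracePeriodSeconds) actually required, or should the health check configuration above already cover the termination window on its own?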
