I have a gRPC server running in Python. The request flow is client => gce-ingress => envoy => grpc-server. The deployment YAML manifests are listed below.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: envoy-deployment
  labels:
    app: envoy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: envoy
  template:
    metadata:
      labels:
        app: envoy
    spec:
      containers:
        - name: envoy
          image: envoyproxy/envoy:v1.22.5
          ports:
            - containerPort: 9901
          readinessProbe:
            httpGet:
              port: 9901
              httpHeaders:
                - name: x-envoy-livenessprobe
                  value: healthz
              path: /healthz
              scheme: HTTPS
          livenessProbe:
            httpGet:
              port: 9901
              httpHeaders:
                - name: x-envoy-livenessprobe
                  value: healthz
              path: /healthz
              scheme: HTTPS
          volumeMounts:
            - name: config
              mountPath: /etc/envoy
            - name: certs
              mountPath: /etc/ssl/envoy
      volumes:
        - name: config
          configMap:
            name: envoy-conf
        - name: certs
          secret:
            secretName: secret-tls
---
apiVersion: v1
kind: Service
metadata:
  name: envoy-deployment-service
  annotations:
    cloud.google.com/backend-config: '{"ports": {"443":"envoy-app-backend-config"}}'
    cloud.google.com/neg: '{"ingress": true}'
spec:
  ports:
    - protocol: TCP
      port: 443
      targetPort: 9901
  selector:
    app: envoy
  type: NodePort
  externalTrafficPolicy: Local
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-server
  labels:
    app: server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: server
  template:
    metadata:
      labels:
        app: server
    spec:
      containers:
        - name: server
          image: gcr.io/project/image:latest
          command: ["python3", "/var/app/api/main.py"]
          imagePullPolicy: Always
          volumeMounts:
            - mountPath: /secrets/gcloud-auth
              name: gcloud-auth
              readOnly: true
          ports:
            - containerPort: 8000
      volumes:
        - name: gcloud-auth
          secret:
            secretName: gcloud
---
apiVersion: v1
kind: Service
metadata:
  name: app-server-headless
spec:
  type: ClusterIP
  clusterIP: None
  selector:
    app: server
  ports:
    - name: grpc
      port: 8000
      targetPort: 8000
      protocol: TCP
My Envoy config (the envoy-conf ConfigMap mounted above) looks like this:
apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-conf
data:
  envoy.yaml: |
    admin:
      access_log_path: /tmp/admin_access.log
      address:
        socket_address: { address: 127.0.0.1, port_value: 9902 }
    static_resources:
      listeners:
        - name: listener_0
          address:
            socket_address: { address: 0.0.0.0, port_value: 9901 }
          filter_chains:
            - filters:
                - name: envoy.filters.network.http_connection_manager
                  typed_config:
                    "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                    codec_type: auto
                    access_log:
                      - name: envoy.access_loggers.file
                        typed_config:
                          "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
                          path: "/dev/stdout"
                    stat_prefix: ingress_https
                    route_config:
                      name: local_route
                      virtual_hosts:
                        - name: envoy_service
                          domains: ["*"]
                          routes:
                            #- match:
                            #    prefix: "/healthz"
                            #  direct_response: { status: 200, body: { inline_string: "ok it is working now" } }
                            - match:
                                prefix: "/heal"
                              direct_response: { status: 200, body: { inline_string: "ok heal is working now" } }
                            - match:
                                prefix: "/envoy/"
                              route: { prefix_rewrite: "/", cluster: envoy_service }
                            - match:
                                prefix: "/"
                              route: { prefix_rewrite: "/", cluster: envoy_service }
                          cors:
                            allow_origin_string_match:
                              - prefix: "*"
                            allow_methods: GET, PUT, DELETE, POST, OPTIONS
                            allow_headers: keep-alive,user-agent,cache-control,content-type,content-transfer-encoding,custom-header-1,x-accept-content-transfer-encoding,x-accept-response-streaming,x-user-agent,x-grpc-web,grpc-timeout
                            max_age: "1728000"
                            expose_headers: custom-header-1,grpc-status,grpc-message
                    http_filters:
                      - name: envoy.filters.http.cors
                        typed_config:
                          "@type": type.googleapis.com/envoy.extensions.filters.http.cors.v3.Cors
                      - name: envoy.filters.http.grpc_web
                        typed_config:
                          "@type": type.googleapis.com/envoy.extensions.filters.http.grpc_web.v3.GrpcWeb
                      - name: envoy.filters.http.health_check
                        typed_config:
                          "@type": type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck
                          pass_through_mode: false
                          headers:
                            - name: ":path"
                              exact_match: "/healthz"
                            - name: "x-envoy-livenessprobe"
                              exact_match: "healthz"
                      - name: envoy.filters.http.router
                        typed_config:
                          "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
              transport_socket:
                name: envoy.transport_sockets.tls
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
                  require_client_certificate: false
                  common_tls_context:
                    tls_certificates:
                      - certificate_chain:
                          filename: /etc/ssl/envoy/tls.crt
                        private_key:
                          filename: /etc/ssl/envoy/tls.key
                    alpn_protocols: [ "h2,http/1.1" ]
      clusters:
        - name: envoy_service
          connect_timeout: 0.50s
          type: strict_dns
          http2_protocol_options: {}
          lb_policy: round_robin
          common_lb_config:
            healthy_panic_threshold:
              value: 0.0
          load_assignment:
            cluster_name: envoy_service
            endpoints:
              - lb_endpoints:
                  - endpoint:
                      address:
                        socket_address:
                          address: app-server-headless
                          port_value: 8000
          health_checks:
            - timeout: 1s
              interval: 5s
              unhealthy_threshold: 2
              healthy_threshold: 2
              grpc_health_check:
                service_name: envoy_service.Health
Everything seems to be working fine except when a new version of the app-server deployment is rolled out. Requests start to fail during the window between the new pod becoming ready and the old pod being terminated. The gRPC client requests fail with the error no healthy upstream UNAVAILABLE, (14, 'unavailable'). This only happens while the older revision of the pod is being terminated. Should Envoy handle this automatically? What additional config is required?
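To make the failure mode concrete, here is a minimal sketch of the kind of client loop that surfaces the errors during a rollout. The hostname is a placeholder for my ingress address, the auth interceptor is ignored, and it simply hammers the standard gRPC health-check RPC that the server already exposes:

import time

import grpc
from grpc_health.v1 import health_pb2, health_pb2_grpc

# Placeholder endpoint: the external address served by gce-ingress -> envoy.
channel = grpc.secure_channel("api.example.com:443", grpc.ssl_channel_credentials())
stub = health_pb2_grpc.HealthStub(channel)

while True:
    try:
        resp = stub.Check(health_pb2.HealthCheckRequest(service=""), timeout=2)
        print("OK", resp.status)
    except grpc.RpcError as err:
        # During the rollout this prints UNAVAILABLE / "no healthy upstream".
        print("FAIL", err.code(), err.details())
    time.sleep(0.5)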
The API server has a gRPC health check and a handle_sigterm method that gracefully shuts down the server when a shutdown signal is received. The API gRPC server and the health server run on the same port.
from concurrent import futures
from signal import signal, SIGTERM, SIGINT

import grpc
from grpc_health.v1 import health_pb2, health_pb2_grpc

# Project-specific imports (api_pb2_grpc, APIServiceServicer, AuthInterceptor,
# LOG, args, access_key) are omitted here for brevity.


class HealthServicer(health_pb2_grpc.HealthServicer):
    def __init__(self):
        self.status = health_pb2.HealthCheckResponse.SERVING

    def Check(self, request, context):
        response = health_pb2.HealthCheckResponse()
        response.status = self.status
        LOG.info(f"server is healthy with response {response}")
        return response


if __name__ == "__main__":
    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=40),
        interceptors=(AuthInterceptor(access_key),),
    )
    api_pb2_grpc.add_APIServiceServicer_to_server(
        APIServiceServicer(args.env), server
    )
    health_pb2_grpc.add_HealthServicer_to_server(HealthServicer(), server)
    server.add_insecure_port("0.0.0.0:8000")
    server.start()
    LOG.info("server started")

    def handle_sigterm(*_):
        LOG.error("Received shutdown signal")
        # do something
        all_rpc_done_event = server.stop(10)
        all_rpc_done_event.wait(10)
        LOG.error("Shut down gracefully")

    signal(SIGTERM, handle_sigterm)
    signal(SIGINT, handle_sigterm)
    server.wait_for_termination()
The thing is that the number of errors seen at the client is correlated with the time passed to server.stop() and the event.wait() call. If I increase the time to 30s, I see more errors than when it is 10s. Shouldn't Envoy's health check automatically detect unhealthy pods and stop routing traffic to them? What am I missing here?
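For completeness, this is roughly how I can watch Envoy's own view of the upstream health during a rollout, via the admin interface (bound to 127.0.0.1:9902 in the config above, so it needs a kubectl port-forward into the Envoy pod). This is just a diagnostic sketch, not part of the deployment:

import time
import urllib.request

# Assumes: kubectl port-forward deploy/envoy-deployment 9902:9902
ADMIN = "http://127.0.0.1:9902/clusters"

while True:
    body = urllib.request.urlopen(ADMIN, timeout=2).read().decode()
    # Print only the health-flag lines for the grpc upstream endpoints.
    for line in body.splitlines():
        if "envoy_service::" in line and "health_flags" in line:
            print(time.strftime("%H:%M:%S"), line)
    time.sleep(1)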