We have a simple API service that has two parts:
- API Gateway service
- Search Logic service
Our API Gateway service is exposed to the internet via a GKE Ingress and works fine. It performs authentication, validation, and request aggregation before sending requests on to the second Search Logic service. Requests usually take only a few hundred milliseconds, sometimes 2-3 seconds, and occasionally up to 5 seconds. We're handling around 100-300 requests per second, with approximately 6 and 11 pods for the two services respectively.
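To make the flow concrete, a single gateway request looks roughly like this (an illustrative sketch only; the handler shape, the payload fields, and the Express-style req/res are simplified stand-ins, not our real code):

import axios from 'axios';

// Illustrative gateway handler: authenticate, validate, aggregate, then forward
async function handleSearch(req, res) {
  // authentication and validation happen first (omitted here)
  // request aggregation builds the payload we forward to the Search Logic service
  const payload = { query: req.body.query }; // illustrative fields only

  // this forwarded call is where the random ECONNREFUSED errors show up
  const response = await axios.post('http://search-api:80/logic/endpoint', payload, {
    headers: { 'X-Request-Id': req.headers['x-request-id'] }
  });

  res.json(response.data);
}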
However, whenever a pod that is part of the second Search Logic service terminates (e.g. due to a scale-down event or a rolling update), our API Gateway service gets random ECONNREFUSED errors when sending requests to that service.
We checked the logs in our Search Logic service, and when these errors happen the service never actually receives those requests. After reading into this, we added a preStop hook that runs sleep 60 to account for the time it takes for a terminating pod to be removed from the ClusterIP Service's endpoints (based on what we saw here). That hook should delay the SIGTERM, and while it did reduce the frequency of the errors, we're still seeing them intermittently on scaling events, and very frequently when performing a rolling update.
Here are the relevant fields from our Search Logic service's Service and Deployment:
apiVersion: v1
kind: Service
metadata:
  name: search-api
spec:
  ports:
    - port: 80
      targetPort: 8080
  selector:
    app: search-api
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: search-api
spec:
  minReadySeconds: 5
  selector:
    matchLabels:
      app: search-api
  strategy:
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: search-api
    spec:
      containers:
        - image: image@sha256
          imagePullPolicy: Always
          lifecycle:
            preStop:
              exec:
                command:
                  - sleep
                  - "60"
          livenessProbe:
            failureThreshold: 2
            httpGet:
              path: /livez
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 5
          name: search-api
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 5
          resources:
            limits:
              cpu: 1000m
              memory: 300Mi
            requests:
              cpu: 250m
              memory: 256Mi
      nodeSelector:
        iam.gke.io/gke-metadata-server-enabled: "true"
      serviceAccountName: my-service-account
      terminationGracePeriodSeconds: 90
      topologySpreadConstraints:
        - labelSelector:
            matchLabels:
              app: search-api
          maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
        - labelSelector:
            matchLabels:
              app: search-api
          maxSkew: 1
          topologyKey: node
          whenUnsatisfiable: ScheduleAnyway
And here's a sample of the code from our API Gateway service that calls the Search Logic service; it's a very simple HTTP call:
import axios from 'axios';

// payload and request_id are built earlier in the request handler
await axios({
  method: 'POST',
  url: `http://search-api:80/logic/endpoint`,
  data: payload,
  headers: {
    connection: 'close',
    'Content-Type': 'application/json',
    Accept: 'application/json',
    'X-Request-Id': request_id
  }
});
We added the connection: close header because we were worried the errors were being caused by reused keep-alive connections, but it doesn't seem to have solved the problem.
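For reference, sending connection: close per request should be roughly equivalent to disabling keep-alive at the HTTP agent level; a minimal sketch of that agent-level variant (assuming Node's built-in http module and axios's httpAgent option) would look like this:

import http from 'http';
import axios from 'axios';

// With keepAlive disabled on the agent, sockets are never reused across requests,
// which is the agent-level equivalent of the connection: close header above.
const searchClient = axios.create({
  baseURL: 'http://search-api:80',
  httpAgent: new http.Agent({ keepAlive: false })
});

// e.g. await searchClient.post('/logic/endpoint', payload, { headers: { 'X-Request-Id': request_id } });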