Intermittent response when querying Cluster IP in Kubernetes cluster

We're experiencing an issue when we query one service in our Kubernetes cluster from another pod. Occasionally the request goes through, but 99% of the time it fails. The same thing happens when we query our kube-dns service directly:

/ # nslookup kubernetes.default.svc.cluster.local.
;; connection timed out; no servers could be reached

I can see the above request come through in the CoreDNS logs, so I don't think it's a DNS resolution issue:

[INFO] 10.2.56.172:53295 - 403 "A IN kubernetes.default.svc.cluster.local. udp 54 false 512" NOERROR qr,aa,rd 106 0.000158582s

Similar failures happen when trying to hit any service in our cluster. Here you can see it work once and then fail right after:

/ # dig http://172.20.234.169:80

; <<>> DiG 9.11.6-P1 <<>> http://172.20.234.169:80
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 2504
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 16db935520e8e061 (echoed)
;; QUESTION SECTION:
;http://172.20.234.169:80.  IN  A

;; AUTHORITY SECTION:
.           30  IN  SOA a.root-servers.net. nstld.verisign-grs.com. 2021072201 1800 900 604800 86400

;; Query time: 52 msec
;; SERVER: 172.20.0.10#53(172.20.0.10)
;; WHEN: Thu Jul 22 20:12:33 UTC 2021
;; MSG SIZE  rcvd: 140

/ # dig http://172.20.234.169:80

; <<>> DiG 9.11.6-P1 <<>> http://172.20.234.169:80
;; global options: +cmd
;; connection timed out; no servers could be reached
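
To put a number on "99% of the time", the same command can be repeated and the results counted. The following is only a minimal sketch, assuming a POSIX shell inside the pod and that the probed command exits non-zero on failure; `probe` is a hypothetical helper name, not part of any tool. Note also that `dig` only issues DNS queries, so `http://172.20.234.169:80` is looked up as a hostname (hence the NXDOMAIN); an HTTP client is needed to test the service itself.

```shell
# probe N CMD [ARGS...]: run CMD N times and print how many runs
# succeeded (exit status 0) and how many failed.
probe() {
  n="$1"; shift
  ok=0; fail=0; i=0
  while [ "$i" -lt "$n" ]; do
    if "$@" >/dev/null 2>&1; then
      ok=$((ok + 1))
    else
      fail=$((fail + 1))
    fi
    i=$((i + 1))
  done
  echo "ok=$ok fail=$fail"
}

# Example invocations (assumptions: nslookup and wget exist in the image):
#   probe 20 nslookup kubernetes.default.svc.cluster.local. 172.20.0.10
#   probe 20 wget -q -O /dev/null -T 2 http://172.20.234.169:80
```

Running the probe from pods on different nodes may show whether the failures depend on which node the client or the CoreDNS pod is on.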

Our setup:

  1. Cloud Provider: AWS running EKS
  2. 4 nodes running on k8s 1.20; core-dns 1.8
  3. Setup is completely private running in its own VPC with 4 subnets

Other info:

⇒  kubectl get pods -n kube-system
NAME                                          READY   STATUS    RESTARTS   AGE
aws-node-8n9r6                                1/1     Running   0          2d2h
aws-node-gpd5p                                1/1     Running   0          2d2h
aws-node-mdl98                                1/1     Running   0          2d3h
aws-node-tff7q                                1/1     Running   0          2d3h
coredns-55cd7f87dc-csnnk                      1/1     Running   0          4h3m
coredns-55cd7f87dc-d4bl2                      1/1     Running   0          4h3m
coredns-55cd7f87dc-hkj85                      1/1     Running   0          4h3m
coredns-55cd7f87dc-ms4kx                      1/1     Running   0          4h3m
kube-proxy-77zdf                              1/1     Running   0          130m
kube-proxy-fv8tc                              1/1     Running   0          129m
kube-proxy-nklhv                              1/1     Running   0          129m
kube-proxy-wvvmf                              1/1     Running   0          129m
seldon-spartakus-volunteer-5b57b95596-gsk2d   1/1     Running   0          2d3h

⇒  kubectl get svc kube-dns -n kube-system
NAME       TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)         AGE
kube-dns   ClusterIP   172.20.0.10   <none>        53/UDP,53/TCP   22d

/ # cat /etc/resolv.conf
nameserver 172.20.0.10
search lpa.svc.cluster.local svc.cluster.local cluster.local us-west-2.compute.internal
options ndots:5

Any ideas on how to resolve this?
