We're experiencing an issue when we try to query one service in our Kubernetes cluster from another pod. Occasionally the request goes through, but about 99% of the time it fails. The same thing happens when we query our kube-dns service directly:
/ # nslookup kubernetes.default.svc.cluster.local.
;; connection timed out; no servers could be reached
I can see the above request arrive in the CoreDNS logs, where it is answered with NOERROR, so I don't think it's a DNS resolution issue:
[INFO] 10.2.56.172:53295 - 403 "A IN kubernetes.default.svc.cluster.local. udp 54 false 512" NOERROR qr,aa,rd 106 0.000158582s
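To put a number on the failure rate, a small loop like the one below could be run from inside the pod. This is only a sketch: `failure_rate` is a hypothetical helper (not something from our setup), it assumes `dig` is available in the pod image and exits non-zero when no server replies, and `<coredns-pod-ip>` is a placeholder for the IP of one CoreDNS pod.

```shell
#!/bin/sh
# failure_rate <attempts> <command> [args...]
# Hypothetical helper: runs the command <attempts> times and prints how many
# of those runs exited with a non-zero status.
failure_rate() {
  attempts=$1
  shift
  failed=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Discard the output; only the exit status matters here.
    if ! "$@" >/dev/null 2>&1; then
      failed=$((failed + 1))
    fi
    i=$((i + 1))
  done
  echo "$failed/$attempts failed"
}

# Example usage (uncomment inside the pod). The first line goes through the
# kube-dns ClusterIP, the second straight to one CoreDNS pod.
# failure_rate 20 dig +time=2 +tries=1 @172.20.0.10 kubernetes.default.svc.cluster.local.
# failure_rate 20 dig +time=2 +tries=1 @<coredns-pod-ip> kubernetes.default.svc.cluster.local.
```

If the ClusterIP fails far more often than a direct pod IP, that would point towards service routing rather than CoreDNS itself, though this is an assumption to verify rather than a conclusion.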
Similar failures happen when trying to hit any service in our cluster. Here you can see it work once and then fail right after:
/ # dig http://172.20.234.169:80
; <<>> DiG 9.11.6-P1 <<>> http://172.20.234.169:80
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 2504
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 16db935520e8e061 (echoed)
;; QUESTION SECTION:
;http://172.20.234.169:80. IN A
;; AUTHORITY SECTION:
. 30 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2021072201 1800 900 604800 86400
;; Query time: 52 msec
;; SERVER: 172.20.0.10#53(172.20.0.10)
;; WHEN: Thu Jul 22 20:12:33 UTC 2021
;; MSG SIZE rcvd: 140
/ # dig http://172.20.234.169:80
; <<>> DiG 9.11.6-P1 <<>> http://172.20.234.169:80
;; global options: +cmd
;; connection timed out; no servers could be reached
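Note that `dig` treats its argument as a DNS name, so the two runs above are queries to the DNS server at 172.20.0.10 rather than HTTP requests to 172.20.234.169. To exercise the service itself, an HTTP client could be used instead. A minimal sketch, assuming `curl` is installed in the pod image; `http_code` is a hypothetical helper name:

```shell
#!/bin/sh
# http_code <url>
# Hypothetical helper: prints only the HTTP status code of the response.
# curl prints 000 when no HTTP response was received (for example on a
# connection timeout). -m 5 limits each attempt to 5 seconds.
http_code() {
  curl -s -o /dev/null -m 5 -w '%{http_code}' "$1"
}

# Example usage (uncomment inside the pod); the address is the service IP
# from the dig runs above.
# http_code http://172.20.234.169:80
```

Addressing the service by IP keeps DNS out of the picture, so a failure here would not be explained by name resolution alone.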
Our setup:
- Cloud Provider: AWS running EKS
- 4 nodes running Kubernetes 1.20 with CoreDNS 1.8
- The setup is completely private, running in its own VPC with 4 subnets
Other info:
⇒ kubectl get pods -n kube-system
NAME                                          READY   STATUS    RESTARTS   AGE
aws-node-8n9r6                                1/1     Running   0          2d2h
aws-node-gpd5p                                1/1     Running   0          2d2h
aws-node-mdl98                                1/1     Running   0          2d3h
aws-node-tff7q                                1/1     Running   0          2d3h
coredns-55cd7f87dc-csnnk                      1/1     Running   0          4h3m
coredns-55cd7f87dc-d4bl2                      1/1     Running   0          4h3m
coredns-55cd7f87dc-hkj85                      1/1     Running   0          4h3m
coredns-55cd7f87dc-ms4kx                      1/1     Running   0          4h3m
kube-proxy-77zdf                              1/1     Running   0          130m
kube-proxy-fv8tc                              1/1     Running   0          129m
kube-proxy-nklhv                              1/1     Running   0          129m
kube-proxy-wvvmf                              1/1     Running   0          129m
seldon-spartakus-volunteer-5b57b95596-gsk2d   1/1     Running   0          2d3h
⇒ kubectl get svc kube-dns -n kube-system
NAME       TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)         AGE
kube-dns   ClusterIP   172.20.0.10   <none>        53/UDP,53/TCP   22d
/ # cat /etc/resolv.conf
nameserver 172.20.0.10
search lpa.svc.cluster.local svc.cluster.local cluster.local us-west-2.compute.internal
options ndots:5
Any ideas on how to resolve this?