I am seeing a weird issue in my GKE cluster where inter-namespace API calls take noticeably longer than calls within the same namespace.
- Say I have two namespaces, `ns-1` and `ns-2`.
- I have an API deployed in `ns-1`; let's call it `api-service`.
- If I call `api-service` from any other app/pod deployed in `ns-1`, it takes around 5 ms.
- If I call `api-service` from an app/pod deployed in `ns-2`, it takes ~40 ms. To call it from another namespace I am using the FQDN `api-service.ns-1.svc.cluster.local`.
- When I call `api-service` from `ns-1` as well, using the FQDN `api-service.ns-1.svc.cluster.local`, it again takes ~25-40 ms.
Prima facie, it looks like inter-namespace communication, or rather using the FQDN, adds latency.
I tried to check the documentation for this but couldn't find anything that mentions it.
This shouldn't normally happen, but I am clueless as of now.
Any help would be appreciated.
Note:
- I am making the same API call from both pods
- Response time from different pods:
// calling from the same namespace (ns-1), using the short service name
# curl -X POST "http://api-service:8080/hello" -w %{time_connect}:%{time_starttransfer}:%{time_total}
0.004539:0.007073:0.007096

// calling from the same namespace (ns-1), using the FQDN
# curl -X POST "http://api-service.ns-1.svc.cluster.local:8080/hello" -w %{time_connect}:%{time_starttransfer}:%{time_total}
0.028735:0.030097:0.030158

// calling from a different namespace (ns-2), using the FQDN
# curl -X POST "http://api-service.ns-1.svc.cluster.local:8080/hello" -w %{time_connect}:%{time_starttransfer}:%{time_total}
0.125594:0.163450:0.163519
0.028722:0.030159:0.030221
Alternatively, if anyone can point me to the relevant documentation, that would help too.
As per your question, it seems the extra latency shows up with inter-namespace communication, or more precisely whenever you use the FQDN.
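Before changing anything, it can help to confirm that DNS resolution, rather than the network path between the namespaces, is what adds the time. A minimal sketch, assuming `curl` is available in the calling pod (the URL and port are taken from your example; adjust as needed):

```
# Break out DNS lookup time separately from connect/total time.
# A large time_namelookup for the FQDN call points at DNS/search-path behaviour,
# not at inter-namespace networking.
curl -X POST -s -o /dev/null \
  -w 'dns=%{time_namelookup} connect=%{time_connect} total=%{time_total}\n' \
  "http://api-service:8080/hello"

curl -X POST -s -o /dev/null \
  -w 'dns=%{time_namelookup} connect=%{time_connect} total=%{time_total}\n' \
  "http://api-service.ns-1.svc.cluster.local:8080/hello"
```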
Can you try the steps below, which may help reduce the latency:
1. How to reduce inter-namespace communication latency:
A DNS cache (either the Kubernetes NodeLocal DNSCache or the GCP add-on) may give up to a ~2.5x improvement in DNS performance, so it is definitely worth enabling if you want to reduce latency.
Try the following configurations and compare the results:
a. Enable NodeLocal DNSCache, as noted in the official Kubernetes documentation (see the sketch after this list).
b. The DNS cache add-on, as mentioned in the official GCP documentation.
c. No DNS cache at all, only the native `kube-dns` (as a baseline).
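For reference, a minimal sketch of enabling option (a) on an existing GKE cluster; the cluster name and zone are placeholders, and the flag should be double-checked against the current gcloud documentation:

```
# Enable the NodeLocal DNSCache add-on on an existing cluster.
# Note: this recreates the nodes, so expect a rolling disruption.
gcloud container clusters update my-cluster \
  --zone us-central1-a \
  --update-addons=NodeLocalDNS=ENABLED
```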
2. How to reduce FQDN latency:
If you are able to configure your application to make DNS queries with a trailing dot (for example, `google.com.`), do that. The reason for adding a trailing dot is that `ndots` is currently left at its default value of 5, which means any hostname with fewer than 5 dots is not treated as an FQDN and generates extra DNS queries: roughly six lookups per query, because the resolver appends the search domains `<namespace>.svc.cluster.local`, `svc.cluster.local`, `cluster.local`, `c.<project-id>.internal`, and `google.internal` before attempting the hostname by itself. If you aren't able to make that change in the application, the other option is to implement a specific DNS configuration on a per-deployment basis.
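A minimal sketch, assuming a Deployment named `my-app` running in `ns-2` (the name, image, and labels are placeholders; the relevant part is `dnsConfig`):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: ns-2
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:latest
      # Override the ndots option in the pods' generated resolv.conf
      dnsConfig:
        options:
        - name: ndots
          value: "1"
```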
The above will create a `resolv.conf` with the `ndots` value set to 1, so any hostname containing at least a single dot will be treated as an FQDN and looked up directly. This should alleviate some of the strain on `kube-dns` and NodeLocal DNSCache.
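To verify the effect, you can repeat your original measurement from `ns-2` and compare it with a call that bypasses the search-path expansion entirely by using a trailing dot (a sketch, reusing the endpoint from your question):

```
# With ndots: 1 (or a trailing dot), the FQDN is resolved in a single lookup,
# so time_namelookup should drop to roughly the same value in both namespaces.
curl -X POST \
  -w '%{time_namelookup}:%{time_connect}:%{time_total}\n' \
  "http://api-service.ns-1.svc.cluster.local.:8080/hello"
```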