Istio outlier detection breaking routing with no metrics


We have been using Istio for some time, but recently discovered an issue we can't explain with outlier detection. Out of our 50+ microservices, a few (at least 2-3) do not seem to be load balancing traffic. We have tracked this down to outlier detection: once we remove it from the destination rule, load balancing works correctly.

The screenshot (not reproduced here) showed <1% of the traffic going to the pod ending in 8kh2p.
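For reference, the skew can be quantified from the server-side request metrics. This is a sketch that assumes your Prometheus attaches the pod label at scrape time, as the standard Istio scrape configs do:

sum(rate(istio_requests_total{reporter="destination", destination_workload="some-service"}[5m])) by (pod)

With even load balancing, each pod should report a roughly equal rate.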
My main issue is that even though we can replicate the problem and resolve it by removing outlier detection, we are seeing no metrics showing that the circuit breaker/outlier detection has been tripped. According to this GitHub issue - https://github.com/istio/istio/issues/8902 - we should be able to track it with something similar to:

sum(istio_requests_total{response_code="503", response_flags="UO"}) by (source_workload, destination_workload, response_code) 
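Worth noting: an ejection that happens while other healthy hosts remain does not necessarily produce any error response at all (traffic simply shifts away), so a query pinned to 503/UO can stay empty even while hosts are being ejected. A broader sketch over all flagged requests, assuming the standard istio_requests_total labels (unflagged requests carry response_flags="-"):

sum(rate(istio_requests_total{destination_workload="some-service", response_flags!="-"}[5m])) by (response_code, response_flags)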

I have also found some Envoy documentation suggesting I should be able to track it with:

envoy_cluster_circuit_breakers_default_cx_open

None of these metrics show anything being triggered.
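Two things may explain the silence here. First, outlier detection is enforced by the client-side sidecar, so any ejection stats live on the pods calling the service, not on the service's own pods. Second, Istio's sidecars expose only a subset of Envoy stats to Prometheus by default, and the cluster-level circuit breaker and outlier_detection stats are not in the default inclusion list, so they may exist without ever being scraped. A sketch that reads them straight from the Envoy admin endpoint (the client pod name is a placeholder):

# Dump raw Envoy stats from the sidecar of a pod that CALLS some-service
kubectl -n some-namespace exec some-client-pod-xxxxx -c istio-proxy -- \
  pilot-agent request GET stats | grep 'some-service.*outlier_detection'

If ejections_enforced_total is climbing here while Prometheus shows nothing, the stats simply need to be added to the sidecar's inclusion list (e.g. via the sidecar.istio.io/statsInclusionPrefixes annotation or meshConfig's proxyStatsMatcher).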

I also want to point out a similar post on Stack Overflow, which did not resolve our issue.

If anyone could help figure out why traffic is not load balancing correctly with outlier detection enabled, or at least suggest a way to track that it is being tripped, it would be much appreciated.

kind: DestinationRule
apiVersion: networking.istio.io/v1alpha3
metadata:
  name: some-service-dr
  namespace: some-namespace
spec:
  host: some-service.some-namespace.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1000
        idleTimeout: 3s
        maxRequestsPerConnection: 1
      tcp:
        maxConnections: 500
    outlierDetection:
      consecutive5xxErrors: 0 # disabled, since our services intentionally return 500s
      consecutiveGatewayErrors: 5 # 502, 503, 504 should trigger this
      interval: 10s
      maxEjectionPercent: 50
    tls:
      mode: ISTIO_MUTUAL
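To confirm what Envoy actually received from this destination rule, you can dump the generated cluster config from a client pod's sidecar and look for the outlierDetection block in the output. A sketch with a placeholder pod name:

istioctl proxy-config cluster some-client-pod-xxxxx.some-namespace \
  --fqdn some-service.some-namespace.svc.cluster.local -o json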

Our virtual services look like this:

kind: VirtualService
apiVersion: networking.istio.io/v1alpha3
metadata:
  name: some-service-vs
  namespace: some-namespace
spec:
  hosts:
    - some-service.some-namespace.svc.cluster.local
  http:
    - retries:
        attempts: 5
        perTryTimeout: 30s
        retryOn: 'connect-failure,refused-stream,reset'
      route:
        - destination:
            host: some-service.some-namespace.svc.cluster.local
            port:
              number: 80
  exportTo:
    - .
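One thing about this retry policy: a try that fails with connect-failure but succeeds on retry still reports a success at the request level, so request metrics can look clean even while individual endpoints are being ejected. Envoy's per-endpoint view shows the ejection state directly; a sketch, again with a placeholder client pod name:

istioctl proxy-config endpoints some-client-pod-xxxxx.some-namespace \
  --cluster "outbound|80||some-service.some-namespace.svc.cluster.local"

An endpoint whose OUTLIER CHECK column reads FAILED is currently ejected.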

Peer Authentication

kind: PeerAuthentication
apiVersion: security.istio.io/v1beta1
metadata:
  name: some-service-tls-policy
  namespace: some-namespace
spec:
  selector:
    matchLabels:
      app: some-service
  mtls:
    mode: STRICT
  portLevelMtls: ~

Kubernetes version v1.21.x

Istio version 1.10.x

Prometheus version 2.28.x

UPDATE

I have updated our destination rule, setting consecutive5xxErrors and consecutiveGatewayErrors both to 0, and the issue still persists: with 2 pods, one takes 100% of the traffic and nothing is load balanced to the other. New settings below.

outlierDetection:
  consecutive5xxErrors: 0
  consecutiveGatewayErrors: 0
  interval: 10s
  maxEjectionPercent: 50
