I have an upstream (generated by the OPNsense Nginx plugin):
upstream upstream9dbd5491033b477e84564ebe3e516c0b {
    server aa.bb.cc.d1:443 weight=1 max_conns=10000 max_fails=3 fail_timeout=10;
    server aa.bb.cc.d2:443 weight=1 max_conns=10000 max_fails=3 fail_timeout=10;
    server aa.bb.cc.d3:443 weight=1 max_conns=10000 max_fails=3 fail_timeout=10;
}
and host aa.bb.cc.d3 is down. However, Nginx does not detect the host as down unless I add the down flag to it.
I expect Nginx to stop forwarding requests to that server. Unfortunately, it still does (there is a significant performance change when I mark the server as "down" manually).
The statistics view in OPNsense also says that server aa.bb.cc.d3 is up.
The documentation [1] is quite clear, except for the following point:
What is considered an unsuccessful attempt is defined by the proxy_next_upstream, fastcgi_next_upstream, uwsgi_next_upstream, scgi_next_upstream, memcached_next_upstream, and grpc_next_upstream directives.
Well, I have not set proxy_next_upstream [2], and its default value is error timeout, where error means:
an error occurred while establishing a connection with the server, passing a request to it, or reading the response header
But the default of proxy_next_upstream_timeout is 0:
Limits the time during which a request can be passed to the next server. The 0 value turns off this limitation.
Do these default values disable the feature completely, or what else could be the reason that Nginx still keeps a server up that is not reachable at all?
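For reference, here is a sketch of what those defaults amount to when written out explicitly. The location block itself is hypothetical (the plugin generates its own), and the values are the documented defaults as I understand them, with proxy_next_upstream defaulting to error timeout:

```nginx
# Hypothetical location block; only the three proxy_next_upstream* lines matter here.
location / {
    proxy_pass https://upstream9dbd5491033b477e84564ebe3e516c0b;

    # Documented defaults, shown explicitly rather than left implicit:
    proxy_next_upstream error timeout;   # which failures count as unsuccessful attempts
    proxy_next_upstream_timeout 0;       # 0 = no time limit on passing to the next server
    proxy_next_upstream_tries 0;         # 0 = no limit on the number of tries
}
```

Read this way, a value of 0 removes a limit on retrying; it does not by itself switch the retry behavior off.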
References:
[1] https://nginx.org/en/docs/http/ngx_http_upstream_module.html
[2] https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_next_upstream
It looks like the feature to "mark a host down automatically after n retries" is not a basic feature and may only be available in the commercial health check module: https://nginx.org/en/docs/http/ngx_http_upstream_hc_module.html
The only way to get this feature to work is to reduce max_fails and fail_timeout and let proxy_next_upstream do the job.
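A minimal sketch of that approach, assuming the upstream from the question. The concrete values (max_fails=1, fail_timeout=5) and the location block are illustrative assumptions, not tested recommendations, and the OPNsense plugin may expose these settings differently:

```nginx
upstream upstream9dbd5491033b477e84564ebe3e516c0b {
    # Assumed values: a single failed attempt within the 5-second
    # fail_timeout window counts against the server.
    server aa.bb.cc.d1:443 weight=1 max_conns=10000 max_fails=1 fail_timeout=5;
    server aa.bb.cc.d2:443 weight=1 max_conns=10000 max_fails=1 fail_timeout=5;
    server aa.bb.cc.d3:443 weight=1 max_conns=10000 max_fails=1 fail_timeout=5;
}

# Hypothetical location block that proxies to the upstream above.
location / {
    proxy_pass https://upstream9dbd5491033b477e84564ebe3e516c0b;

    # Pass the request to the next server on connection errors and timeouts.
    proxy_next_upstream error timeout;
}
```

With settings like these, a request that fails against aa.bb.cc.d3 should be retried on one of the other servers instead of being returned to the client as an error.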