Issues with NCache cluster cache running on a single cluster, after the other instance goes offline

258 Views Asked by At

We have a replicated cluster cache setup with two instances, everything runs well when both instances are on-line, and we are using Community Edition 4.8.

When we take an instance offline, cache management becomes very slow and even stopping and starting the cache from NCache Manager GUI takes a very long time and then shows a message stating that there is an instance that is un-reachable.

Also when trying to fetch data from cache or add data to it, it gives an exception of operation timeout, and there is no response from the single instance that is still running.

From what I understand, this scenario should be handled by the cache service it-self since it is replicated, and it should handle failure for an instance going offline.

Thank you,

1

There are 1 best solutions below

0
Shoeb Lodhi On

I would like to explain the cause of slowness on your application when one of the server node gets removed from the cache cluster.

What happens is whenever a node gets removed from the cache cluster, the surviving node/nodes go into recovery process and try to re-establish connection with that downed server node. By default this Connection retry value is set to “2” which means that the surviving nodes would try to reconnect with the downed node two times and after the reconnection has failed, the cache cluster would consider the downed server and offline and the cluster would start handling requests like before. This reconnection process can take up to 90 seconds as this is the default TCP/IP timeout interval and if the connection retry is set to “2” the recovery process could take up to around 200 seconds. Your application(Or NCache Manager calls) could experience slowness or request timeouts during this 2-3 minutes window when the cluster is in the recovery mode but once the recovery process is finished, the application should start working without any issues. If the slowness or request timeouts last more than a few minutes

The Connection retry value can be changed from the NCache “Config.ncconf” file. Increasing the number of connection retries would mean that the cluster would spend more time in the recovery process. The purpose of this feature is that if there is a network glitch in the environment and the server nodes lose connection with eachother, the servers would get reconnected automatically due to this recovery process. This is the reason why it is recommended to keep the Connection Retry interval set to at least 1.