I use the MemberLeft event to clean up legacy data. I found that a node (IP: 192.168.12.212) left and a new node (IP: 192.168.12.250) started up, but no MemberLeft event appears in the log; only MemberJoined events can be found. The node at 192.168.12.212 may have had a JVM problem.
The log is as follows:
[INFO] 2023-07-11 08:13:23,195 [ClusterSystem-akka.actor.default-dispatcher-20] a.m.c.b.i.BootstrapCoordinator - Looking up [Lookup(guandata-server,None,Some(tcp))]
[INFO] 2023-07-11 08:13:23,195 [ClusterSystem-akka.actor.default-dispatcher-20] a.d.k.KubernetesApiServiceDiscovery - Querying for pods with label selector: [app=guandata-server]. Namespace: [default]. Port: [None]
[INFO] 2023-07-11 08:13:23,330 [ClusterSystem-akka.actor.default-dispatcher-3] a.m.c.b.i.BootstrapCoordinator - Located service members based on: [Lookup(guandata-server,None,Some(tcp))]: [ResolvedTarget(192-168-64-250.default.pod.cluster.local,None,Some(/192.168.64.250)), ResolvedTarget(192-168-66-128.default.pod.cluster.local,None,Some(/192.168.66.128)), ResolvedTarget(192-168-65-229.default.pod.cluster.local,None,Some(/192.168.65.229))], filtered to [192-168-64-250.default.pod.cluster.local:0, 192-168-66-128.default.pod.cluster.local:0, 192-168-65-229.default.pod.cluster.local:0]
[INFO] 2023-07-11 08:13:23,351 [ClusterSystem-akka.actor.default-dispatcher-26] a.m.c.b.i.BootstrapCoordinator - Located service members based on: [Lookup(guandata-server,None,Some(tcp))]: [ResolvedTarget(192-168-64-250.default.pod.cluster.local,None,Some(/192.168.64.250)), ResolvedTarget(192-168-66-128.default.pod.cluster.local,None,Some(/192.168.66.128)), ResolvedTarget(192-168-65-229.default.pod.cluster.local,None,Some(/192.168.65.229))], filtered to [192-168-64-250.default.pod.cluster.local:0, 192-168-66-128.default.pod.cluster.local:0, 192-168-65-229.default.pod.cluster.local:0]
[INFO] 2023-07-11 08:13:23,445 [ClusterSystem-akka.actor.default-dispatcher-36] a.m.c.b.i.BootstrapCoordinator - Contact point [akka://[email protected]:25520] returned [2] seed-nodes [akka://[email protected]:25520, akka://[email protected]:25520]
[INFO] 2023-07-11 08:13:23,461 [ClusterSystem-akka.actor.default-dispatcher-36] a.m.c.b.i.BootstrapCoordinator - Contact point [akka://[email protected]:25520] returned [2] seed-nodes [akka://[email protected]:25520, akka://[email protected]:25520]
[INFO] 2023-07-11 08:13:23,465 [ClusterSystem-akka.actor.default-dispatcher-36] a.m.c.b.i.BootstrapCoordinator - Joining [akka://[email protected]:25520] to existing cluster [akka://[email protected]:25520, akka://[email protected]:25520]
[INFO] 2023-07-11 08:13:23,633 [ClusterSystem-akka.actor.default-dispatcher-36] a.m.c.b.c.HttpClusterBootstrapRoutes - Bootstrap request from 192.168.64.250:49804: Contact Point returning 0 seed-nodes []
[INFO] 2023-07-11 08:13:23,736 [ClusterSystem-akka.actor.default-dispatcher-36] a.a.LocalActorRef - Message [akka.management.cluster.bootstrap.contactpoint.HttpBootstrapJsonProtocol$SeedNodes] from Actor[akka://ClusterSystem/system/bootstrapCoordinator/contactPointProbe-192-168-64-250.default.pod.cluster.local-8558#-1278944557] to Actor[akka://ClusterSystem/system/bootstrapCoordinator/contactPointProbe-192-168-64-250.default.pod.cluster.local-8558#-1278944557] was not delivered. [1] dead letters encountered. If this is not an expected behavior then Actor[akka://ClusterSystem/system/bootstrapCoordinator/contactPointProbe-192-168-64-250.default.pod.cluster.local-8558#-1278944557] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[INFO] 2023-07-11 08:13:23,984 [ClusterSystem-akka.actor.default-dispatcher-20] a.c.Cluster - Cluster Node [akka://[email protected]:25520] - Received InitJoinAck message from [Actor[akka://[email protected]:25520/system/cluster/core/daemon#-1974909478]] to [akka://[email protected]:25520]
[INFO] 2023-07-11 08:13:24,094 [ClusterSystem-akka.actor.default-dispatcher-3] a.c.Cluster - Cluster Node [akka://[email protected]:25520] - Welcome from [akka://[email protected]:25520]
[INFO] 2023-07-11 08:13:24,110 [ClusterSystem-akka.actor.default-dispatcher-36] s.c.MemberEventMonitorActor$ - ===> message received: MemberJoined(Member(akka://[email protected]:25520, Joining))
[INFO] 2023-07-11 08:13:24,115 [ClusterSystem-akka.actor.default-dispatcher-36] s.c.MemberEventMonitorActor$ -
[Cluster State] leaders: List()
[Cluster State] members: List(Member(akka://[email protected]:25520, Joining), Member(akka://[email protected]:25520, Up), Member(akka://[email protected]:25520, Up))
[Cluster State] unreachable: List()
[INFO] 2023-07-11 08:13:24,124 [ClusterSystem-akka.actor.default-dispatcher-36] s.c.MemberEventMonitorActor$ - [MemberUp] ===> 192.168.65.229 up cluster
[INFO] 2023-07-11 08:13:24,124 [ClusterSystem-akka.actor.default-dispatcher-36] s.c.MemberEventMonitorActor$ -
[Cluster State] leaders: List()
[Cluster State] members: List(Member(akka://[email protected]:25520, Joining), Member(akka://[email protected]:25520, Up), Member(akka://[email protected]:25520, Up))
[Cluster State] unreachable: List()
[INFO] 2023-07-11 08:13:24,124 [ClusterSystem-akka.actor.default-dispatcher-36] s.c.MemberEventMonitorActor$ - [MemberUp] ===> 192.168.66.128 up cluster
[INFO] 2023-07-11 08:13:24,125 [ClusterSystem-akka.actor.default-dispatcher-36] s.c.MemberEventMonitorActor$ -
[Cluster State] leaders: List()
[Cluster State] members: List(Member(akka://[email protected]:25520, Joining), Member(akka://[email protected]:25520, Up), Member(akka://[email protected]:25520, Up))
[Cluster State] unreachable: List()
[INFO] 2023-07-11 08:13:24,988 [main] c.g.f.i.GuandataFileSystemFactory - file:/// fileSystem has been created ......
[INFO] 2023-07-11 08:13:25,070 [ClusterSystem-akka.actor.default-dispatcher-3] s.c.MemberEventMonitorActor$ - [MemberUp] ===> 192.168.64.250 up cluster
[INFO] 2023-07-11 08:13:25,071 [ClusterSystem-akka.actor.default-dispatcher-20] a.c.s.SplitBrainResolver - This node is now the leader responsible for taking SBR decisions among the reachable nodes (more leaders may exist).
[INFO] 2023-07-11 08:13:25,070 [ClusterSystem-akka.actor.default-dispatcher-3] s.c.MemberEventMonitorActor$ -
[Cluster State] leaders: List(akka://[email protected]:25520)
[Cluster State] members: List(Member(akka://[email protected]:25520, Up), Member(akka://[email protected]:25520, Up), Member(akka://[email protected]:25520, Up))
[Cluster State] unreachable: List()
I want to know why Akka Cluster didn't send the MemberLeft event, and which event I can use to determine node status.
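For reference, here is a minimal sketch of how such a member-event subscription typically looks with the classic Cluster API (illustrative only; this is not the actual MemberEventMonitorActor, and the class and log messages are placeholders):

```scala
import akka.actor.{Actor, ActorLogging}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent._

// Rough sketch of a member-event monitor: subscribes to all MemberEvent
// subtypes and logs them as they arrive.
class MemberEventMonitor extends Actor with ActorLogging {
  private val cluster = Cluster(context.system)

  override def preStart(): Unit =
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents, classOf[MemberEvent])

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive: Receive = {
    case MemberJoined(member) => log.info("MemberJoined: {}", member.address)
    case MemberUp(member)     => log.info("MemberUp: {}", member.address)
    case MemberLeft(member)   => log.info("MemberLeft: {}", member.address) // cleanup is triggered here
    case other: MemberEvent   => log.info("Other member event: {}", other)
  }
}
```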
MemberLeft only happens when a node goes from Up to Leaving, which is the "graceful exit". If the node did not leave gracefully (e.g. the JVM it was running on crashed, a network issue disrupted connectivity, or it was too busy to send heartbeats because of workload, a GC pause, CPU oversubscription...), then it goes through a different path, through Down.

The MemberRemoved event is probably what you're looking for, especially if the plan is to run a cleanup from some other node of the cluster whether the removal was graceful or not.

Note that the MemberRemoved event might not be received if the entire cluster goes down, and that in the non-graceful case there is no intrinsic guarantee that the affected node has actually stopped (ideally the node has the default run-coordinated-shutdown-when-down = on, but that is not technically guaranteed to happen within any finite amount of time: consider what a long GC pause or a suspend/resume would do). For the first situation, where every node in the cluster crashes, manual cleanup might be required (or, if the cleanup is not strictly required for correct operation, you grin and bear it); for the second, if the cleanup would be messed up by a downed node that hadn't yet stopped, delaying the actual cleanup might be a good idea.
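A minimal sketch of that approach with the classic Cluster API, under the assumption of a dedicated cleanup actor (the class name, cleanUpLegacyData, and the 5-minute delay are illustrative placeholders): subscribe to MemberRemoved, clean up right away on a graceful removal, and defer the cleanup when the previous status was Down, since the downed node may still be running.

```scala
import akka.actor.{Actor, ActorLogging}
import akka.cluster.{Cluster, MemberStatus}
import akka.cluster.ClusterEvent.{InitialStateAsEvents, MemberRemoved}
import scala.concurrent.duration._

// Sketch only: reacts to MemberRemoved regardless of whether the exit was graceful.
class LegacyDataCleaner extends Actor with ActorLogging {
  import context.dispatcher
  private val cluster = Cluster(context.system)

  override def preStart(): Unit =
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents, classOf[MemberRemoved])

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive: Receive = {
    case MemberRemoved(member, MemberStatus.Down) =>
      // Non-graceful removal: the node may not have actually stopped yet, so defer the cleanup.
      context.system.scheduler.scheduleOnce(5.minutes)(cleanUpLegacyData(member.address.toString))
    case MemberRemoved(member, previousStatus) =>
      // Graceful removal (previous status is typically Exiting): clean up right away.
      log.info("Member {} removed (was {})", member.address, previousStatus)
      cleanUpLegacyData(member.address.toString)
  }

  // Hypothetical placeholder for the actual legacy-data cleanup.
  private def cleanUpLegacyData(address: String): Unit =
    log.info("Cleaning up legacy data for {}", address)
}
```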