ActiveMQ Artemis HA split-brain issue on OOME crash


We have an ActiveMQ Artemis 2.31.x HA configuration in a primary/backup setup. Failover works fine when the primary pod is scaled down or stopped gracefully: the backup becomes active, the original primary becomes the backup after it restarts, and traffic is routed properly to the new primary.

However, the problem appears when the primary instance crashes with an OOME (OutOfMemoryError). The backup becomes active, but only some of the connections go to this new primary; the others remain on the original primary, which restarted and became a backup after the crash. I was also able to run queue stats on the restarted backup, which you normally can't do on a clean backup instance. It seems the switch between primary and backup is not clean when the crash is caused by an OOME. Is this expected? In other words, is ActiveMQ Artemis supposed to switch cleanly between primary and backup even in the case of an OOME?

We're using replication:

<ha-policy>
   <replication>
      <master>
         <check-for-live-server>true</check-for-live-server>
      </master>
   </replication>
</ha-policy>

Answered by Justin Bertram

If you're using a single primary/backup pair of brokers, you're going to be especially susceptible to split brain. There are a handful of ways to mitigate it:

  • Use ZooKeeper as the arbiter of consensus. See example here.
  • Use 3 primary/backup pairs to establish a proper quorum for voting.
  • Use the basic network "pinger" functionality.
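As a rough illustration of the ZooKeeper option, the primary's ha-policy in broker.xml might look something like the sketch below. The host names and ports in connect-string are placeholders, not values from the question, and the element names should be checked against the documentation for the broker version in use. The manager's class-name is left out on the assumption that the ZooKeeper-based manager is the default.

```xml
<ha-policy>
   <replication>
      <primary>
         <manager>
            <properties>
               <!-- Placeholder ZooKeeper ensemble; replace with your own hosts -->
               <property key="connect-string" value="zk-0:2181,zk-1:2181,zk-2:2181"/>
            </properties>
         </manager>
      </primary>
   </replication>
</ha-policy>
```

The backup broker would carry a matching manager section inside a backup element, so that both brokers consult the same ZooKeeper ensemble before deciding which one should be active.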

Find more details in the documentation.
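For the network "pinger" option, a minimal sketch of the relevant broker.xml settings could look like this. The address 10.0.0.1 is a hypothetical, always-reachable host (for example a gateway), and the period and timeout values are only illustrative. These elements go inside the core element on both brokers.

```xml
<core xmlns="urn:activemq:core">
   <!-- Hypothetical address the broker pings to decide whether it is isolated -->
   <network-check-list>10.0.0.1</network-check-list>
   <!-- How often to run the check, in milliseconds -->
   <network-check-period>10000</network-check-period>
   <!-- How long to wait for each ping reply, in milliseconds -->
   <network-check-timeout>1000</network-check-timeout>
</core>
```

The idea is that a broker that cannot reach the configured address treats itself as cut off from the network and stops serving clients until connectivity returns. This narrows the window for split brain but, unlike a quorum or ZooKeeper, it does not provide real consensus.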