ActiveMQ Artemis HA split-brain issue on OOME crash

Asked by sg2000 on 12 March 2024 at 22:30 (34 views)

We have an ActiveMQ Artemis 2.31.x HA configuration in a primary/backup setup. Everything works fine when the primary pod is scaled down or stopped gracefully: the backup becomes active, the original primary comes back as the backup after it restarts, and traffic is routed properly to the new primary.

However, a problem appears when the primary instance crashes with an OOME (OutOfMemoryError). The backup becomes active, but only some of the connections go to this new primary; the others remain on the original primary, which restarted after the crash and should now be acting as a backup. I was also able to run queue stats on the restarted instance, which you normally can't do on a clean backup. It seems the switch between primary and backup is not clean when the crash is caused by an OOME. Is this expected? In other words, is ActiveMQ Artemis supposed to switch cleanly between live and backup even in the case of an OOME?

We're using replication:

<ha-policy>
   <replication>
      <master>
         <check-for-live-server>true</check-for-live-server>
      </master>
   </replication>
</ha-policy>

Answered by Justin Bertram on 15 March 2024 at 15:44

If you're using a single primary/backup pair of brokers, you're going to be especially susceptible to split brain. There are a handful of ways to mitigate split brain:

Use ZooKeeper as the arbiter of consensus. See example here.
Use 3 primary/backup pairs to establish a proper quorum for voting.
Use the basic network "pinger" functionality.
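As a rough illustration of the ZooKeeper option, the primary's ha-policy in broker.xml might look something like the sketch below. It assumes the pluggable quorum support available in 2.31.x, and the ZooKeeper host names in connect-string are hypothetical placeholders rather than values from the question. Check the element names against the documentation for your exact version before relying on it.

```xml
<!-- Sketch only: primary broker using ZooKeeper (via Curator) as the arbiter. -->
<!-- The connect-string hosts are hypothetical; point them at your own ZooKeeper ensemble. -->
<ha-policy>
   <replication>
      <primary>
         <manager>
            <class-name>org.apache.activemq.artemis.quorum.zookeeper.CuratorDistributedPrimitiveManager</class-name>
            <properties>
               <property key="connect-string" value="zk1:2181,zk2:2181,zk3:2181"/>
            </properties>
         </manager>
      </primary>
   </replication>
</ha-policy>
```

The backup broker would carry a matching manager block under a backup element instead of primary, so that both brokers consult the same ZooKeeper ensemble before either one activates.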

Find more details in the documentation.
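For the network "pinger" option, a minimal sketch of the relevant settings inside the core element of broker.xml could look like this. The address in network-check-list is a hypothetical placeholder (for example a gateway that both brokers should always be able to reach), and the period and timeout values are only illustrative.

```xml
<!-- Sketch only: goes inside <core> in broker.xml on both brokers. -->
<!-- 10.0.0.1 is a hypothetical address; use one that reflects real network health in your environment. -->
<network-check-period>10000</network-check-period>   <!-- how often to check, in milliseconds -->
<network-check-timeout>1000</network-check-timeout>  <!-- how long to wait for a reply, in milliseconds -->
<network-check-list>10.0.0.1</network-check-list>    <!-- comma-separated addresses to ping -->
```

With this in place, a broker that cannot reach any address on the list shuts itself down until the network is reachable again, which reduces the chance of an isolated broker continuing to serve clients. It does not provide a true quorum, so it is the weakest of the three mitigations.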
