Single point of failure removal


I have a system similar to this (diagram not reproduced):

Here the Driver Service is my single point of failure (SPOF), as it handles the orchestration between the different services. If any other service goes down, no other service is affected, but if the Driver Service goes down, the whole orchestration flow stops. How can I remove this SPOF, or what alternative design should I consider?


There are 4 best solutions below

0
tax evader On

I think one of the common ways to remove a single point of failure is to run multiple distributed nodes that handle requests for the Driver Service.

The workload can be distributed between the nodes using a load balancer. In addition to distributing the workload, the load balancer can also perform a health check to make sure the nodes handling the requests are running as expected.
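As a rough illustration of the idea, here is a minimal in-memory sketch of a round-robin balancer that skips nodes failing a health check. The node names (`driver-1` and so on) and the `is_healthy` callback are assumptions made for the example; a real setup would normally rely on an existing load balancer rather than hand-written code.

```python
from typing import Callable, List


class RoundRobinBalancer:
    """Toy load balancer: rotates over nodes and skips unhealthy ones."""

    def __init__(self, nodes: List[str], is_healthy: Callable[[str], bool]):
        self._nodes = list(nodes)
        self._is_healthy = is_healthy  # hypothetical health-check callback
        self._next = 0

    def pick(self) -> str:
        # Look at each node at most once per call, starting where we left off.
        for _ in range(len(self._nodes)):
            node = self._nodes[self._next]
            self._next = (self._next + 1) % len(self._nodes)
            if self._is_healthy(node):
                return node
        raise RuntimeError("no healthy Driver Service node available")


# Hypothetical Driver Service instances and their current health status.
health = {"driver-1": True, "driver-2": True, "driver-3": True}
lb = RoundRobinBalancer(list(health), lambda node: health[node])

first_three = [lb.pick() for _ in range(3)]    # all nodes take turns
health["driver-2"] = False                     # simulate one node failing
after_failure = [lb.pick() for _ in range(4)]  # traffic avoids driver-2
```

Requests keep flowing as long as at least one node passes the health check, which is the property that removes the Driver Service as a single point of failure.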

This of course would make the load balancer a single point of failure. You can avoid this by having multiple load balancers whose IPs are registered under the same domain name in DNS, so that if a request to that domain can't reach one load balancer, the client can retry against another.
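Whether a retry actually happens depends on the client, so the sketch below makes it explicit. It assumes the DNS lookup has already returned several addresses; the IPs (taken from a documentation range) and the `fake_send` function are placeholders for the example, not real endpoints.

```python
from typing import Callable, Iterable


def send_with_failover(addresses: Iterable[str], send: Callable[[str], str]) -> str:
    """Try each load balancer address in turn until one accepts the request."""
    last_error = None
    for address in addresses:
        try:
            return send(address)
        except ConnectionError as exc:
            last_error = exc  # this load balancer is unreachable; try the next one
    raise ConnectionError("all load balancers are unreachable") from last_error


# Hypothetical addresses, as if one domain name resolved to two load balancers.
LB_ADDRESSES = ["203.0.113.10", "203.0.113.11"]
DOWN = {"203.0.113.10"}  # simulate the first load balancer being down


def fake_send(address: str) -> str:
    # Stand-in for a real network call to the load balancer.
    if address in DOWN:
        raise ConnectionError(f"{address} is down")
    return f"handled by {address}"


result = send_with_failover(LB_ADDRESSES, fake_send)
```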

1
StepUp On

It looks like some of your services know how to call each other, which means they depend on each other. This leads to tight coupling between services: if you change the driver service, you also need to edit the shipment service, and as a result you need to redeploy every other service that works with the driver service.

As an alternative, you can use message queues or communication buses. You are right that they can become a single point of failure too, but in that case you can deploy multiple instances of the message queue.
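A minimal sketch of the queue-based approach, assuming an in-process `queue.Queue` as a stand-in for a real message broker. The worker names and job names are made up for the example. The producer only knows about the queue, and several consumer instances compete for messages, so losing one consumer does not stop the flow.

```python
import queue
import threading

jobs = queue.Queue()  # stand-in for a real message broker
handled = []          # (worker name, job) pairs, for inspection
handled_lock = threading.Lock()


def driver_worker(name: str) -> None:
    """One Driver Service instance: pull jobs from the queue until told to stop."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel value: shut this worker down
            jobs.task_done()
            return
        with handled_lock:
            handled.append((name, job))
        jobs.task_done()


# Two hypothetical Driver Service instances consuming from the same queue.
workers = [threading.Thread(target=driver_worker, args=(f"driver-{i}",)) for i in (1, 2)]
for worker in workers:
    worker.start()

# The producer (for example the shipment service) never calls a worker directly.
for n in range(5):
    jobs.put(f"shipment-{n}")

for _ in workers:
    jobs.put(None)  # one sentinel per worker
for worker in workers:
    worker.join()

processed_jobs = sorted(job for _, job in handled)
```

Which worker handles which job is not deterministic; only the fact that every job is handled exactly once is.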

0
R.Abbasi On

In the design, the driver service looks like an orchestrator. Orchestrators have many advantages, but they suffer from coupling and from being a SPOF. There is an alternative way to design your system, called choreography. In this approach, services communicate with each other via a message broker. They are decoupled, but the design is much more complex, and bugs are harder to trace and debug.
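To make the difference concrete, here is a small sketch of choreography using an in-memory event bus. The topic names (`order.placed`, `driver.assigned`) and the handlers are assumptions made for the example, and `EventBus` stands in for a real message broker. No central component drives the flow: each service reacts to an event and publishes the next one.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Event = Dict[str, str]
Handler = Callable[[Event], None]


class EventBus:
    """Toy in-memory stand-in for a message broker."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> None:
        # Deliver the event to every handler subscribed to this topic.
        for handler in list(self._handlers[topic]):
            handler(event)


bus = EventBus()
log: List[str] = []


def on_order_placed(event: Event) -> None:
    # Hypothetical driver service: reacts to a new order, then emits its own event.
    log.append(f"driver assigned for {event['order_id']}")
    bus.publish("driver.assigned", event)


def on_driver_assigned(event: Event) -> None:
    # Hypothetical shipment service: knows the event, not the driver service.
    log.append(f"shipment started for {event['order_id']}")


bus.subscribe("order.placed", on_order_placed)
bus.subscribe("driver.assigned", on_driver_assigned)
bus.publish("order.placed", {"order_id": "42"})
```

The flow is now spread across the handlers, which is why tracing a single request end to end takes more effort than with an orchestrator.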

Before choosing, think about the requirements and scale of your application. Orchestration might be a good solution if availability is not an important factor, or if you can make the SPOF highly available (by scaling and other patterns).

0
Matt Timmermans On

The "Driver Service" is a single "thing" in your diagram, but that doesn't matter.

You should have multiple instances of that service deployed across different data centers (or regions or availability zones or whatever). As long as those multiple instances are independent, then there is no single thing that can take down the system by failing.

Since the service isn't really a single thing, it isn't a "single point of failure".

Note, however, that there is an important way in which all those service instances might not be "independent": they may all run exactly the same software, so a single bad release can take all of them down at once. It's important to guard against this with canary deployments, incremental rollout, and rollback capability.
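A minimal sketch of an incremental rollout with rollback, assuming a dictionary of instance names to versions and a hypothetical `is_healthy` check. Both are invented for the example and do not come from any real deployment tool. The first batch plays the role of the canary: if it is unhealthy, every instance is restored to its previous version before the bad release spreads.

```python
from typing import Callable, Dict


def rolling_deploy(
    instances: Dict[str, str],
    new_version: str,
    is_healthy: Callable[[str, str], bool],
    batch_size: int = 1,
) -> bool:
    """Upgrade instances batch by batch; roll back everything on the first unhealthy batch."""
    previous = dict(instances)  # remember the versions to roll back to
    names = list(instances)
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        for name in batch:
            instances[name] = new_version
        if not all(is_healthy(name, new_version) for name in batch):
            instances.clear()
            instances.update(previous)  # rollback: restore every instance
            return False
    return True


# Hypothetical fleet spread across three locations, all on version v1.
fleet = {"dc-east": "v1", "dc-west": "v1", "dc-north": "v1"}


def healthy(name: str, version: str) -> bool:
    # Pretend release v2 is broken everywhere and every other version is fine.
    return version != "v2"


bad_ok = rolling_deploy(fleet, "v2", healthy)   # stops after the first instance
after_bad = dict(fleet)                         # snapshot after the rollback
good_ok = rolling_deploy(fleet, "v3", healthy)  # completes on all instances
```

With `batch_size=1`, at most one instance ever runs the broken release, so the other instances keep serving traffic while the rollback happens.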