I have written before about the load-balanced FusionPBX cluster I offer to the public. Now I will write about the high-availability one. This article tries to answer the questions you are likely to have. I hope that after reading it you will have a clear picture of what a high-availability FusionPBX cluster is and what it is not.

The first thing I need to clarify is that a high-availability cluster is not a load-balanced one, although both are fault-tolerance techniques. The two approaches are not mutually exclusive (you can combine them), but their pros and cons differ and so does the way they work. I will write a comparison between them later; for now, just remember they are not the same.

A Cluster Overview

A high-availability cluster deployment needs at least 5 servers. The following picture shows a basic deployment.

[Figure: generic high-availability FusionPBX cluster]

The elements are:

  • two PBX servers that are responsible for handling the SIP and RTP flow. These servers have FusionPBX, FreeSWITCH, Memcached, the supporting Lua scripts and more installed. One of these servers plays the active role; the other is passive.
  • two or three database servers, which store all the information that FreeSWITCH and FusionPBX use.
  • if there are only two database servers, one arbitrator server, whose only role is to avoid the split-brain issue when the communication between the two database nodes breaks.
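To see why the arbitrator matters, here is a minimal sketch of a majority vote. It assumes the database cluster lets a partition accept writes only when it can see a strict majority of the voting members; that rule is common in replicated databases, but the function names and the rule itself are my assumptions for illustration, not the exact logic of any particular database.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """A partition holds quorum only if it sees a strict majority of votes."""
    return reachable_votes > total_votes // 2


def partition_may_write(db_nodes_in_partition: int,
                        sees_arbitrator: bool,
                        total_db_nodes: int = 2,
                        has_arbitrator: bool = True) -> bool:
    """Decide whether one side of a broken link may keep accepting writes.

    Hypothetical helper: the arbitrator stores no data, it only adds one vote.
    """
    total_votes = total_db_nodes + (1 if has_arbitrator else 0)
    reachable = db_nodes_in_partition
    if has_arbitrator and sees_arbitrator:
        reachable += 1
    return has_quorum(reachable, total_votes)


# Link between the two database nodes breaks; only node A still reaches the arbitrator.
node_a_writes = partition_may_write(1, sees_arbitrator=True)
node_b_writes = partition_may_write(1, sees_arbitrator=False)
```

With two database nodes and no arbitrator, each side holds one vote out of two, so neither can prove it is the majority. The third vote is what lets exactly one side continue.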

Pros, Cons and Side Effects of a High Availability FusionPBX Cluster

Anyone running a cluster needs to remember that it is far from being a stand-alone server. There are internal differences: the information flows are different and, as a consequence, the way it operates is not the same. First, let's list the pros:

  • Fault tolerance: when a node goes down, the other node takes over the load.
  • Continuity: when a node goes down, the end user will notice nothing more than a few seconds of dead air if using UDP.
  • 100% compatible: all phones work with this approach since they do not see the cluster. In the end user's eyes, there is only one server.

Now the cons:

  • The passive server may be considered a waste of resources. It sits idle until the active one fails.
  • All the servers must be in the same data centre. The floating IP technique relies on ARP, so the PBX nodes must share the same layer-2 network (the same broadcast domain). You may work around this if you have an MPLS network between two data centres, but this is very expensive and most data centres do not provide MPLS or BGP solutions. Therefore, if the data centre goes down, you may go down as well. The cheapest fix for this is combining the high availability and load balancing approaches.
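As a quick sanity check of the same-network requirement, the sketch below verifies that a floating IP falls inside the subnet of every PBX node. The addresses are documentation examples and the function name is hypothetical. Note that sharing a subnet is a necessary condition only; the nodes still have to sit on the same layer-2 segment for the ARP update to reach the router.

```python
import ipaddress
from typing import Iterable


def floating_ip_fits(floating_ip: str, node_cidrs: Iterable[str]) -> bool:
    """Return True if the floating IP belongs to the subnet of every node.

    node_cidrs are interface addresses in CIDR form, e.g. "192.0.2.11/24".
    """
    vip = ipaddress.ip_address(floating_ip)
    return all(vip in ipaddress.ip_interface(cidr).network for cidr in node_cidrs)


# Both PBX nodes in the same /24: the floating IP can move between them.
same_site = floating_ip_fits("192.0.2.10", ["192.0.2.11/24", "192.0.2.12/24"])

# Second node in another data centre with a different subnet: it cannot.
two_sites = floating_ip_fits("192.0.2.10", ["192.0.2.11/24", "198.51.100.12/24"])
```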

Some side effects are neither good nor bad, but they are different from what you see on a stand-alone server:

  • File synchronization: as with the Memcached behaviour, you do not know which node a call is hitting. The PBX cluster has a synchronization mechanism (regardless of which one you selected), but the important thing here is to set a synchronization policy. Synchronizing every five seconds is very CPU expensive and wastes resources, which will impact the quality of your service. You can opt for a five-minute synchronization policy or a midnight policy. Whatever you choose, do not forget about it: a very common complaint is an IVR not playing the proper recording, and the cause is that the file has not been synced yet.
  • Rebooting the database servers: whatever the reason for rebooting a database node, never reboot them all at the same time. Reboot one, wait for it to recover, then reboot the next, and so on.
  • ARP caching: any ARP caching feature on your router must be turned off. Some routers cache ARP entries to speed things up, but in this case that can do real harm, as the floating IP technique relies on updating the ARP tables.
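The "reboot one, wait, then reboot the next" rule can be sketched as a small loop. This is only an illustration: `reboot` and `is_healthy` are hypothetical callables you would supply yourself (for example an SSH command and a database status query), and the timeout values are arbitrary.

```python
import time
from typing import Callable, Iterable, List


def rolling_reboot(nodes: Iterable[str],
                   reboot: Callable[[str], None],
                   is_healthy: Callable[[str], bool],
                   timeout_s: float = 600.0,
                   poll_s: float = 5.0,
                   sleep: Callable[[float], None] = time.sleep,
                   clock: Callable[[], float] = time.monotonic) -> List[str]:
    """Reboot database nodes one at a time, never two at once.

    The next node is only touched after the previous one reports healthy.
    Raises TimeoutError and stops if a node does not come back in time.
    """
    finished: List[str] = []
    for node in nodes:
        reboot(node)
        deadline = clock() + timeout_s
        while not is_healthy(node):
            if clock() >= deadline:
                raise TimeoutError(f"{node} did not recover; stopping the rollout")
            sleep(poll_s)
        finished.append(node)
    return finished
```

Stopping at the first node that fails to recover is deliberate: the remaining nodes stay up and keep serving the PBX while you investigate.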

Information Flow

A cluster's information flow is a little different from that of a stand-alone FusionPBX server. In a stand-alone deployment, almost all flows are local. The only external flow is the one related to the endpoints or to bridging a call to the PSTN. A cluster has some extra flows that cross between the servers.

The endpoints see no difference in this case: from their point of view, they are just connecting to a single IP. Each database node replicates its information to the other node.

In the specific case of a high availability cluster, the information flows are almost the same as in a stand-alone deployment. The big difference is the external database. You may ask why not leave the database on the PBX servers. Besides the future performance issues, when the active node goes down, its database goes down with it, and depending on the kind of crash, the database may not recover easily.

Fault Tolerance in a High Availability FusionPBX Cluster

When a fault tolerance event happens, the endpoints keep pointing to the same IP. The floating IP technique used in the high availability PBX cluster detects when the active node stops answering, and at that point the passive node takes over the floating IP. When the move happens, the new active server notifies its neighbours that it now holds the floating IP and runs a FreeSWITCH recovery process. The end user will only notice dead air for a few seconds, after which conversations continue normally.
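The sequence above can be sketched as a tiny state machine running on the passive node. This is a sketch under assumptions, not the actual implementation: the three-missed-heartbeats threshold, the class name and the interface name are hypothetical, and in practice a tool such as Keepalived or Pacemaker usually does this job. The commands are only built, not executed; they use the Linux `ip` tool, the iputils `arping -U` option to send unsolicited ARP updates, and the FreeSWITCH `sofia recover` command.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PassiveNode:
    floating_cidr: str          # e.g. "192.0.2.10/24" (example address)
    interface: str              # e.g. "eth0" (hypothetical interface name)
    max_missed: int = 3         # assumed threshold before declaring the active node dead
    missed: int = 0
    role: str = "passive"

    def on_heartbeat(self, active_answered: bool) -> List[List[str]]:
        """Record one health check; return the takeover commands when failing over."""
        if self.role == "active":
            return []
        self.missed = 0 if active_answered else self.missed + 1
        if self.missed < self.max_missed:
            return []
        self.role = "active"
        return self.takeover_commands()

    def takeover_commands(self) -> List[List[str]]:
        ip_only = self.floating_cidr.split("/")[0]
        return [
            # 1. Take the floating IP.
            ["ip", "addr", "add", self.floating_cidr, "dev", self.interface],
            # 2. Tell the neighbours (router included) who holds the IP now.
            ["arping", "-U", "-c", "3", "-I", self.interface, ip_only],
            # 3. Ask FreeSWITCH to recover the calls that were in progress.
            ["fs_cli", "-x", "sofia recover"],
        ]


node = PassiveNode("192.0.2.10/24", "eth0")
for answered in (True, False, False):
    node.on_heartbeat(answered)      # still passive: only two misses in a row
commands = node.on_heartbeat(False)  # third miss: the node promotes itself
```

Requiring several consecutive misses is a design choice that avoids a failover on a single lost packet, at the cost of a slightly longer period of dead air.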

The following image shows the fault tolerance event.

[Figure: high-availability FusionPBX cluster, active node fails]

When the failed server recovers, it takes on a new role as the passive node. The roles will switch again when the current active node fails.

[Figure: high-availability FusionPBX cluster, passive node becomes the new active]

I hope this gives you a very clear picture of the way a high-availability FusionPBX cluster works.

How do I get a Cluster like this?

Easy, just send me a message on any of my social media accounts. I am sure we will reach an agreement and you will enjoy a brand new PBX cluster.

Good luck!
