Load Balancing
This topic explains how to configure load balancing for an on-prem RabbitMQ cluster and how to prevent, detect, and recover from network partition errors.
Prerequisites
Ensure that the following requirements are met:
- Secret Server On-Premises is properly installed and configured in your environment. For more information about installing and configuring Secret Server, see the Secret Server documentation.
- The cluster has an odd number of nodes (three is the most common number).
- The RabbitMQ cluster nodes are on the same local network (the cluster cannot span WAN links).
- A load balancer is in place to control which node is active.
RabbitMQ Cluster Best Practices
Network Requirements
- The RabbitMQ clusters are not configured to traverse WAN links, due to latency.
- Latency between the nodes must be less than 10 ms.
- Latency between Secret Server nodes and RabbitMQ must be less than 100 ms.
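The latency limits above can be spot-checked from a Secret Server node before go-live. The following is a minimal sketch (Python used purely for illustration; the host name and port are placeholders for your environment) that times TCP connects to a node's AMQP port, which is only a rough proxy for network latency:

```python
import socket
import time

def tcp_rtt_ms(host: str, port: int, attempts: int = 5, timeout: float = 1.0):
    """Median TCP connect time in milliseconds (rough latency proxy).

    Returns None if no connection attempt succeeds.
    """
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            continue  # connection refused or timed out; skip this sample
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    return samples[len(samples) // 2] if samples else None

# Example with a placeholder host:
# tcp_rtt_ms("rabbit1.example.local", 5672)
```

TCP connect time includes handshake overhead, so treat this only as a sanity check; a dedicated network tool gives a more accurate measurement.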
Load‑Balancer Requirements
Your load balancer must present exactly one active RabbitMQ node at any given time; all other nodes remain passive standbys. Under no circumstances should more than one node receive new connections. RabbitMQ does not place a heavy resource load on servers, so there is no need to spread its load across multiple servers.
- Single Active Node
  - Always direct all new TCP connections to one designated RabbitMQ node. Depending on the capabilities of the load balancer, this can be accomplished in several ways, including:
    - Active/Passive Failover
    - Weighted Round Robin (with a very high weight assigned to the primary node)
    - Priority-Based Pools (with the primary node in the highest-priority pool)
  - With the default queue-leader handling (client-local), your active node should be the leader of all queues.
  - Configure all other cluster members as backups or standbys; they accept no client traffic unless the active node fails health checks.
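One way to express the single-active-node pattern is an active/passive TCP pool. As an illustrative sketch only (HAProxy syntax shown here because it is widely known; the server names and IP addresses are hypothetical, and your load balancer's equivalent feature may be named differently):

```
frontend amqp_in
    mode tcp
    bind *:5672
    default_backend rabbitmq_active_passive

backend rabbitmq_active_passive
    mode tcp
    option tcp-check
    # Only rabbit1 receives traffic; the others are standbys ("backup")
    # promoted only when rabbit1 fails its health checks.
    server rabbit1 10.0.0.11:5672 check inter 5s
    server rabbit2 10.0.0.12:5672 check inter 5s backup
    server rabbit3 10.0.0.13:5672 check inter 5s backup
```

The `backup` keyword keeps the standby servers out of rotation while the primary is healthy, which matches the single-active-node requirement above.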
- Health-Based Failover
  - Continuously health-check the active node's AMQP port every 5 seconds:
    - Port 5672 for non-SSL
    - Port 5671 for SSL
  - On any failed health check, immediately mark the active node down, drop inbound TCP connections (RST), and promote exactly one standby as the new active node.
  - Once the original node passes three consecutive health checks, it returns to standby mode; there must still be only one active node.
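A TCP health check of the kind described above amounts to a timed connect against the AMQP port. As a rough sketch of that logic (Python used purely for illustration; the constants mirror the ports and interval stated above):

```python
import socket

AMQP_PORT = 5672      # use 5671 for SSL listeners
CHECK_TIMEOUT = 2.0   # seconds before counting the check as failed

def amqp_port_open(host: str, port: int = AMQP_PORT,
                   timeout: float = CHECK_TIMEOUT) -> bool:
    """Return True if a TCP connection to the node's AMQP port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

In practice the load balancer runs this probe itself every 5 seconds; a script like this is mainly useful for verifying from the outside that the probe target is reachable.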
- TCP Keep-Alive and Idle Timeouts
  - Set an idle timeout of roughly 30 seconds to detect silent failures.
  - Enable TCP keep-alives so half-open sockets on the active node are torn down quickly.
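At the OS/socket level, keep-alive is a per-socket option. A minimal Python illustration of turning it on (tuning the probe interval and count is OS-specific and omitted here; on the load balancer itself this is usually a configuration setting rather than code):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Ask the OS to send keep-alive probes on this connection so a dead
# peer is detected and the socket torn down instead of staying half-open.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
keepalive_enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```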
- Client IP Whitelisting (optional)
  - Whitelist only the Secret Server app-server IPs.
  - All other source IPs are rejected at the load-balancer level.
Troubleshooting
Detecting a Network Partition
Checking for nodes that are down
- Access your RabbitMQ servers, browse to http://localhost:15672/, and log in.
- Select Overview on the top menu.
- Scroll down to see each node listed as running or as "node not running".
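The same node status is available programmatically from the RabbitMQ management HTTP API (`GET /api/nodes` on port 15672), which is convenient for monitoring scripts. A sketch, assuming the management plugin is enabled; the URL and credentials below are placeholders:

```python
import base64
import json
from urllib.request import Request, urlopen

def fetch_nodes(base_url: str, user: str, password: str):
    """GET /api/nodes from the RabbitMQ management API (port 15672)."""
    req = Request(base_url.rstrip("/") + "/api/nodes")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urlopen(req) as resp:
        return json.load(resp)

def down_nodes(nodes) -> list:
    """Names of cluster members reporting running == false."""
    return [n["name"] for n in nodes if not n.get("running", False)]

# Example with placeholder URL and credentials:
# down_nodes(fetch_nodes("http://localhost:15672", "guest", "guest"))
```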
Checking queue leaders
- Select Queues and Exchanges on the top menu. This shows all your queues.
- The "Node" column tells you which node is currently the active replica (leader) for each quorum queue.
- A sudden change in which node appears as leader for many queues indicates a cluster-level issue.
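The queue-to-node mapping can also be pulled from the management API (`GET /api/queues`, whose `node` field is what the UI's Node column shows); counting queues per node makes a sudden leader shift easy to spot. A sketch, with placeholder URL and credentials:

```python
import base64
import json
from collections import Counter
from urllib.request import Request, urlopen

def fetch_queues(base_url: str, user: str, password: str):
    """GET /api/queues from the RabbitMQ management API (port 15672)."""
    req = Request(base_url.rstrip("/") + "/api/queues")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urlopen(req) as resp:
        return json.load(resp)

def queues_per_node(queues) -> Counter:
    """How many queues report each node as their home node.

    In a correctly configured single-active-node setup, one node
    should account for (nearly) all queues.
    """
    return Counter(q["node"] for q in queues)
```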
Checking logs for partitions
- Review the logs in C:\RabbitMq\log for the following messages:
  - node partitioned: The node detected it was cut off from its peers.
  - unable to contact node: Heartbeats to another node failed.
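Scanning the log directory for those two markers can be scripted. A small sketch (the log path matches the one given above; the exact message wording is an assumption and may vary across RabbitMQ versions):

```python
from pathlib import Path

PARTITION_MARKERS = ("node partitioned", "unable to contact node")

def partition_lines(lines):
    """Return (line_number, text) pairs for lines that hint at a partition."""
    return [(i, line.rstrip()) for i, line in enumerate(lines, 1)
            if any(marker in line for marker in PARTITION_MARKERS)]

def scan_log_dir(log_dir: str):
    """Scan every *.log file in log_dir (e.g. C:\\RabbitMq\\log)."""
    return {f.name: partition_lines(f.read_text(errors="replace").splitlines())
            for f in Path(log_dir).glob("*.log")}
```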
Forcing a Single RabbitMQ Cluster Node to Boot Without Partners
If a standby or minority-paused node needs to rejoin, perform these steps:
- Run an elevated PowerShell prompt.
- cd "$env:ProgramFiles\RabbitMQ Server\rabbitmq_server-*\sbin"
- .\rabbitmqctl.bat stop_app
- .\rabbitmqctl.bat reset
- .\rabbitmqctl.bat force_boot
- .\rabbitmqctl.bat start_app
- .\rabbitmqctl.bat cluster_status and verify the node appears under "running_nodes".

Only use force_boot when absolutely necessary; any stale state on that node will be discarded.
Troubleshooting Cheat Sheet
| Symptom | Likely Cause | Immediate Action |
|---|---|---|
| Secret Server cannot connect after failover. | The load balancer didn't tear down old connections. | Check load balancer logs for RST on the AMQP port; adjust the health-check method. |
| Standby never promotes. | The load balancer health check is configured incorrectly. | Verify that the health check tests the correct ports and fails over to another node. |
| Many queues switch leader at once. | Network drop between nodes. | Inspect logs for "node partitioned"; verify network stability. |
| Disk-full shutdown. | Disk usage exceeded the configured limit. | Free up space on the disk. |
| Queues show different nodes as the running node instead of just the primary. | The load balancer is directing traffic to multiple RabbitMQ nodes. | Correct the load balancer policy and reset the RabbitMQ queues by following this Delinea Knowledge Base article. |
| Network partitions occurring. | Network problems or latency greater than 1 second between nodes. | Resolve network and/or latency issues and check the items in this Delinea Knowledge Base article. |
Additional Best Practices
- Version uniformity: Run the same Erlang and RabbitMQ versions on all nodes.
- Network time sync: Keep node clocks synchronized to avoid clock skew impacting elections.