Load Balancing

This topic explains how to configure load balancing for an on-premises RabbitMQ cluster used with Secret Server, and how to prevent, detect, and recover from network partition errors.

Prerequisites

Ensure that the following requirements are met:

  • Secret Server On-Premises is properly installed and configured in your environment. For more information about installing and configuring Secret Server, see the Secret Server documentation.

  • The cluster has an odd number of nodes (three is the most common number).

  • The RabbitMQ cluster nodes are on the same local network (a cluster cannot span WAN links).

  • A load balancer is in place to control which node is active.

RabbitMQ Cluster Best Practices

Network Requirements

  • Do not configure RabbitMQ clusters to traverse WAN links; clustering is sensitive to latency.

  • Latency between the nodes must be <10 ms.

  • Latency between Secret Server nodes and RabbitMQ must be <100 ms.

Load‑Balancer Requirements

Your load balancer must present exactly one active RabbitMQ node at any given time; all other nodes remain passive standbys. Under no circumstances should more than one node receive new connections. RabbitMQ does not place a heavy resource load on servers, so there is no need to spread the load across servers.

Configure your load balancer as a Layer 4 (TCP) balancer to maintain persistent AMQP connections. Layer 7 balancing is not recommended for AMQP traffic because it may interfere with the protocol's stateful nature.

Single Active Node

  • Always direct all new TCP connections to one designated RabbitMQ node. Depending on the capabilities of the load balancer, this can be accomplished in a few different ways, including the following:

    • Active/Passive Failover

    • Weighted Round Robin (with a very high weight assigned to the primary node)

    • Priority-Based Pools (with the primary node in the highest priority)

  • With the default leader placement (client-local), your active node should be the leader of all queues (see the verification sketch after this list).

  • Configure all other cluster members as backup or standby; they accept no client traffic unless the active node fails health checks.
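
You can spot-check leader placement from any cluster node. The following is a minimal sketch, assuming quorum queues (which expose the leader queue info item) and a default installation path; every row returned should name the same active node:

    # Run from an elevated PowerShell prompt on any cluster node
    cd "$env:ProgramFiles\RabbitMQ Server\rabbitmq_server-*\sbin"
    # List each queue and the node currently acting as its leader
    .\rabbitmqctl.bat list_queues name leader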

Health-Based Failover

  • Health-check the active node every 5 seconds.

  • On any failed health check, immediately mark the active node down, drop inbound TCP (RST), and promote exactly one standby as the new active.

  • Once the original node passes three consecutive checks, it returns to standby mode; at all times, exactly one node must be active.

Health Check Methods

TCP port checks:

  • Interval: Check every 5 seconds.

  • Threshold: Mark unhealthy after 2–3 consecutive failures.

  • Method: TCP connection establishment (SYN/ACK verification).

  • Advantage: Requires no complicated load balancer configuration (see the probe sketch after this list).
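
If you want to validate this check outside the load balancer, the following is a minimal PowerShell sketch of the same TCP test with the interval and threshold above. The host name rabbit-active.example.local and port 5672 (5671 for TLS) are placeholders for your environment:

    # Probe the AMQP port every 5 seconds; report unhealthy after
    # 3 consecutive failures
    $failures = 0
    while ($true) {
        $ok = Test-NetConnection -ComputerName 'rabbit-active.example.local' `
                                 -Port 5672 -InformationLevel Quiet
        if ($ok) { $failures = 0 } else { $failures++ }
        if ($failures -ge 3) { Write-Warning 'Active node unhealthy' }
        Start-Sleep -Seconds 5
    }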

Failover Routing Behavior

  • Clustered node failover: When failing over between nodes within the same RabbitMQ cluster, the load balancer transfers connections smoothly. The shared cluster state allows seamless transition.

  • Non-clustered node failover (DR scenario): When failing over to a node outside the current cluster, the load balancer must drop all existing connections. This forces Secret Server and distributed engines to re-establish connections, reconfigure queue subscriptions, and requeue any lost work items.

Connection Reset on Failover

Terminology differs between vendors, so refer to your vendor's documentation on how to implement this.

  • Look for settings labeled "Connection reset," "Hard failover," "Immediate termination," or "TCP reset (RST)."

  • You may need to disable "Graceful shutdown" or "Graceful connection draining."

This ensures that clients detect the failure immediately and reinitialize their RabbitMQ configuration.
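
To confirm that a failover actually severed the old sessions, you can inspect outbound AMQP connections from the Secret Server machine. This is a minimal sketch assuming the default AMQP ports (5672 and 5671):

    # Run on the Secret Server node after a failover; any surviving
    # connections to the old active node indicate the load balancer is
    # draining connections instead of resetting them
    Get-NetTCPConnection -RemotePort 5672, 5671 -ErrorAction SilentlyContinue |
        Select-Object RemoteAddress, RemotePort, State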

TCP Keep‑Alive & Idle Timeouts

  • Set the idle timeout to approximately 30 seconds to detect silent failures.

  • Enable TCP keep-alives so half-open sockets on the active node are torn down quickly (a broker-side configuration sketch follows this list).
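
You can also enable keep-alives on RabbitMQ's own listener sockets. This is a minimal rabbitmq.conf excerpt, not a complete configuration; apply it on each node and restart the node:

    # rabbitmq.conf: enable OS-level TCP keep-alives on listener sockets
    tcp_listen_options.keepalive = true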

Client IP Whitelisting (optional)

  • Whitelist only the Secret Server app server and distributed engine IPs.

  • Reject all other source IPs at the load balancer level (see the verification sketch after this list).
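
To verify that only the expected clients are connecting, you can list connection sources on the active node. A minimal sketch, assuming a default installation path; every row should show a whitelisted Secret Server or distributed engine IP:

    cd "$env:ProgramFiles\RabbitMQ Server\rabbitmq_server-*\sbin"
    # List the source host, user, and state of every client connection
    .\rabbitmqctl.bat list_connections peer_host user state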

SSL/TLS Configuration

Configure TLS Pass-Through for encrypted AMQP traffic. The load balancer must forward encrypted traffic directly to RabbitMQ nodes without termination or inspection.

TLS termination at the load balancer is not supported. Clients validate certificates against the RabbitMQ servers directly.
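
You can confirm pass-through by completing a TLS handshake through the load balancer and inspecting the certificate that comes back; it should be the RabbitMQ node's certificate, not one issued to the load balancer. A minimal PowerShell sketch; rabbit.example.local and port 5671 are placeholders for your environment:

    $tcp = [System.Net.Sockets.TcpClient]::new('rabbit.example.local', 5671)
    # Accept any certificate here; the goal is inspection, not validation
    $cb  = [System.Net.Security.RemoteCertificateValidationCallback]{ $true }
    $ssl = [System.Net.Security.SslStream]::new($tcp.GetStream(), $false, $cb)
    $ssl.AuthenticateAsClient('rabbit.example.local')
    # The Subject should identify the RabbitMQ node, not the load balancer
    [System.Security.Cryptography.X509Certificates.X509Certificate2]$ssl.RemoteCertificate |
        Format-List Subject, Issuer, NotAfter
    $ssl.Dispose(); $tcp.Dispose()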

Troubleshooting

Detecting a Network Partition

Checking for nodes that are down

  1. On each RabbitMQ server, browse to http://localhost:15672/ and log in.

  2. Select Overview on the top menu.

  3. Scroll down to the Nodes section; each node is listed as running or node not running. (You can also check node status from the command line, as shown below.)
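
A minimal command-line alternative, assuming a default installation path:

    cd "$env:ProgramFiles\RabbitMQ Server\rabbitmq_server-*\sbin"
    # Any cluster member missing from the running nodes list is down
    # or unreachable
    .\rabbitmqctl.bat cluster_status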

Checking queue leaders

  • Select Queues and Exchanges on the top menu.

  • This shows all your queues. The “Node” column tells you which node is currently the active replica for each quorum queue.

  • A sudden change in which node appears as Leader for many queues indicates a cluster-level issue.

Checking logs for partitions

  • Review the log files in C:\RabbitMq\log and look for the following entries (a search sketch follows this list):

    • node partitioned: The node detected it was cut off from peers.

    • unable to contact node: Heartbeats to another node failed.
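
A quick way to scan every log file for these entries, as a minimal sketch using the log path above:

    # Search all RabbitMQ log files for partition-related entries
    Select-String -Path 'C:\RabbitMq\log\*.log' `
                  -Pattern 'partitioned', 'unable to contact node' |
        Select-Object Filename, LineNumber, Line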

Forcing a Single RabbitMQ Cluster Node to Boot Without Partners

If a standby or minority‐paused node needs to rejoin, perform these steps:

  1. Open an elevated PowerShell prompt.

  2. Change to the RabbitMQ sbin directory:

     cd "$env:ProgramFiles\RabbitMQ Server\rabbitmq_server-*\sbin"

  3. Stop the RabbitMQ application (the underlying Erlang node keeps running):

     .\rabbitmqctl.bat stop_app

  4. Reset the node, discarding its stale cluster state:

     .\rabbitmqctl.bat reset

  5. Force the node to boot without waiting for its cluster partners:

     .\rabbitmqctl.bat force_boot

  6. Start the RabbitMQ application:

     .\rabbitmqctl.bat start_app

  7. Check the cluster status:

     .\rabbitmqctl.bat cluster_status

     Verify that the node appears under "running_nodes".

Only use force_boot when absolutely necessary. A stale state on that node will be discarded.

Troubleshooting Cheat Sheet

  • Symptom: Secret Server cannot connect after failover.
    Likely cause: The load balancer did not tear down the old connections.
    Immediate action: Check the load balancer logs for RSTs on the AMQP port; adjust the health-check method.

  • Symptom: A standby node never promotes.
    Likely cause: The load balancer health check is configured incorrectly.
    Immediate action: Verify that the health check tests the correct ports and fails over to another node.

  • Symptom: Many queues switch leader at once.
    Likely cause: A network drop between nodes.
    Immediate action: Inspect the logs for "node partitioned"; verify network stability.

  • Symptom: Disk-full shutdown.
    Likely cause: Disk usage exceeded the configured limit.
    Immediate action: Free up space on the disk.

  • Symptom: Queues show different nodes as the running node instead of just the primary.
    Likely cause: The load balancer is directing traffic to multiple RabbitMQ nodes.
    Immediate action: Correct the load balancer policy and reset the RabbitMQ queues by following this Delinea Knowledge Base article.

  • Symptom: Network partitions occurring.
    Likely cause: Network problems or latency greater than 1 second between nodes.
    Immediate action: Resolve the network and/or latency issues and check the items in this Delinea Knowledge Base article.

  • Symptom: TLS handshake failures after load balancer implementation.
    Likely cause: The load balancer is performing TLS termination, or the certificate does not include the load balancer hostname.
    Immediate action: Configure TLS pass-through and disable SSL offloading. Validate that the certificate is valid for all applicable hostnames.

Additional Best Practices

  • Version uniformity: Run the same Erlang and RabbitMQ versions on all nodes (see the verification sketch below).

  • Network time sync: Keep node clocks synchronized (for example, with NTP) to avoid clock skew impacting leader elections.
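
To verify version uniformity, you can query each node with rabbitmq-diagnostics. A minimal sketch, assuming a default installation path; rabbit@NODE1 is a placeholder for each cluster member's node name:

    cd "$env:ProgramFiles\RabbitMQ Server\rabbitmq_server-*\sbin"
    # Repeat for every cluster member; all nodes should report the same
    # RabbitMQ and Erlang versions
    .\rabbitmq-diagnostics.bat -n rabbit@NODE1 server_version
    .\rabbitmq-diagnostics.bat -n rabbit@NODE1 erlang_version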