Load Balancing

This topic explains how to configure load balancing for an on-premises RabbitMQ cluster and how to prevent, detect, and recover from network partition errors.

Prerequisites

Ensure that the following requirements are met:

  • Secret Server On-Premises is properly installed and configured in your environment. For more information about installing and configuring Secret Server, see the Secret Server documentation.

  • The cluster has an odd number of nodes (three is the most common number).

  • The RabbitMQ cluster nodes are on the same local network (a cluster cannot span WAN links).

  • A load balancer is in place to control which node is active.

RabbitMQ Cluster Best Practices

Network Requirements

  • RabbitMQ clusters must not be configured to traverse WAN links, because cluster communication is latency-sensitive.

  • Latency between the nodes must be <10 ms.

  • Latency between Secret Server nodes and RabbitMQ must be <100 ms.
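A quick way to spot-check round-trip latency between nodes is a short ping sample. The following PowerShell sketch is illustrative only; the host name rabbit2 is a placeholder for one of your cluster nodes.

    # Measure the average round-trip time to a peer cluster node.
    # "rabbit2" is a placeholder; substitute one of your node host names.
    $pings = Test-Connection -ComputerName "rabbit2" -Count 10
    # Windows PowerShell 5.1 exposes ResponseTime; PowerShell 7+ names it Latency.
    $avg = ($pings | Measure-Object -Property ResponseTime -Average).Average
    Write-Host "Average RTT: $avg ms (node-to-node latency must stay under 10 ms)"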

Load‑Balancer Requirements

Your load balancer must present exactly one active RabbitMQ node at any given time; all other nodes remain passive standbys. Under no circumstances should more than one node receive new connections. RabbitMQ places little resource load on its servers, so there is no need to spread connections across nodes.

  1. Single Active Node

    • Always direct all new TCP connections to one designated RabbitMQ node. Depending on your load balancer's capabilities, this can be accomplished in a few different ways, including the following:

      • Active/Passive Failover

      • Weighted Round Robin (with a very high weight assigned to the primary node)

      • Priority-Based Pools (with the primary node in the highest priority)

    • With RabbitMQ's default queue leader placement (client-local), the active node becomes the leader of all queues, because every queue is declared over a connection to that node.

    • Configure all other cluster members as backup or standby; they accept no client traffic unless the active node fails health checks.

  2. Health-Based Failover

    • Continuously health-check the active node's AMQP port every 5 seconds (see the probe sketch after this list).

      • Port 5672 for non-SSL

      • Port 5671 for SSL

    • On any failed health check, immediately mark the active node down, drop inbound TCP (RST), and promote exactly one standby as the new active.

    • Once the original node passes three consecutive checks, it returns to standby mode; only one node must ever be active.

  3. TCP Keep-Alive and Idle Timeouts

    • Set the idle timeout to roughly 30 seconds to detect silent failures.

    • Enable TCP keep‑alives so half‑open sockets on the active node are torn down quickly.

  4. Client IP Whitelisting (optional)

    • Whitelist only the Secret Server app‑server IPs.

    • All other source IPs are rejected at the LB level.
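To illustrate the intended failover behavior, the following PowerShell sketch probes each node's AMQP port in priority order and reports which node should currently be active. It is a monitoring aid under assumed host names (rabbit1 through rabbit3), not a substitute for the load balancer's own health checks.

    # Probe each RabbitMQ node's AMQP port in priority order and report
    # which node the load balancer should be treating as active.
    # Host names are placeholders; use port 5671 for SSL listeners.
    $nodes = @("rabbit1", "rabbit2", "rabbit3")   # highest priority first
    $port  = 5672

    $active = $null
    foreach ($node in $nodes) {
        $probe = Test-NetConnection -ComputerName $node -Port $port -WarningAction SilentlyContinue
        if ($probe.TcpTestSucceeded) {
            $active = $node   # first healthy node in priority order wins
            break
        }
    }

    if ($active) { Write-Host "Active node should be: $active" }
    else { Write-Host "No node is answering on port $port" }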

Troubleshooting

Detecting a Network Partition

Checking for nodes that are down

  1. On each RabbitMQ server, browse to http://localhost:15672/ and log in to the management UI.

  2. Select Overview on the top menu.

  3. Scroll down to the list of nodes; each node is shown as either running or node not running.
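
The same node status is available from the management HTTP API, which is convenient for scripted checks. A minimal PowerShell sketch, assuming a management-enabled user (the guest account works only from localhost):

    # List each cluster node with its running state and any detected partitions.
    # Replace the credentials with a real management user.
    $pair    = "guest:guest"
    $headers = @{ Authorization = "Basic " + [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes($pair)) }

    $nodes = Invoke-RestMethod -Uri "http://localhost:15672/api/nodes" -Headers $headers
    foreach ($n in $nodes) {
        # "running" is false for a node the cluster considers down; a non-empty
        # "partitions" list means that node currently sees a network partition.
        Write-Host ("{0}  running={1}  partitions={2}" -f $n.name, $n.running, ($n.partitions -join ", "))
    }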

Checking queue leaders

  • Select Queues and Exchanges on the top menu.

  • This shows all your queues. The “Node” column shows which node currently hosts the leader replica of each quorum queue.

  • A sudden change in which node appears as Leader for many queues indicates a cluster-level issue.
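To review leader placement across all queues at once, you can group the management API's queue list by node. A minimal sketch, using the same placeholder credentials as the earlier example; with a correctly configured load balancer, nearly every queue should sit on the active node.

    # Count how many queues each node currently hosts as leader.
    $pair    = "guest:guest"
    $headers = @{ Authorization = "Basic " + [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes($pair)) }

    $queues = Invoke-RestMethod -Uri "http://localhost:15672/api/queues" -Headers $headers
    $queues | Group-Object -Property node | Sort-Object -Property Count -Descending |
        ForEach-Object { Write-Host ("{0,5} queue(s) on {1}" -f $_.Count, $_.Name) }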

Checking logs for partitions

  • Review the log in C:\RabbitMq\log.

    • node partitioned: The node detected it was cut off from peers.

    • unable to contact node: Heartbeats to another node failed.
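A quick way to scan all log files for these markers, assuming the log directory above:

    # Search the RabbitMQ logs for partition-related entries.
    Select-String -Path "C:\RabbitMq\log\*.log" -Pattern "node partitioned", "unable to contact node" |
        Select-Object -Property Filename, LineNumber, Line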

Forcing a Single RabbitMQ Cluster Node to Boot Without Partners

If a standby or minority-paused node needs to rejoin, perform these steps:

  1. Open an elevated PowerShell prompt.

  2. cd "$env:ProgramFiles\RabbitMQ Server\rabbitmq_server-*\sbin"

  3. .\rabbitmqctl.bat stop_app (stops the RabbitMQ application; the Erlang runtime stays up)

  4. .\rabbitmqctl.bat reset (returns the node to a blank state, discarding its local data and cluster membership)

  5. .\rabbitmqctl.bat force_boot (forces the node to boot without waiting for its cluster peers)

  6. .\rabbitmqctl.bat start_app

  7. .\rabbitmqctl.bat cluster_status

    Verify that the node appears under "running_nodes" in the output.

Only use force_boot when absolutely necessary; any stale state on that node is discarded.
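
Before resorting to force_boot, it can help to confirm that the node actually reports a partition. A minimal check against the management API (which only responds while the management plugin on that node is still reachable), using the same placeholder credentials as earlier:

    # Report any node whose "partitions" list is non-empty.
    $pair    = "guest:guest"
    $headers = @{ Authorization = "Basic " + [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes($pair)) }

    $nodes = Invoke-RestMethod -Uri "http://localhost:15672/api/nodes" -Headers $headers
    $nodes | Where-Object { $_.partitions.Count -gt 0 } |
        ForEach-Object { Write-Host ("{0} is partitioned from: {1}" -f $_.name, ($_.partitions -join ", ")) }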

Troubleshooting Cheat Sheet

  • Symptom: Secret Server cannot connect after failover.
    Likely cause: The load balancer did not tear down the old connections.
    Immediate action: Check the load balancer logs for RST on the AMQP port; adjust the health-check method.

  • Symptom: A standby node never promotes.
    Likely cause: The load balancer health check is configured incorrectly.
    Immediate action: Verify that the health check tests the correct ports and fails over to another node.

  • Symptom: Many queues switch leader at once.
    Likely cause: A network drop between nodes.
    Immediate action: Inspect the logs for “node partitioned”; verify network stability.

  • Symptom: Disk-full shutdown.
    Likely cause: Disk usage exceeded the configured limit.
    Immediate action: Free up space on the disk.

  • Symptom: Queues show different nodes as the running node instead of just the primary.
    Likely cause: The load balancer is directing traffic to multiple RabbitMQ nodes.
    Immediate action: Correct the load balancer policy and reset the RabbitMQ queues by following this Delinea Knowledge Base article.

  • Symptom: Network partitions keep occurring.
    Likely cause: Network problems or latency greater than 1 second between nodes.
    Immediate action: Resolve the network or latency issues and check the items in this Delinea Knowledge Base article.

Additional Best Practices

  • Version uniformity: Run the same Erlang and RabbitMQ versions on all nodes.

  • Network time sync: Keep node clocks synchronized (for example, with NTP) so clock skew does not affect leader elections.