Coordinator HA
Active/passive failover for paraloom's coordinator role with sub-30s RTO.
Coordinator high availability
Paraloom uses an active/passive coordinator model. Exactly one coordinator is primary at any time; the rest are passives that watch the primary's heartbeat and stand by to take over. The recovery time objective (RTO) is under 30 seconds, as asserted by the kill-the-primary scenario test.
Why a coordinator
The coordinator role drives:
- Job distribution (compute jobs assigned across the validator cohort)
- Fee distribution (withdrawal fees → leader)
- State snapshot replication (so a passive can resume mid-round if the primary dies)
It is not a single point of trust — its decisions are still verified by the BFT cohort. But it is a single point of coordination. HA exists so that a primary crash doesn't pause the network.
Roles
| Role | What it does |
|---|---|
| Primary | Handles incoming jobs, broadcasts heartbeats, replicates state to passives |
| Passive | Listens for primary heartbeat, applies replicated state, becomes primary on failover |
The role state machine lives in src/coordinator/role.rs. The snapshot data structure is in src/coordinator/mod.rs.
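The two roles and the failover transition can be pictured as a small state machine. The sketch below is illustrative only: the names Role, RoleEvent, and on_event are assumptions, not the actual contents of src/coordinator/role.rs.

```rust
/// Hypothetical role type; the real definition lives in src/coordinator/role.rs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    /// Handles incoming jobs, broadcasts heartbeats, replicates state.
    Primary,
    /// Watches the heartbeat and applies replicated state.
    Passive,
}

/// Election outcomes that can change a role (names are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RoleEvent {
    /// This node started an election after a heartbeat timeout and won.
    ElectionWon,
    /// This node took part in an election and another node won.
    ElectionLost,
}

impl Role {
    /// Returns the role after applying `event`.
    pub fn on_event(self, event: RoleEvent) -> Role {
        match (self, event) {
            // Failover: the winning passive takes over as primary.
            (Role::Passive, RoleEvent::ElectionWon) => Role::Primary,
            // Every other combination leaves the role unchanged.
            (role, _) => role,
        }
    }
}
```

Keeping the transition in a pure function like this makes the failover rule easy to unit test without starting a node.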
Heartbeat protocol
The primary broadcasts a heartbeat over the gossipsub heartbeat topic every 5 seconds. Each passive maintains a watchdog timer; if no heartbeat arrives for longer than the failover threshold, the passive starts an election.
The broadcast and watchdog loops are spawned by Node::run and aborted by Node::stop — see src/node/mod.rs.
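The watchdog logic on a passive can be sketched as a timestamp plus a threshold check. This is a minimal sketch using only the standard library; the Watchdog type and its method names are hypothetical, and the real loops in src/node/mod.rs may be structured differently (for example as async tasks).

```rust
use std::time::{Duration, Instant};

/// Hypothetical watchdog kept by each passive coordinator.
pub struct Watchdog {
    last_heartbeat: Instant,
    threshold: Duration,
}

impl Watchdog {
    /// Starts the watchdog as if a heartbeat had just been seen at `now`.
    pub fn new(now: Instant, threshold: Duration) -> Self {
        Watchdog { last_heartbeat: now, threshold }
    }

    /// Called whenever a heartbeat from the primary is received.
    pub fn record_heartbeat(&mut self, now: Instant) {
        self.last_heartbeat = now;
    }

    /// True once the silence has lasted longer than the failover threshold.
    pub fn should_start_election(&self, now: Instant) -> bool {
        // saturating_duration_since yields zero if `now` is earlier than the
        // last heartbeat, so a reordered timestamp never triggers failover.
        now.saturating_duration_since(self.last_heartbeat) > self.threshold
    }
}
```

Passing the current time in as an argument, rather than calling Instant::now() inside the methods, keeps the check deterministic in tests.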
State replication
The primary replicates a CoordinatorSnapshot containing:
- Current job queue position
- In-flight withdrawal verification rounds
- Recent fee accounting
Passives apply this snapshot to local state on each heartbeat. When a passive becomes primary, it picks up where the previous primary left off — no jobs are lost or double-assigned.
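As a rough illustration of the replication step, the sketch below models a snapshot with the three fields listed above and a passive that keeps the most recent one. Only the name CoordinatorSnapshot comes from the source; the field names, field types, and the PassiveState helper are assumptions made for this example.

```rust
/// Sketch of the replicated snapshot; field names and types are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CoordinatorSnapshot {
    /// Current job queue position.
    pub job_queue_position: u64,
    /// Identifiers of in-flight withdrawal verification rounds.
    pub in_flight_rounds: Vec<u64>,
    /// Recent fee accounting as (round id, fee amount) pairs.
    pub recent_fees: Vec<(u64, u64)>,
}

/// Hypothetical local state held by a passive coordinator.
#[derive(Debug, Default)]
pub struct PassiveState {
    latest: Option<CoordinatorSnapshot>,
}

impl PassiveState {
    /// Applies the snapshot that arrived with a heartbeat.
    pub fn apply(&mut self, snapshot: CoordinatorSnapshot) {
        self.latest = Some(snapshot);
    }

    /// Queue position to resume from after promotion, if any snapshot was seen.
    pub fn resume_position(&self) -> Option<u64> {
        self.latest.as_ref().map(|s| s.job_queue_position)
    }
}
```

Because the passive always holds the last snapshot it received, a newly promoted primary can resume from resume_position() instead of restarting the queue.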
Settings
| Setting | Default | Notes |
|---|---|---|
| Heartbeat interval | 5 s | configurable |
| Failover threshold | 15 s (3 missed beats) | configurable |
| Snapshot frequency | every heartbeat | always co-broadcast |
| Election quorum | majority of registered coordinators | uses validator stake weights |
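The defaults in the table could be captured in a settings struct like the one below. This is a sketch under stated assumptions: HaSettings, its field names, and has_quorum are hypothetical, and the quorum check assumes that "majority" means strictly more than half of the total registered stake, which the table does not spell out.

```rust
use std::time::Duration;

/// Hypothetical HA settings mirroring the defaults in the table above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct HaSettings {
    pub heartbeat_interval: Duration,
    pub failover_threshold: Duration,
}

impl Default for HaSettings {
    fn default() -> Self {
        HaSettings {
            heartbeat_interval: Duration::from_secs(5),
            failover_threshold: Duration::from_secs(15),
        }
    }
}

impl HaSettings {
    /// Whole heartbeats that fit in the failover threshold.
    /// Returns None if the heartbeat interval is zero.
    pub fn missed_beats(&self) -> Option<u128> {
        self.failover_threshold
            .as_millis()
            .checked_div(self.heartbeat_interval.as_millis())
    }
}

/// Assumed quorum rule: the voting stake must exceed half of the total stake.
pub fn has_quorum(voting_stake: u64, total_stake: u64) -> bool {
    // Widen to u128 so doubling the voting stake cannot overflow.
    (voting_stake as u128) * 2 > total_stake as u128
}
```

With the defaults, the threshold covers three heartbeat intervals, which matches the "3 missed beats" note in the table.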
Scenario test
The kill-the-primary scenario lives in tests/coordinator_tests.rs. It boots a primary plus two passives, kills the primary, and asserts a passive becomes primary with state continuity in under 30 seconds.
This test was the closing acceptance criterion for issue #66.