Coordinator high availability

Paraloom uses an active/passive coordinator model. Exactly one coordinator is primary at any time; the others are passives that watch the primary's heartbeat and stand by to take over. The recovery time objective (RTO) is under 30 seconds, which the kill-the-primary scenario test asserts.

Why a coordinator

The coordinator role drives:

  • Job distribution (compute jobs assigned across the validator cohort)
  • Fee distribution (withdrawal fees → leader)
  • State snapshot replication (so a passive can resume mid-round if the primary dies)

The coordinator is not a single point of trust: its decisions are still verified by the BFT cohort. It is, however, a single point of coordination, so HA exists to keep a primary crash from pausing the network.

Roles

| Role | What it does |
|---|---|
| Primary | Handles incoming jobs, broadcasts heartbeats, replicates state to passives |
| Passive | Listens for primary heartbeat, applies replicated state, becomes primary on failover |

The role state machine lives in src/coordinator/role.rs. The snapshot data structure is defined in src/coordinator/mod.rs.
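
To make the two roles concrete, here is a minimal sketch of a role state machine. It is not the code in src/coordinator/role.rs: the `Role`, `RoleEvent`, and `on_event` names are hypothetical, and the step-down transition (a primary returning to passive when it observes another primary) is an assumption that the text above does not confirm.

```rust
/// Hypothetical sketch of the two coordinator roles described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Primary,
    Passive,
}

/// Illustrative events that can drive a role change (names are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RoleEvent {
    /// This node won a failover election.
    ElectionWon,
    /// Another node was observed acting as primary (assumed step-down trigger).
    PrimaryObserved,
}

impl Role {
    /// Returns the role that results from applying `event` to `self`.
    pub fn on_event(self, event: RoleEvent) -> Role {
        match (self, event) {
            // A passive that wins the election takes over as primary.
            (Role::Passive, RoleEvent::ElectionWon) => Role::Primary,
            // Assumption: a primary yields if it sees another primary.
            (Role::Primary, RoleEvent::PrimaryObserved) => Role::Passive,
            // Every other combination leaves the role unchanged.
            (role, _) => role,
        }
    }
}
```

Modelling the transition as a pure function keeps the role logic testable without any networking, which is one plausible reason to isolate it in its own module.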

Heartbeat protocol

The primary broadcasts a heartbeat over the gossipsub heartbeat topic every 5 seconds. Each passive maintains a watchdog timer; if no heartbeat arrives within the failover threshold (15 seconds by default, that is, three missed beats), the passive starts an election.

The broadcast and watchdog loops are spawned by Node::run and aborted by Node::stop; see src/node/mod.rs.
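
The watchdog logic described in this section can be sketched as follows. This is an illustration using only the standard library, not the loop spawned by Node::run; the `Watchdog` type and its method names are hypothetical. Timestamps are passed in explicitly so the behaviour can be checked without sleeping.

```rust
use std::time::{Duration, Instant};

/// Hypothetical watchdog kept by a passive coordinator.
pub struct Watchdog {
    /// How long the primary may stay silent before an election starts.
    threshold: Duration,
    /// When the most recent heartbeat was received.
    last_heartbeat: Instant,
}

impl Watchdog {
    /// Starts the watchdog, treating `now` as the time of the last heartbeat.
    pub fn new(threshold: Duration, now: Instant) -> Self {
        Self { threshold, last_heartbeat: now }
    }

    /// Resets the timer when a heartbeat arrives from the primary.
    pub fn record_heartbeat(&mut self, now: Instant) {
        self.last_heartbeat = now;
    }

    /// True once the silence has lasted strictly longer than the threshold.
    pub fn should_start_election(&self, now: Instant) -> bool {
        // saturating_duration_since yields zero if `now` precedes the last beat.
        now.saturating_duration_since(self.last_heartbeat) > self.threshold
    }
}
```

With the default 15-second threshold, a passive that last heard from the primary at time t would stay quiet at t + 15 s and start an election on the first check after that.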

State replication

The primary replicates a CoordinatorSnapshot containing:

  • Current job queue position
  • In-flight withdrawal verification rounds
  • Recent fee accounting

Passives apply this snapshot to local state on each heartbeat. When a passive becomes primary, it picks up where the previous primary left off — no jobs are lost or double-assigned.
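
As a rough illustration of snapshot replication, the sketch below shows a passive replacing its local copy on each heartbeat and reading the resume point after promotion. Only the `CoordinatorSnapshot` name comes from the text above; the field names, field types, and the `PassiveState` wrapper are assumptions, and the real structure in src/coordinator/mod.rs may differ.

```rust
/// Sketch of the replicated snapshot; all fields and types are hypothetical.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct CoordinatorSnapshot {
    /// Current job queue position.
    pub job_queue_position: u64,
    /// Identifiers of in-flight withdrawal verification rounds.
    pub in_flight_rounds: Vec<u64>,
    /// Recent fee accounting as (round id, fee amount) pairs.
    pub recent_fees: Vec<(u64, u64)>,
}

/// Hypothetical local state held by a passive coordinator.
#[derive(Debug, Default)]
pub struct PassiveState {
    snapshot: CoordinatorSnapshot,
}

impl PassiveState {
    /// Replaces the local copy with the snapshot carried by the latest heartbeat.
    pub fn apply(&mut self, incoming: CoordinatorSnapshot) {
        self.snapshot = incoming;
    }

    /// Queue position a newly promoted primary would resume from.
    pub fn resume_position(&self) -> u64 {
        self.snapshot.job_queue_position
    }

    /// Verification rounds that were still open when the last snapshot arrived.
    pub fn open_rounds(&self) -> &[u64] {
        &self.snapshot.in_flight_rounds
    }
}
```

Replacing the whole snapshot on every heartbeat is the simplest reading of the description; a production version might also carry a sequence number so that stale snapshots can be rejected.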

Settings

| Setting | Default | Notes |
|---|---|---|
| Heartbeat interval | 5 s | configurable |
| Failover threshold | 15 s (3 missed beats) | configurable |
| Snapshot frequency | every heartbeat | always co-broadcast |
| Election quorum | majority of registered coordinators | uses validator stake weights |
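
The defaults in the settings table can be captured in a small configuration type. The sketch below is illustrative only: `HaSettings`, its field names, and `has_quorum` are hypothetical, and the quorum check shows just one possible reading of "majority ... uses validator stake weights", namely a strict majority of total stake.

```rust
use std::time::Duration;

/// Hypothetical HA settings mirroring the documented defaults.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct HaSettings {
    pub heartbeat_interval: Duration,
    pub failover_threshold: Duration,
}

impl Default for HaSettings {
    fn default() -> Self {
        Self {
            heartbeat_interval: Duration::from_secs(5),
            failover_threshold: Duration::from_secs(15),
        }
    }
}

impl HaSettings {
    /// Whole heartbeat intervals that fit in the failover threshold,
    /// or `None` if the interval is zero.
    pub fn missed_beats_allowed(&self) -> Option<u128> {
        let interval_ms = self.heartbeat_interval.as_millis();
        if interval_ms == 0 {
            None
        } else {
            Some(self.failover_threshold.as_millis() / interval_ms)
        }
    }
}

/// Assumed quorum rule: supporting stake must exceed half of the total stake.
pub fn has_quorum(supporting_stake: u64, total_stake: u64) -> bool {
    // Widen to u128 so doubling the stake cannot overflow.
    (supporting_stake as u128) * 2 > total_stake as u128
}
```

Under these defaults the threshold corresponds to three heartbeat intervals, matching the "3 missed beats" note in the table.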

Scenario test

The kill-the-primary scenario lives in tests/coordinator_tests.rs. It boots a primary plus two passives, kills the primary, and asserts a passive becomes primary with state continuity in under 30 seconds.

This test was the closing acceptance criterion for issue #66.
