When cluster membership changes — a node leaves, joins, or a network partition splits the cluster in two — pgrac must rebuild global shared state in milliseconds to seconds to keep serving. This process is called Reconfiguration, and it is the most complex, highest-cost element of RAC availability design.
Stage 2 ships the minimum closure of Reconfiguration: CSSD heartbeat detects node failure (spec-2.5) → voting disk quorum arbitrates the majority (spec-2.6) → fence-lite self-isolates the minority (spec-2.28) → the reconfig coordinator advances the epoch and broadcasts a ProcSignal (spec-2.29) → backends fail-closed in ProcessInterrupts. The full Drain / Quiesce / Commit / Resume four-phase state machine is deferred to spec-2.31; production hardware fencing (STONITH / SCSI-3 PR) is deferred to Stage 6 production hardening. This chapter describes the current Stage 2 design.
Every number, GUC, SQLSTATE, and view name cited in this chapter is sourced from Stage 2 specs (spec-2.5 / 2.6 / 2.28 / 2.29). Oracle-style terms such as misscount=30s, disktimeout=200s, three-way heartbeat, SCSI-3 PR / STONITH, IMR (Instance Membership Recovery) are not pgrac Stage 2 concepts and are not used here.
Reconfiguration is triggered after CSSD (Cluster Synchronization Service Daemon) detects a membership change. There are four scenarios, all routed through the same minimum-closure path:
| Scenario | Trigger source | Typical latency | Notes |
|---|---|---|---|
| Planned exit | Node announces to CSSD | < 1 s | Graceful; no fence needed |
| Crash / process death | CSSD dead detection (3 × 1 s = 3 s default) | 3–5 s | fence-lite self-isolate + reconfig coordinator epoch++ |
| Network partition | CSSD heartbeat loss + voting disk quorum vote | 5–10 s | Quorum-winning side continues; losing side fail-closed |
| Node hang / IO stall | CSSD dead (3 s) + voting disk lease expire (2 × poll_interval = 4 s default) | 5–10 s | Lease expiry forces quorum_state → LOST |
Under a network partition both sides see the other as gone — the root of split-brain. pgrac defends with two layers: voting disk quorum (§5.3.2) — only the majority-holding side returns cluster_qvotec_in_quorum() == true; and fence-lite (§5.5.1) — the losing side has ClusterFenceFreezePending set, so all in-flight transactions fail-closed in ProcessInterrupts. Either layer alone is sufficient; together they are redundant.
The conceptual model is the classic Freeze / Rebuild / Thaw trilogy, but Stage 2 implements only the minimum closure: spec-2.28 provides the Freeze / Thaw signals, spec-2.29 provides epoch advance and coordinator election. The full multi-phase state machine (drain / quiesce / commit / resume) is deferred to spec-2.31.
Freeze takes effect through two independent and redundant paths:
- Commit-gate (authoritative): the commit boundary checks cluster_qvotec_in_quorum(); if it returns false, the commit boundary raises ereport(ERROR, 53R40 ERRCODE_CLUSTER_QUORUM_LOST). This is the authoritative fail-closed predicate.
- ProcSignal broadcast (latency reducer): when LMON detects quorum OK→LOST it immediately broadcasts PROCSIG_CLUSTER_FREEZE_WRITES; each backend reads ClusterFenceFreezePending in ProcessInterrupts, and if IsTransactionState() is true, raises ereport(ERROR, 53R50 ERRCODE_CLUSTER_QUORUM_LOST_BACKEND). This path is only a latency reducer for backends already in flight; it is not authoritative.

Rebuild in Stage 2 is fundamentally epoch advance + ProcSignal broadcast: the reconfig coordinator (§5.2.1) bumps cluster_epoch from N to N+1 via atomic CAS and broadcasts cluster_reconfig_start_pending to all backends in every in_quorum survivor's ProcArray. Each backend reads the flag in ProcessInterrupts; if it is still in_quorum it raises ereport(ERROR, 53R60 ERRCODE_CLUSTER_RECONFIG_IN_PROGRESS) (retry-safe); if it is no longer in_quorum, 53R50 takes precedence.
Thaw is informational: PROCSIG_CLUSTER_THAW_WRITES only updates last_thaw_at_us; it does not clear ClusterFenceFreezePending and does not change cluster_qvotec_in_quorum(). The commit-gate stays authoritative; Thaw exists for LMON coordination and operational visibility.
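The three error paths described in this section can be summarized as small decision functions. This is a minimal Python sketch of the stated precedence rules only; the function names are illustrative and are not pgrac symbols.

```python
from typing import Optional

def commit_gate(in_quorum: bool) -> Optional[str]:
    """Authoritative path, evaluated at the commit boundary."""
    return None if in_quorum else "53R40"  # CLUSTER_QUORUM_LOST

def freeze_interrupt(in_transaction: bool) -> Optional[str]:
    """Latency-reducing path: freeze flag seen in ProcessInterrupts."""
    # Idle backends absorb the freeze silently.
    return "53R50" if in_transaction else None  # CLUSTER_QUORUM_LOST_BACKEND

def reconfig_interrupt(in_quorum: bool) -> str:
    """Reconfig flag seen in ProcessInterrupts by an in-flight backend."""
    # 53R60 is retry-safe; 53R50 takes precedence once quorum is gone.
    return "53R60" if in_quorum else "53R50"
```

Note that none of these functions has a Thaw input: Thaw only records a timestamp and never changes the outcome.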
Timeline of the Stage 2 minimum closure (typical 3–5 s from T0 to T3):

- T0: CSSD declares the peer DEAD.
- T1: qvotec detects quorum LOST; the freeze signal follows (53R40).
- T2: the coordinator (minimum survivor) is picked, the epoch advances (epoch++), and the PROCSIG is broadcast; the reconfig signal follows (53R60).
- T3: in_quorum is restored (Thaw is informational).
spec-2.29 implements the reconfig coordinator as a stateless deterministic function cluster_reconfig_lmon_tick(), invoked on every LMON daemon tick (cluster.lmon_tick_interval_ms, default 100 ms). It introduces no new aux process, holds no state, and can be re-entered without side effects — after a crash the next tick recomputes everything from scratch.
Each tick executes:
1. If cluster_qvotec_in_quorum() == false, this node does not participate. Return.
2. Compute dead_bitmap from declared CSSD peers (16 bytes, max 128 nodes). If zero, no peer death. Return.
3. Compute alive_set (state ∈ {ALIVE, SUSPECTED}) and survivor_set = alive_set & ~dead_bitmap (plus self if in_quorum).
4. Compute coordinator_node_id = lowest_bit_set(survivor_set): min(survivor_set) restricted to cluster_qvotec_in_quorum() == true.
5. Compute event_id = siphash2_4(dead_bitmap || cssd_dead_generation). If equal to the last applied event, dedup-skip.
6. Every survivor calls cluster_reconfig_broadcast_local_procsig() to all backends in its ProcArray (survivor-broadcast symmetry, Invariant I7).
7. The coordinator calls cluster_reconfig_apply_epoch_bump_as_coordinator(): atomic CAS new_epoch = old_epoch + 1, record changed_at_lsn, publish event with observer_role = 'coordinator'.
8. Every other survivor publishes the event with observer_role = 'survivor' (epoch converges later via IC envelope piggyback).

event_id hashes cssd_dead_generation together with the dead bitmap (using SipHash-2-4) rather than hashing the bitmap alone, because the same dead bitmap can recur via "die → resurrect → die" cycles; cssd_dead_generation (a monotonic counter advanced on every CSSD state flip) provides a second axis for disambiguation. old_epoch is intentionally not hashed, to avoid self-bump loops.
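The per-tick decision can be sketched as a pure function. This is a minimal Python model under stated assumptions, not the spec-2.29 implementation: node sets stand in for bitmaps, the names reconfig_tick, event_id_for and TickResult are hypothetical, and an 8-byte BLAKE2b digest stands in for SipHash-2-4 because the Python standard library does not expose SipHash.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Set

@dataclass
class TickResult:
    coordinator: int
    event_id: int
    role: str  # "coordinator" or "survivor"

def event_id_for(dead: Set[int], dead_generation: int) -> int:
    # Stand-in hash over dead_bitmap || cssd_dead_generation (NOT SipHash-2-4).
    bitmap = sum(1 << n for n in dead).to_bytes(16, "little")  # 128-bit bitmap
    payload = bitmap + dead_generation.to_bytes(8, "little")
    return int.from_bytes(hashlib.blake2b(payload, digest_size=8).digest(), "little")

def reconfig_tick(self_id: int, in_quorum: bool, alive: Set[int], dead: Set[int],
                  dead_generation: int, last_event_id: int) -> Optional[TickResult]:
    if not in_quorum:                       # minority side does not participate
        return None
    if not dead:                            # no peer death
        return None
    survivors = (alive - dead) | {self_id}  # self is included because it is in quorum
    coordinator = min(survivors)            # lowest node id wins
    event_id = event_id_for(dead, dead_generation)
    if event_id == last_event_id:           # dedup-skip an already applied event
        return None
    role = "coordinator" if coordinator == self_id else "survivor"
    return TickResult(coordinator, event_id, role)
```

Because the function keeps no state of its own, every survivor that sees the same inputs derives the same coordinator and the same event_id, which is what makes a re-run after a crash harmless.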
Epoch propagation: every IC envelope (spec-2.3 format) carries the epoch field (offset 12, 8 bytes); the sender writes cluster_epoch_get_current(), and after CRC + auth verification the receiver calls cluster_epoch_observe_remote(). Single-observation jumps are capped at CLUSTER_EPOCH_OBSERVE_MAX_JUMP (default 16) to defend against malicious or corrupted frames.
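The receive side of epoch propagation might look like the following Python sketch. Several details are assumptions the text does not state: the byte order of the epoch field (little-endian here), and what "capped" means for an oversized jump (this sketch ignores the observation instead of clamping it). The function names are illustrative.

```python
import struct

EPOCH_OFFSET = 12   # from the text: epoch field starts at offset 12
MAX_JUMP = 16       # CLUSTER_EPOCH_OBSERVE_MAX_JUMP default

def read_epoch(envelope: bytes) -> int:
    # Assumption: unsigned 64-bit, little-endian.
    return struct.unpack_from("<Q", envelope, EPOCH_OFFSET)[0]

def observe_remote(local_epoch: int, remote_epoch: int,
                   max_jump: int = MAX_JUMP) -> int:
    """Return the local epoch after observing a verified remote epoch."""
    if remote_epoch <= local_epoch:
        return local_epoch            # stale or equal: nothing to learn
    if remote_epoch - local_epoch > max_jump:
        return local_epoch            # assumption: oversized jump is ignored
    return remote_epoch               # adopt the newer epoch
```

In this model the cap bounds how far a single corrupted or malicious frame can move the local epoch, at the cost of needing several observations to catch up after a long isolation.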
View pg_cluster_reconfig_state always returns exactly 1 row (contract P2.9): the never-triggered state shows event_id = 0 / observer_role = 'none' / applied_at IS NULL. Nine columns: event_id / coordinator_node_id / old_epoch / new_epoch / dead_bitmap / applied_at / observer_role / event_seq / cssd_dead_generation.
spec-2.29 deliberately omits an explicit multi-phase state machine. Q1 defines the Stage 2 goal as the "minimum closure": CSSD DEAD → LMON deterministic coordinator → epoch++ → PROCSIG broadcast → ProcessInterrupts fail-closed. A four-phase state machine (Drain / Quiesce / Commit / Resume) is deferred to hypothetical spec-2.31. Stage 2's ClusterReconfigState retains only the last applied event (CLUSTER_RECONFIG_MAX_EVENT_HISTORY = 1); an event-history ring buffer is also spec-2.31+ scope.
CSSD is the head of the Reconfiguration chain. pgrac implements two layers of dead detection: socket-level (spec-2.4, TCP keepalive, worst case 120 s) and application-level (spec-2.5, CSSD heartbeat, default 3 s). CSSD itself only declares state (writes LOG, counters, view rows); it does not trigger reconfig — that decision is the coordinator's next LMON tick (§5.2.1).
| Layer | Implementation | Typical dead-detection time | Responsibility |
|---|---|---|---|
| Socket-level (kernel) | TCP keepalive: SO_KEEPALIVE + TCP_KEEPIDLE/INTVL/CNT | Worst case 60 s idle + 6 × 10 s probe = 120 s | Peer close / link down → EPIPE / ECONNRESET → reconnect |
| Application-level (CSSD) | Broadcast heartbeat envelope every cssd_heartbeat_interval_ms (msg_type 11, 12-byte payload) | 3 × 1000 ms = 3 s (default) | Shmem state + LOG / WARNING + view row; does not trigger reconfig |
CSSD daemon (aux #5, after LMON / LCK / DIAG / Stats) maintains a three-state machine per declared peer: ALIVE → SUSPECTED → DEAD. Any recv immediately reverts to ALIVE (hysteresis recovery). CSSD does not own a TCP fd — it writes to a shmem outbound queue; LMON drains the queue and sends via tier1 IC transport.
| GUC | Default | Range | Notes |
|---|---|---|---|
| cluster.cssd_main_loop_interval_ms | 1000 | 100–60000 | CSSD main loop tick |
| cluster.cssd_heartbeat_interval_ms | 1000 | 100–10000 | Heartbeat broadcast interval |
| cluster.cssd_dead_deadband_factor | 3 | 2–10 | Dead threshold = factor × interval |
Transition math:
- suspected_factor = max(2, deadband_factor - 1) = 2 (default)
- ALIVE → SUSPECTED: suspected_factor × interval = 2 × 1000 = 2000 ms with no recv
- SUSPECTED → DEAD: deadband_factor × interval = 3 × 1000 = 3000 ms with no recv
- Grace window (tracked via ready_at): transitions are suppressed during grace.

View pg_cluster_cssd_peers exposes per-peer state, last recv timestamp, and total recv count. SQLSTATEs: 53R32 ERRCODE_CLUSTER_CSSD_PEER_SUSPECTED (LOG) / 53R33 ERRCODE_CLUSTER_CSSD_PEER_DEAD (WARNING, but does not trigger reconfig; reconfig is the coordinator's job).
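The transition thresholds can be checked with a few lines of Python. This is a sketch of the arithmetic only: peer_state is a hypothetical name, thresholds are assumed to be inclusive, and the grace window is left out.

```python
def peer_state(silence_ms: int, interval_ms: int = 1000,
               deadband_factor: int = 3) -> str:
    """Classify a peer by how long it has been silent."""
    suspected_factor = max(2, deadband_factor - 1)
    if silence_ms >= deadband_factor * interval_ms:
        return "DEAD"
    if silence_ms >= suspected_factor * interval_ms:
        return "SUSPECTED"
    return "ALIVE"   # any recv resets silence to 0, hence back to ALIVE

print(peer_state(2500))                  # SUSPECTED with the defaults
print(peer_state(600, interval_ms=200))  # DEAD: 3 x 200 ms = 600 ms
```

With deadband_factor = 2 the two thresholds coincide, so in this model a peer goes straight from ALIVE to DEAD.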
spec-2.6 introduces an independent daemon cluster_qvotec (Quorum Voting Coordinator, aux #6) for arbitration. Voting disk slot layout: each instance occupies exactly 512 bytes, sector-aligned (but not claimed sector-atomic); torn writes are detected via generation counter + CRC32C.
| GUC | Default | Range | Notes |
|---|---|---|---|
| cluster.voting_disks | "" (empty = qvotec disabled) | CSV paths, 1–5 entries | 3 is the recommended default |
| cluster.quorum_poll_interval_ms | 2000 | 500–30000 | qvotec poll cycle |
| cluster.voting_disk_io_timeout_ms | 5000 | 1000–60000 | Single-disk I/O timeout |
| cluster.voting_disk_size_bytes | 65536 (64 KB) | 4096–1048576 | Slot region size per disk |
Quorum math: quorum_size = (N/2) + 1 using integer division, where N = number of cluster.voting_disks (so 3 disks need 2, 5 disks need 3). A node is in_quorum when:
disks_ok_count >= (disks_total_count / 2) + 1
AND
alive_bitmap_count >= (cluster_node_count / 2) + 1
Four quorum states: INITIALIZING / OK / UNCERTAIN / LOST. Fail-closed semantics — any non-OK state triggers 53R40 / 53R41 at the commit boundary.
Lease defense: cluster_qvotec_in_quorum() returns true only when state == OK and now < lease_expire_at_us, where lease_expire_at_us = last_poll_ts + 2 × poll_interval. The lease guarantees that even if the qvotec process hangs, backends treat quorum as untrustworthy within ~4 s (default) and fail closed.
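The quorum conditions and the lease check combine as in the Python sketch below. Function and parameter names are illustrative, and timestamps are assumed to be microseconds on a single monotonic clock.

```python
def majority(n: int) -> int:
    return n // 2 + 1   # integer division, e.g. 3 -> 2, 5 -> 3

def quorum_conditions_met(disks_ok: int, disks_total: int,
                          alive_nodes: int, node_count: int) -> bool:
    # Both the disk majority and the node majority must hold.
    return (disks_ok >= majority(disks_total)
            and alive_nodes >= majority(node_count))

def lease_expire_at_us(last_poll_ts_us: int, poll_interval_ms: int = 2000) -> int:
    return last_poll_ts_us + 2 * poll_interval_ms * 1000

def in_quorum(state: str, now_us: int, last_poll_ts_us: int,
              poll_interval_ms: int = 2000) -> bool:
    # Fail-closed: any non-OK state or an expired lease means "not in quorum".
    return state == "OK" and now_us < lease_expire_at_us(last_poll_ts_us,
                                                         poll_interval_ms)
```

If the poller stops refreshing last_poll_ts_us, in_quorum turns false 4 s after the last poll with the default interval, without any further action by the hung process.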
Failure handling:
- disks_ok_count == 0 → quorum_state = LOST → fail-closed
- 0 < disks_ok_count < majority → quorum_state = UNCERTAIN → fail-closed
- Single-disk I/O failure → disk_io_failure_inflight, retry next cycle, still form quorum via other disks
- Node ID collision (self.incarnation > slot.incarnation) → self FATAL with 53R43

SQLSTATEs: 53R40 CLUSTER_QUORUM_LOST (commit boundary) / 53R41 CLUSTER_QUORUM_UNCERTAIN (poll inflight) / 53R42 CLUSTER_VOTING_DISK_IO_FAILURE (EIO / EOF / CRC mismatch) / 53R43 CLUSTER_NODE_ID_COLLISION.
Views: pg_cluster_quorum_state (7-column single row) + pg_cluster_voting_disks (one row per disk).
The most complex subtask of Rebuild is merging the failed node's WAL redo. In a pgrac cluster each node maintains an independent WAL stream (pg_wal_node_N/); at failure time, WAL records that have been persisted by the failed node but not yet broadcast to all peers must be read by surviving nodes and replayed in the correct SCN order, to keep GRD block state, PI chains, and ITL slots consistent with WAL.
Why not just serial node-by-node order: each node's WAL stream is monotonic only within that node; SCNs are interleaved across nodes, so there is no natural "replay node 1 then node 2" order. Replaying by node order would break cross-node causal relationships — e.g. a write on node 2 might depend on a prior write from node 1, and the latter must be replayed first.
Correct procedure: perform a k-way merge over all surviving streams (plus the failed node's WAL read from shared storage), ordered by commit_scn (the low-56-bit local_scn portion), to produce a globally causal-ordered replay sequence. Same-SCN ties break on LSN + node_id (the scn_recovery_cmp() API; see Chapter 4 §4.3). Replay the merged sequence in order, and GRD rebuild matches single-node serial-execution semantics.
Node 1 stream: ─●─●─●─●──────●─ (SCN: 42, 43, 50, 51, 61)
Node 2 stream: ─●─●─●────●─●─── (SCN: 12, 44, 45, 55, 60)
Node 3 stream: ─●─●───────────── (SCN: 8, 47)
↓ merge by SCN ↓
Merged: ●─●─●─●─●─●─●─●─●─●─●─● (SCN: 8, 12, 42, 43, 44, 45, 47, 50, 51, 55, 60, 61)
replay order
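The merge itself is a standard k-way merge over per-node sorted streams. The Python sketch below reproduces the example with heapq.merge; WalRecord and stream are hypothetical helpers, LSNs are made-up positions, and plain tuple ordering on (scn, lsn, node_id) stands in for scn_recovery_cmp().

```python
import heapq
from typing import Iterable, List, NamedTuple

class WalRecord(NamedTuple):
    scn: int      # primary sort key
    lsn: int      # first tie-breaker
    node_id: int  # second tie-breaker

def stream(node_id: int, scns: List[int]) -> List[WalRecord]:
    # Hypothetical helper: uses the position in the stream as the LSN.
    return [WalRecord(scn, lsn, node_id) for lsn, scn in enumerate(scns)]

def merge_streams(streams: Iterable[List[WalRecord]]) -> List[WalRecord]:
    # Each input is already sorted, so heapq.merge yields a global order
    # while holding only one head record per stream.
    return list(heapq.merge(*streams))

merged = merge_streams([
    stream(1, [42, 43, 50, 51, 61]),
    stream(2, [12, 44, 45, 55, 60]),
    stream(3, [8, 47]),
])
print([r.scn for r in merged])
# [8, 12, 42, 43, 44, 45, 47, 50, 51, 55, 60, 61]
```

The merge is lazy, so replay can start before any stream has been read to the end.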
PI chain merging accompanies Merged Redo Apply: every block may have multi-version Past Images across nodes; while replaying redo, Rebuild merges PIs held by the failed node into surviving PI chains, preserving GRD's PI chain integrity so later Cache Fusion block transfers can still serve old-version reads correctly.
Total Merged Redo Apply work scales with the failed node's WAL backlog (WAL volume from last checkpoint to failure time) and PI chain depth. The incremental rebuild path (spec-2.31+ scope) replays only WAL belonging to resources mastered by the failed node, cutting Rebuild dramatically.
Zero committed-transaction loss is pgrac Reconfiguration's core promise. Any transaction committed before Freeze has its WAL record persisted to shared storage and is guaranteed to be replayed during the Merged Redo Apply step of Rebuild. Transactions at the commit boundary during Freeze are blocked by cluster_qvotec_in_quorum() and ereport(53R40) — the transaction rolls back, the data is untouched. In-flight transactions are aborted via PROCSIG_CLUSTER_FREEZE_WRITES + ClusterFenceFreezePending (53R50). Transactions waiting for reconfig receive 53R60 (retry-safe).
Split-brain defense stacks two layers:
Layer 1 — voting disk quorum: after a partition, only the side holding majority votes ((N/2)+1) returns cluster_qvotec_in_quorum() == true; the losing side automatically transitions quorum_state → LOST and fires fence-lite.
Layer 2 — fence-lite (§5.5.1): broadcasts PROCSIG_CLUSTER_FREEZE_WRITES + sets ClusterFenceFreezePending + checked in ProcessInterrupts. Either layer alone suffices to block the minority from writing shared storage.
spec-2.28 ships fence-lite (self-fence). When LMON detects quorum_state OK→LOST it immediately broadcasts freeze (no grace). Three components:
- ProcSignal reasons (PROCSIG_CLUSTER_FREEZE_WRITES / PROCSIG_CLUSTER_THAW_WRITES): present in procsignal.h since Stage 0.15+ (October 2024); spec-2.28 only activates the handlers, so there is no ABI break.
- ClusterFenceFreezePending: a volatile sig_atomic_t flag, signal-safely written to 1 in the freeze handler.
- cluster_fence_check_interrupts(): placed after ProcDiePending and before QueryCancelPending; follows the read-clear-then-decide pattern: first clear ClusterFenceFreezePending, then if cluster_freeze_writes_enabled is on and IsTransactionState() is true, ereport(ERROR, 53R50); idle backends silently absorb the signal.

Thaw is informational (PROCSIG_CLUSTER_THAW_WRITES): the handler updates last_thaw_at_us, does not clear ClusterFenceFreezePending, and does not change cluster_qvotec_in_quorum(). The commit-gate remains the authoritative fail-closed predicate (Invariant I2).
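The read-clear-then-decide pattern can be modeled in a few lines of Python. This is an illustration only: the Backend class is hypothetical, the real flag is a sig_atomic_t set from a signal handler, and the real error is raised with ereport.

```python
from typing import Optional

class Backend:
    def __init__(self, in_transaction: bool, freeze_writes_enabled: bool = True):
        self.freeze_pending = False            # models ClusterFenceFreezePending
        self.in_transaction = in_transaction   # models IsTransactionState()
        self.freeze_writes_enabled = freeze_writes_enabled

    def on_freeze_signal(self) -> None:
        # The handler only sets the flag; it does no other work.
        self.freeze_pending = True

    def check_interrupts(self) -> Optional[str]:
        if not self.freeze_pending:
            return None
        self.freeze_pending = False            # clear first, then decide
        if self.freeze_writes_enabled and self.in_transaction:
            return "53R50"                     # in-flight transaction aborts
        return None                            # idle backend absorbs the signal
```

Clearing before deciding means one freeze signal produces at most one error per backend, and a backend that errors out does not keep re-raising on every later interrupt check.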
| GUC | Default | Notes |
|---|---|---|
| cluster.self_fence_enabled | on | Enables the self-fence escalation path |
| cluster.self_fence_grace_ms | 30000 | self-fence grace window |
| cluster.freeze_writes_enabled | on | Enables the ProcessInterrupts in-flight abort path |
| cluster.fence_audit_log | log | off / log / debug |
Self-fence escalation: after LMON sets self_fence_requested_at_us = now, the postmaster's ServerLoop tick calls cluster_fence_postmaster_check(); if now - requested_at_us >= self_fence_grace_ms × 1000, the postmaster self-signals SIGINT (PG's native fast-shutdown path — not a hardware reset).
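The escalation check is a single comparison in microseconds. The Python sketch below shows the unit conversion; the function names are illustrative, and representing "no request pending" as None is an assumption of this sketch.

```python
from typing import Optional

def should_self_fence(now_us: int, requested_at_us: Optional[int],
                      grace_ms: int = 30000, enabled: bool = True) -> bool:
    """True once the grace window since the self-fence request has elapsed."""
    if not enabled or requested_at_us is None:
        return False
    return now_us - requested_at_us >= grace_ms * 1000  # ms -> us

def grace_remaining_ms(now_us: int, requested_at_us: int,
                       grace_ms: int = 30000) -> int:
    """Remaining grace, clamped at zero."""
    elapsed_ms = (now_us - requested_at_us) // 1000
    return max(0, grace_ms - elapsed_ms)
```

With the default 30 s grace, a request stamped at t = 10 s escalates on the first postmaster tick at or after t = 40 s.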
View pg_cluster_fence_state (8-column single row): last_freeze_at / last_thaw_at / self_fence_pending / self_fence_grace_remaining_ms / freeze_broadcast_count / thaw_broadcast_count / self_fence_initiated_count / freeze_signal_received_count.
fence-lite does NOT include: external cluster.fence_command shell GUC, peer-fence actor (kill remote node), IPMI / iLO / vSphere kill-peer integration, SCSI-3 PR / hardware fencing, pgracd supervisor daemon. STONITH and hardware-level fencing are deferred to Stage 6 production hardening (spec-2.0 Q-C locks Stage 2 minimum invariants to quorum-lite + fence-lite + fail-closed). Stage 2's fence-lite suffices for "zero committed-transaction loss + split-brain defense" because quorum failure + commit-gate fail-close form a double safety net.
Uncommitted transactions: after Reconfiguration, all uncommitted transactions on the failed node (xmin uncommitted, CLOG not marked committed) are treated as aborted; surviving nodes complete their rollback during Rebuild, and after Thaw they are invisible in every new snapshot across all nodes.
In the initial implementation, long transactions that span a Reconfiguration receive 53R60 and roll back (retry-safe; one retry typically succeeds). Oracle 11g+ allows long-running transactions to survive Reconfiguration; that capability is spec-2.31+ scope in pgrac. Within Stage 2, transactions that span a Reconfiguration always roll back.
Failure drills (kill -9): during a maintenance window, kill -9 $postmaster_pid on a single node is the most direct Reconfiguration drill. CSSD default 1 s × 3 = 3 s marks the peer DEAD; for faster drills, drop cluster.cssd_heartbeat_interval_ms to 200 ms (dead threshold drops to 600 ms). After the drill:
SELECT event_id, coordinator_node_id, old_epoch, new_epoch,
applied_at, observer_role
FROM pg_cluster_reconfig_state;
SELECT * FROM pg_cluster_fence_state;
SELECT * FROM pg_cluster_quorum_state;
SELECT * FROM pg_cluster_cssd_peers;
Confirm new_epoch = old_epoch + 1, observer_role is coordinator or survivor, and self_fence_initiated_count matches freeze_signal_received_count.
Reconfiguration rate monitoring: unplanned Reconfiguration frequency above normal (e.g. > 3 per hour) is an early signal of network jitter, storage I/O jitter, or backend hang. pg_cluster_reconfig_state currently retains only the last event (CLUSTER_RECONFIG_MAX_EVENT_HISTORY = 1); the event-history ring buffer is spec-2.31+ scope. Operations should track frequency via event_seq deltas over time.
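Since the view keeps only the last event, a monitor has to sample it periodically and difference the counter. A minimal Python sketch, assuming event_seq increases by one per applied event (the text implies a counter but does not state the step) and that samples are (unix_seconds, event_seq) pairs collected by some external poller:

```python
from typing import List, Tuple

def reconfigs_per_hour(samples: List[Tuple[float, int]]) -> float:
    """Average reconfig rate between the oldest and newest sample."""
    (t0, seq0), (t1, seq1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples with increasing timestamps")
    return (seq1 - seq0) * 3600.0 / (t1 - t0)

def is_alarming(samples: List[Tuple[float, int]], threshold: float = 3.0) -> bool:
    # Threshold mirrors the "> 3 per hour" example from the text.
    return reconfigs_per_hour(samples) > threshold

print(reconfigs_per_hour([(0, 10), (1800, 12)]))  # 4.0
```

The sampling interval bounds the resolution: events that occur between two samples are counted but cannot be placed in time more precisely than that interval.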
Application-side retry: error codes the client may see during a brownout and how to handle them:
| SQLSTATE | Meaning | Client action |
|---|---|---|
| 53R40 CLUSTER_QUORUM_LOST | Quorum lost at commit boundary | Transaction rolled back; wait for quorum recovery and retry |
| 53R50 CLUSTER_QUORUM_LOST_BACKEND | In-flight aborted by fence-lite | Same; retry-safe |
| 53R60 CLUSTER_RECONFIG_IN_PROGRESS | In-flight during a reconfig event | Immediately retryable; usually succeeds on first retry |
We recommend exponential backoff on retry: first delay 100 ms, at most 5 retries, maximum interval 5 s. After pgrac epoch advance completes (typically < 1 s), new connections and new transactions can be served immediately; a persistent connection pool (e.g. PgBouncer) configured with server_connect_timeout = 10s covers most brownout windows.
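A client-side retry loop following these numbers could look like the Python sketch below. It is driver-agnostic: ClusterError is a hypothetical exception carrying the SQLSTATE, the doubling factor is an assumption (the text gives only the first delay, the retry count, and the cap), and jitter is omitted for brevity.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")
RETRYABLE = {"53R40", "53R50", "53R60"}

class ClusterError(Exception):
    """Hypothetical wrapper for a driver error exposing its SQLSTATE."""
    def __init__(self, sqlstate: str):
        super().__init__(sqlstate)
        self.sqlstate = sqlstate

def backoff_delays(first_ms: int = 100, max_retries: int = 5,
                   cap_ms: int = 5000) -> List[int]:
    # Doubles each time, never exceeding the cap.
    return [min(first_ms * 2 ** i, cap_ms) for i in range(max_retries)]

def run_with_retry(txn: Callable[[], T],
                   sleep: Callable[[float], None] = time.sleep) -> T:
    delays = backoff_delays()
    for attempt in range(len(delays) + 1):
        try:
            return txn()                       # txn must re-run the whole transaction
        except ClusterError as err:
            if err.sqlstate not in RETRYABLE or attempt == len(delays):
                raise                          # not retryable, or retries exhausted
            sleep(delays[attempt] / 1000.0)
    raise AssertionError("unreachable")
```

With these defaults the delays are 100, 200, 400, 800 and 1600 ms, so the 5 s cap only matters if the retry count or first delay is raised.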
For deeper protocol detail, see:
- master[] shmem table, resource redistribution during Rebuild
- Chapter 6, Wait Events Reference, covers Reconfiguration-related events (the Cluster: Reconfig class, 5 events): Reconfig: GRD rebuild, Reconfig: lock recovery, Reconfig: fence wait, Reconfig: master selection, Reconfig: barrier wait, with their triggers, typical durations, and diagnostic procedures.