When cluster membership changes — a node leaves, joins, or a network partition splits the cluster in two — pgrac must rebuild shared global state within milliseconds to seconds before service can resume. This process is called Reconfiguration, and it is the most complex, most costly step in a RAC architecture's availability design.
Reconfiguration advances serially through three phases: Freeze halts all GES / GCS inbound request processing on every node; Rebuild redistributes resource masters across surviving nodes, reconstructs the GRD, and merges PI chains left behind by the failed node; Thaw activates the new topology and resumes normal backend execution. The typical total elapsed time for all three phases is under 3 seconds (incremental rebuild path), with zero committed-transaction loss throughout. This chapter builds the conceptual framework needed to understand the mechanism: trigger scenarios, the three-phase protocol, the CSSD detection path, Merged Redo Apply, failure guarantees, and operational monitoring practices.
Reconfiguration is triggered by CSSD (Cluster Synchronization Service Daemon) upon detecting a change in cluster membership. Trigger scenarios fall into the four categories summarized in the table below.
Every trigger scenario travels through the same 3-phase state machine (§5.2). The only difference is the scope of reconstruction: a node leaving / crashing / hanging requires taking over all resources mastered by that node; a node joining requires migrating a portion of resource masters into the new node to rebalance load. The initial implementation takes the full-rebuild path in both cases; incremental rebuild is a Phase 2 optimization target.
| Scenario | Trigger source | Typical elapsed time | Notes |
|---|---|---|---|
| Planned node shutdown (maintenance) | Node actively notifies CSSD | < 3 s | Most graceful; fencing not required |
| Node crash (process crash) | CSSD heartbeat timeout + quorum decision | < 5 s | Requires fence confirmation that writes have stopped before entering Rebuild |
| Network partition | Network heartbeat misscount timeout | 5–15 s | Majority side continues service; minority side is fenced |
| Node hang (OS / IO hang) | disktimeout expiry | 10–30 s | disktimeout is wider than misscount; the disk heartbeat must also time out |
In a network partition scenario, both sides believe the other is the minority — this is precisely the root of the split-brain problem. pgrac resolves it via the voting disk quorum mechanism described in §5.5: only the side that wins quorum is permitted to continue Reconfiguration; the losing side may not modify shared state before it is fenced.
The Reconfiguration state machine is implemented in cluster/membership/reconfiguration.c and advances through a four-state cycle: stable → freezing → rebuilding → thawing → back to stable.
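As an orientation aid, here is a minimal sketch of that cycle with hypothetical names (ReconfigState, reconfig_next_state()); the actual definitions and transition conditions live in cluster/membership/reconfiguration.c and are not reproduced here.

```c
/* Hypothetical sketch of the Reconfiguration state cycle; names are
 * illustrative, not the actual reconfiguration.c definitions. */
typedef enum ReconfigState
{
    RECONFIG_STABLE,      /* normal operation, no membership change pending */
    RECONFIG_FREEZING,    /* GES/GCS inbound halted, backends block at CHECK_FROZEN() */
    RECONFIG_REBUILDING,  /* master re-election, GRD rebuild, Merged Redo Apply */
    RECONFIG_THAWING      /* new topology activated, freeze barrier lifted */
} ReconfigState;

/* Advance to the next state; the cycle always returns to STABLE. */
static ReconfigState
reconfig_next_state(ReconfigState cur)
{
    switch (cur)
    {
        case RECONFIG_STABLE:     return RECONFIG_FREEZING;
        case RECONFIG_FREEZING:   return RECONFIG_REBUILDING;
        case RECONFIG_REBUILDING: return RECONFIG_THAWING;
        case RECONFIG_THAWING:    return RECONFIG_STABLE;
    }
    return RECONFIG_STABLE;       /* unreachable */
}
```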
Freeze is the first step entering Reconfiguration, and the most disruptive to running backends. GES / GCS inbound processing on every instance halts — no new global lock requests or Cache Fusion block transfers are accepted. Each backend checks the freeze flag at every critical-path entry point (ReadBuffer, LockAcquire, GetSnapshotData, XLogInsert) via the unified CHECK_FROZEN() macro; if frozen, the backend blocks until Thaw. New transactions are suspended during this period; transactions already holding resources are not interrupted — they simply cannot make progress. The target elapsed time for the Freeze phase is ≤ 2 seconds.
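A minimal sketch of what the freeze barrier could look like at those entry points; the shared flag, wait primitive, and macro body here are assumptions, not the actual CHECK_FROZEN() definition.

```c
/* Hypothetical sketch of the freeze barrier; the shared flag, the wait
 * mechanism, and the macro body are assumptions, not the real macro. */
#include <stdatomic.h>

extern _Atomic int ClusterFrozen;   /* set by the reconfiguration coordinator */
extern void WaitForThaw(void);      /* sleeps until the thaw broadcast arrives */

/* Checked at ReadBuffer, LockAcquire, GetSnapshotData, XLogInsert entry. */
#define CHECK_FROZEN() \
    do { \
        while (atomic_load(&ClusterFrozen)) \
            WaitForThaw(); \
    } while (0)
```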
Rebuild is the substantive state reconstruction. Surviving nodes negotiate over the Interconnect to re-elect a master node for each resource (redistributing resources across nodes); after GRD metadata is rebuilt, PI chains left behind by the failed node are merged in and Merged Redo Apply is executed (§5.4); uncommitted transactions on the failed node are rolled back or completed by surviving nodes on their behalf. The target elapsed time is ≤ 10 seconds for the full rebuild, ≤ 3 seconds for the incremental rebuild (Phase 2).
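The text does not fix the re-election policy; one common choice is to hash the resource key over the surviving node list, sketched here under that assumption with hypothetical names.

```c
/* Hypothetical sketch: deterministic re-mastering of a resource after a
 * membership change. The placement policy (hash modulo surviving nodes)
 * is an assumption; the text only says masters are re-elected and
 * redistributed across survivors. */
#include <stdint.h>

typedef struct ResourceKey
{
    uint32_t dbid;
    uint32_t relid;
    uint32_t blockno;
} ResourceKey;

static uint32_t
resource_hash(const ResourceKey *key)
{
    /* Simple multiplicative mixing; a real implementation would reuse the GRD hash. */
    uint32_t h = key->dbid * 2654435761u;
    h ^= key->relid * 2246822519u;
    h ^= key->blockno * 3266489917u;
    return h;
}

/* Returns the node id that masters this resource in the new topology. */
static int
elect_master(const ResourceKey *key, const int *surviving_nodes, int n_surviving)
{
    return surviving_nodes[resource_hash(key) % n_surviving];
}
```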
Thaw activates the new topology: each backend's local resource-master cache is force-invalidated; the freeze barrier is lifted and CHECK_FROZEN() resumes pass-through; GES / GCS inbound processing resumes accepting requests; blocked and newly arriving transactions continue execution. The target elapsed time for the Thaw phase is ≤ 2 seconds.
```
T0 ───────────── T1 ───────────── T2 ───────────── T3
 │                │                │                │
 ↓                ↓                ↓                ↓
Detect           Freeze           Rebuild          Thaw
heartbeat        GES/GCS          master           new topology
lost,            inbound halt,    re-election,     activates,
fence            backends         GRD rebuild,     backends
triggered        block            PI chain merge   resume
                  |<───────── typical < 3 s ───────>|
```
Phase time budget (initial full-rebuild path):
| Phase | Target elapsed | Timeout behavior |
|---|---|---|
| freezing | ≤ 2 s | On timeout, CSSD escalates to full cluster restart |
| rebuilding | ≤ 10 s (full) / ≤ 3 s (incremental, Phase 2) | On timeout, escalates to cluster restart |
| thawing | ≤ 2 s | On timeout, degrades to node restart |
Brownout is the service interruption visible to applications during Reconfiguration — the elapsed time from T1 to T3 (Freeze inbound halt to Thaw completion). The target total brownout across all three phases is ≤ 15 seconds (full rebuild), improving to ≤ 3 seconds after incremental rebuild optimization. The pg_stat_cluster_reconfig view records the actual elapsed time for each phase of every Reconfiguration, for operational monitoring and SLA verification.
CSSD detection is the starting point of the entire Reconfiguration chain. CSSD maintains three independent heartbeat paths simultaneously; only when a path exceeds its corresponding threshold and quorum decides in favor does CSSD fire a Reconfiguration signal:

- Network heartbeat: periodic messages exchanged over the Interconnect; consecutive loss beyond the misscount threshold marks the network path unreachable.
- Disk heartbeat: each node periodically refreshes its voting disk slot timestamp; staleness beyond the disktimeout threshold marks the node as IO-hung or dead.
- Local (watchdog) heartbeat: /dev/watchdog or IPMI/BMC confirms the OS layer has not frozen (guards against the scenario where a process is still responding to network heartbeats but the DB subsystem has already died).

After any of the three heartbeats times out, CSSD initiates quorum arbitration (#2); once the minority side is determined, the majority side triggers fencing (#3); only after fence completion is confirmed does Reconfiguration enter its Freeze phase. This serial dependency guarantees that when the Rebuild phase begins, the failed node has already been forcibly isolated and can no longer write to shared storage.
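A minimal sketch of that serial dependency, using illustrative function names (quorum_arbitrate(), fence_minority(), begin_freeze()); the only property it encodes is the ordering — quorum first, fence confirmation second, Freeze only after both.

```c
/* Hypothetical sketch of the detection-to-Freeze ordering; function names
 * are illustrative and not part of the actual CSSD implementation. */
#include <stdbool.h>

extern bool quorum_arbitrate(void);         /* true if this side wins the voting disk quorum */
extern bool fence_minority(void);           /* SCSI-3 PR or STONITH; true once writes are provably stopped */
extern void begin_freeze(void);             /* enters the Freeze phase of Reconfiguration */
extern void stop_all_writes_and_exit(void); /* losing side must never touch shared state again */

static void
on_heartbeat_timeout(void)
{
    if (!quorum_arbitrate())
    {
        stop_all_writes_and_exit();
        return;
    }
    /* Winning side: Freeze must not start until the fence is confirmed,
     * otherwise the failed node could still write during Rebuild. */
    if (fence_minority())
        begin_freeze();
}
```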
CSSD uses two independent thresholds to prevent single-path heartbeat false positives from causing frequent brownouts:
misscount (default 30 seconds): when network heartbeats are lost consecutively beyond this threshold, CSSD considers the network path unreachable. Suited to rapid detection of process crashes or network disconnections, but with a tolerance window for transient jitter (GC pauses, brief network flaps).
disktimeout (default 200 seconds): when a node's disk heartbeat (voting disk slot timestamp) has not been updated beyond this threshold, CSSD considers the node to have an IO hang or to be dead. disktimeout is significantly wider than misscount because disk IO is itself subject to storage latency and OS scheduling, and brief delays do not indicate node loss.
The combined logic of the two thresholds: the network heartbeat acts first; after misscount fires, quorum arbitration begins. If quorum cannot decide (for example, two of three nodes lose network connectivity but disk heartbeats remain alive), the cluster waits for disktimeout to expire. disktimeout expiry is the "final verdict" — at that point, regardless of network state, the node is considered unreachable and fencing is executed. Both thresholds are tunable via GUC; the Oracle equivalents serve as the default starting values.
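A minimal sketch of this two-threshold evaluation; the struct and function names are assumptions (the drill section below names the cssd.misscount GUC, and cssd.disktimeout is an assumed analogue).

```c
/* Hypothetical sketch of the dual-threshold heartbeat verdict. */
#include <stdbool.h>
#include <time.h>

typedef struct NodeHeartbeat
{
    time_t last_network_hb;   /* last network heartbeat received from the node */
    time_t last_disk_hb;      /* last voting-disk slot timestamp written by the node */
} NodeHeartbeat;

typedef enum HbVerdict
{
    HB_ALIVE,                 /* both paths within threshold */
    HB_NETWORK_LOST,          /* misscount exceeded: start quorum arbitration */
    HB_NODE_DEAD              /* disktimeout exceeded: final verdict, fence the node */
} HbVerdict;

static HbVerdict
evaluate_heartbeats(const NodeHeartbeat *hb, time_t now,
                    int misscount_sec, int disktimeout_sec)
{
    if (now - hb->last_disk_hb > disktimeout_sec)
        return HB_NODE_DEAD;
    if (now - hb->last_network_hb > misscount_sec)
        return HB_NETWORK_LOST;
    return HB_ALIVE;
}
```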
The prerequisite for the Freeze phase to begin is fence completion (or, when disktimeout expires with no fence possible, escalation handling). This prerequisite constrains the minimum Reconfiguration trigger latency: even if CSSD detects a process crash immediately, it must wait for the fence signal (via SCSI-3 PR or STONITH) to be confirmed before entering the Freeze phase — preventing "ghost writes" to shared storage by the failed node during Rebuild.
The most complex subtask within the Rebuild phase is merging the WAL redo of the failed node. In a pgrac cluster each node maintains an independent WAL stream; at failure time, WAL records that were persisted on the failed node but not yet broadcast to all nodes must be read by surviving nodes and merged for replay in correct SCN order, so that block state in the GRD, PI chains, and ITL slots remain consistent with WAL.
Why simple serial ordering does not work: each node's WAL stream is monotonically increasing only within that node; SCN values from different nodes are interleaved, and there is no natural "replay node 1 first, then node 2" ordering. Replaying in per-node serial order would break cross-node transactional causality — for example, a write on node 2 that depends on a preceding write on node 1 must be replayed after the latter, not before.
The correct approach — Merged Redo Apply: perform a k-way merge across the streams of all surviving nodes (plus the failed node's WAL read from shared storage), sorted by commit_scn (the low 56 bits, the local_scn portion), producing a globally causal-order replay sequence. Tie-breaking among identical commit_scn values uses LSN + node_id as a secondary sort (corresponding to the scn_recovery_cmp() API; see Chapter 4 §4.3). The merged sequence is replayed record by record in order, guaranteeing that the GRD rebuild result is equivalent to serial single-node execution semantics.
```
Node 1 stream: ─●─●─●─●──────●─     (SCN: 42, 43, 50, 51, 61)
Node 2 stream: ─●─●─●────●─●───     (SCN: 12, 44, 45, 55, 60)
Node 3 stream: ─●─●─────────────    (SCN: 8, 47)

               ↓ sort-merge by SCN ↓

Merged:  ●─●─●─●─●─●─●─●─●─●─●─●    (SCN: 8, 12, 42, 43, 44, 45, 47, 50, 51, 55, 60, 61)
                                     replay order
```
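A sketch of how the sort-merge could look in code, assuming an illustrative record layout and stream representation; the actual comparator is scn_recovery_cmp() from Chapter 4 §4.3.

```c
/* Hypothetical sketch of Merged Redo Apply's k-way merge. The record and
 * stream layouts are illustrative; the real comparator is scn_recovery_cmp()
 * (commit_scn first, then LSN + node_id as tie-breaker), per Chapter 4 §4.3. */
#include <stdint.h>
#include <stddef.h>

typedef struct RedoRecord
{
    uint64_t commit_scn;     /* low 56 bits: the local_scn portion */
    uint64_t lsn;
    int      node_id;
} RedoRecord;

typedef struct RedoStream
{
    const RedoRecord *recs;  /* one node's WAL records, already in local order */
    size_t nrecs;
    size_t pos;
} RedoStream;

static int
redo_cmp(const RedoRecord *a, const RedoRecord *b)
{
    if (a->commit_scn != b->commit_scn)
        return (a->commit_scn < b->commit_scn) ? -1 : 1;
    if (a->lsn != b->lsn)
        return (a->lsn < b->lsn) ? -1 : 1;
    return a->node_id - b->node_id;
}

/* Repeatedly pick the stream whose head record has the smallest key and
 * replay it, yielding one globally causal-order sequence. */
static void
merged_redo_apply(RedoStream *streams, int nstreams,
                  void (*replay)(const RedoRecord *))
{
    for (;;)
    {
        int best = -1;

        for (int i = 0; i < nstreams; i++)
        {
            if (streams[i].pos >= streams[i].nrecs)
                continue;
            if (best < 0 ||
                redo_cmp(&streams[i].recs[streams[i].pos],
                         &streams[best].recs[streams[best].pos]) < 0)
                best = i;
        }
        if (best < 0)
            break;             /* all streams exhausted */
        replay(&streams[best].recs[streams[best].pos]);
        streams[best].pos++;
    }
}
```

With only a handful of streams (one per surviving node plus the failed node's WAL), a linear scan per record is sufficient; a min-heap would only matter at much larger stream counts.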
PI chain merge is the companion operation to Merged Redo Apply: a block may have multiple Past Image versions spread across nodes; during the Rebuild phase, as redo is replayed, PI versions held by the failed node are merged into the PI chains of surviving nodes, maintaining the completeness of PI chains in the GRD and ensuring subsequent Cache Fusion block transfers can correctly serve old-version read requests.
The total work of Merged Redo Apply depends on the volume of WAL backlog on the failed node (the WAL produced from the last checkpoint to the time of failure) and the length of the PI chains. On the incremental rebuild path, only WAL for resources mastered by the failed node needs to be replayed, substantially reducing Rebuild phase elapsed time.
Zero committed-transaction loss is the most fundamental promise of pgrac Reconfiguration. Any transaction that committed before Freeze has already had its WAL record persisted to shared storage (the redo log), and it will necessarily be replayed during the Rebuild phase's Merged Redo Apply. Transactions that commit during Freeze are blocked, not discarded — backends wait at CHECK_FROZEN(), and the commit path continues after Thaw.
Split-brain protection is implemented through two layered mechanisms:
The first layer is voting disk quorum (#2): after a cluster partition, only the side holding the majority of votes (typically ≥ N/2+1 nodes) can win quorum; nodes on the losing side have no right to initiate Reconfiguration and automatically stop writing.
The second layer is fencing (#3): SCSI-3 Persistent Reservation rejects the minority side's I/O requests at the storage HBA layer (RESERVATION_CONFLICT), so even if minority-side processes are still running they cannot write to shared storage; STONITH serves as a backstop, forcibly restarting minority-side physical machines via IPMI/BMC when SCSI-3 PR is unavailable.
The two protection layers are mutually redundant: either one taking effect prevents split-brain writes. Only after fence completion (or after disktimeout confirms the node is dead) is the Rebuild phase permitted to begin modifying the GRD.
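The first layer's majority test is simple arithmetic; a minimal sketch follows, with vote counting against the voting disk reduced to a node count for illustration.

```c
/* Minimal sketch of the majority test used by the first protection layer. */
#include <stdbool.h>

static bool
has_quorum(int votes_held, int total_votes)
{
    /* Majority means strictly more than half: >= N/2 + 1.
     * Example: a 3-node cluster split 2/1 — has_quorum(2, 3) is true,
     * has_quorum(1, 3) is false, so only the 2-node side reconfigures. */
    return votes_held >= total_votes / 2 + 1;
}
```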
Handling uncommitted transactions: after Reconfiguration completes, all uncommitted transactions on the failed node (XIDs not marked committed in CLOG) are treated as aborted; surviving nodes complete the rollback of these transactions during Rebuild, and in snapshots taken after Thaw they are invisible to all nodes.
In the initial implementation, long-running transactions that span a Reconfiguration are interrupted rather than resumed. Backends block during Freeze, but if Rebuild runs beyond a threshold (default 30 seconds), incomplete transactions are forcibly terminated and an error is returned to the client. This is an engineering trade-off — transaction continuity is sacrificed for implementation simplicity, while correctness is preserved. Oracle 11g+ supports long transactions resuming after Reconfiguration; this capability is a pgrac Phase 2 optimization target.
kill -9 drills: within a planned maintenance window, sending kill -9 $postmaster_pid to a single node is the most direct way to drill Reconfiguration. With the default misscount threshold of 30 seconds and no reduction configured, the drill will wait approximately 30 seconds before triggering; for rapid drills on a test cluster, lower cssd.misscount to 5 seconds. After the drill, observe the reconfig_count, last_reconfig_duration_ms, and per-phase elapsed fields in the pg_stat_cluster_reconfig view to confirm brownout is within the target SLA.
Reconfiguration frequency monitoring: an abnormal rate of unplanned Reconfigurations (for example, more than 3 in one hour) is an early signal of network jitter, storage I/O jitter, or backend hangs. Recommended alert rule:
```sql
-- Query the number of reconfigs and the most recent brownout duration
SELECT reconfig_count, last_reconfig_at, last_reconfig_duration_ms
FROM pg_stat_cluster_reconfig;
```
reconfig_freeze_duration_ms and reconfig_rebuild_duration_ms track the two most expensive phases individually; if rebuild elapsed time trends upward persistently, it typically indicates PI chain backlog or insufficient Interconnect bandwidth, requiring checkpoint frequency optimization or network capacity expansion.
Application-layer retry: clients that receive a connection drop or timeout error (not a SQL error) during brownout should implement exponential backoff retry: first retry after 100 ms, up to 5 retries, maximum interval 5 seconds. After pgrac Thaw completes, new connections can be served immediately; persistent connection pools (such as PgBouncer) configured with server_connect_timeout = 10s cover the majority of brownout windows. Applications should not conflate connection drops caused by Reconfiguration with business-logic errors — Reconfiguration is a normal cluster self-healing behavior, and data state is consistent after the client retries.
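A hypothetical client-side sketch of the backoff schedule above using libpq; the connection handling and reporting are illustrative, not a prescribed client library.

```c
/* Hypothetical retry sketch matching the schedule above: first retry after
 * 100 ms, doubling per attempt, capped at 5 s, at most 5 retries. */
#include <libpq-fe.h>
#include <stdio.h>
#include <unistd.h>

static PGconn *
connect_with_backoff(const char *conninfo)
{
    long delay_ms = 100;

    /* attempt 0 is the initial try; attempts 1..5 are the retries. */
    for (int attempt = 0; attempt <= 5; attempt++)
    {
        PGconn *conn = PQconnectdb(conninfo);

        if (PQstatus(conn) == CONNECTION_OK)
            return conn;                 /* Thaw completed, service resumed */

        fprintf(stderr, "attempt %d failed: %s", attempt, PQerrorMessage(conn));
        PQfinish(conn);

        usleep(delay_ms * 1000);         /* brownout window: back off and retry */
        delay_ms = (delay_ms * 2 > 5000) ? 5000 : delay_ms * 2;
    }
    return NULL;                         /* brownout exceeded the retry budget */
}
```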
For deep protocol details, see the following resources:

- Detailed design material covering the CHECK_FROZEN() macro injection points, state machine transition conditions, inter-phase handshake message formats, pg_stat_cluster_reconfig field definitions, and the incremental rebuild algorithm design.
- Chapter 6 — Wait Events Reference, which will cover wait events related to Reconfiguration in detail, including trigger conditions, typical durations, and diagnostic methods for events such as reconfig_freeze (backend blocked at the freeze barrier), reconfig_rebuild (waiting on GRD rebuild), and reconfig_thaw (waiting for the new topology to activate).