SCN (System Change Number) is the Lamport clock pgrac uses for cross-node transaction ordering. Each node independently maintains a 64-bit local counter; transactions atomic-increment it on commit (spec-1.16), and any received envelope carrying a remote SCN drives a CAS-Lamport ≥ max advance (spec-2.4). This rule serves three roles inside a pgrac cluster: it provides a causal baseline for MVCC visibility decisions, stamps WAL commit/abort records with a global timestamp (spec-1.18 xl_scn), and provides causal ordering for every IC envelope (spec-2.4 envelope.scn at offset 20).
This chapter annotates each concept with its Stage status. Stage 1 (spec-1.15 / 1.16 / 1.17 / 1.18 / 1.19) is shipped: 8-byte encoding, three cmp functions, single-node advance/observe, optional xl_scn on commit/abort WAL records, walwriter BOC tick, and the xlp_thread_id placeholder in the WAL page header. Stage 2 Phase 2.C (spec-2.4 / 2.9 / 2.10 / 2.11 / 2.12) wires these ABIs into the cross-node IC plane: envelope.scn piggyback is active, LMON-mediated BOC broadcast is active, the commit_scn cross-instance lookup exists as a skeleton, and the SCN convergence boundary GUC is in place. Persistence and safety_gap appear in §8 of the SCN protocol design doc but are explicitly deferred by spec-1.16 §1.3; the current crash-recovery path rebuilds SCN by WAL replay (spec-1.18), with no separate persistence file. This chapter respects that boundary.
PG single-instance uses LSN (Log Sequence Number) to identify WAL positions: an LSN is the byte offset within WAL files, monotonically increasing within a single instance, with clear semantics. But LSN carries no "transaction ordering" information — PG's transaction visibility is entirely determined by xmin / xmax plus CLOG; LSN is used only to identify the WAL persistence position and has no relation to "which transaction committed first."
In a pgrac cluster, this mechanism exposes two fundamental shortcomings:
First, LSN is not comparable across nodes. WAL offset 0/4A3F8B0 on node 1 and 0/4A3F8B0 on node 2 are completely independent values; they reflect no causal relationship. If a read transaction on node 2 needs to determine whether a particular commit on node 1 occurred before its own snapshot, LSN alone provides no answer.
Second, xmin / xmax visibility is local to each node. In a cluster, a transaction's XID is assigned on the commit node; other nodes cannot determine its full visibility from CLOG alone — this requires additional cross-node coordination at non-trivial cost.
pgrac follows Oracle RAC's reference design and introduces SCN as the cluster-wide Lamport clock. Each commit is assigned an SCN, written into the WAL record, the block's pd_block_scn, and the ITL slot's commit_scn; a read transaction's snapshot records the SCN current at the moment the snapshot is taken and uses it as its visibility baseline. As a result, a read transaction on any node only needs to compare the tuple's commit_scn against its own snapshot SCN to complete a visibility decision locally, without querying the commit node.
LSN and SCN coexist with non-overlapping responsibilities in pgrac: LSN continues to identify WAL physical positions (used for recovery, replication slots, and checkpoint management); SCN handles cross-node transaction causal ordering and MVCC visibility decisions. Both are written into the WAL record header, but their semantics are entirely independent.
The pgrac SCN is a 64-bit (8-byte) unsigned integer divided into two fields:
- High 8 bits: node_id, values 0–255, identifying the node that produced this SCN. The cluster supports at most 256 nodes.
- Low 56 bits: local_scn, the node's local Lamport counter value.

This encoding has two important properties. First, global uniqueness is guaranteed by construction: different nodes have different node_id values, so their high 8 bits differ and identical 64-bit SCN values cannot be produced. Second, ordering comparisons use only the low 56 bits: visibility decisions compare the local_scn portion and discard the node_id high bits. Because the high bits reflect the source node rather than causal order, including them in magnitude comparisons would corrupt the happens-before semantics of the Lamport clock.
    +-------+--------------------------+
    | node  |        local SCN         |
    | 8 bit |          56 bit          |
    +-------+--------------------------+
        ↑                ↑
    256 nodes max   ~22.8K years @ 100K events/sec
The 56-bit local counter can represent approximately 72 quadrillion values (2^56 ≈ 7.2 × 10^16). In an OLTP cluster, roughly 100K commits plus 100K piggyback advances occur per second, approximately 200K SCNs per second, giving a theoretical overflow time of roughly 11,400 years (2^56 / 200,000 per second ≈ 3.6 × 10^11 seconds). This figure is not a design margin; it is the direct consequence of the encoding choice: the 8-bit node_id supports up to 256 nodes, which has been proven sufficient for a single cluster, and all remaining 56 bits are used as the counter.
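The overflow horizon follows directly from the counter width and the assumed event rate. A minimal sketch of that arithmetic is shown here; the helper name and the Julian-year constant (365.25 days) are assumptions made for illustration and are not part of pgrac.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds in a Julian year (365.25 days); an assumption for this estimate. */
#define SECONDS_PER_YEAR UINT64_C(31557600)

/* Number of distinct values a 56-bit counter can take: 2^56. */
#define LOCAL_SCN_SPACE (UINT64_C(1) << 56)

/*
 * Whole years until a 56-bit counter wraps when it is bumped
 * scns_per_second times every second (integer division, rounds down).
 */
static uint64_t
years_until_local_scn_wraps(uint64_t scns_per_second)
{
    assert(scns_per_second > 0);    /* avoid division by zero */
    return LOCAL_SCN_SPACE / scns_per_second / SECONDS_PER_YEAR;
}
```

Calling years_until_local_scn_wraps(200000) evaluates the 200K SCNs per second case discussed above; halving the rate doubles the horizon.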
InvalidScn = 0 is a protocol-reserved sentinel meaning "not yet set," aligned with PG's InvalidTransactionId = 0 convention. All real SCN values are ≥ 1, so zero can safely be used for zero-initialized struct members.
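The 8/56 split and the zero sentinel can be sketched as plain bit operations. The type alias, macro names, and sketch_ function names below are hypothetical stand-ins; the real pgrac definitions (for example scn_encode) may differ in name and signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Scn;                   /* hypothetical alias for the 8-byte SCN */

#define INVALID_SCN     UINT64_C(0)     /* stands in for InvalidScn */
#define SCN_LOCAL_BITS  56
#define SCN_LOCAL_MASK  ((UINT64_C(1) << SCN_LOCAL_BITS) - 1)

/* Pack node_id into the high 8 bits and local_scn into the low 56 bits. */
static Scn
sketch_scn_encode(uint8_t node_id, uint64_t local_scn)
{
    assert(local_scn <= SCN_LOCAL_MASK);    /* counter must fit in 56 bits */
    return ((uint64_t) node_id << SCN_LOCAL_BITS) | local_scn;
}

/* Extract the producing node from the high 8 bits. */
static uint8_t
sketch_scn_node_id(Scn scn)
{
    return (uint8_t) (scn >> SCN_LOCAL_BITS);
}

/* Extract the Lamport counter from the low 56 bits. */
static uint64_t
sketch_scn_local(Scn scn)
{
    return scn & SCN_LOCAL_MASK;
}

/* Zero is the "not yet set" sentinel, so zero-initialized structs read as invalid. */
static bool
sketch_scn_is_valid(Scn scn)
{
    return scn != INVALID_SCN;
}
```

Two nodes that reach the same local counter value still produce distinct 64-bit SCNs, because their high bytes differ.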
SCN advances according to the classic Lamport clock rules. pgrac implements three paths that touch local_scn: two advance it, and one only reads it.
Path 1 — Local commit: on transaction commit, the node atomically increments local_scn by one to obtain the new commit SCN, which is written into the WAL commit record, the TT slot, and the ITL slot. The commit SCN encodes the current node's node_id to form a complete 64-bit SCN.
Path 2 — Receiving an external SCN (BOC or Piggyback): upon receiving an SCN from a remote node, the node executes local_scn = max(local_scn, remote.local_scn) + 1. This max+1 operation ensures that all SCNs produced by this node from that point forward are causally later than the received message — this is precisely the happens-before guarantee of the Lamport clock.
Path 3 — Stamping xl_scn at WAL write time: when inserting a WAL record, the current local_scn is read and written into the record header's xl_scn field. This operation does not advance local_scn; it only snapshots the current value. Advances occur only on commit and piggyback.
    Node 1: ─●────●─────────────●─────   (commit @ 42)
                   ↘ BOC(43)
    Node 2: ───────●●──────────●──────   (recv → max(12,43)+1 = 44)
                     ↘ BOC(44)
    Node 3: ────────────●─────●───────   (recv → max(8,44)+1 = 45)
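The three paths can be sketched with C11 atomics. This is an illustration under stated assumptions, not the pgrac implementation: the struct and the sketch_ function names are hypothetical, the counter holds only the 56-bit local part, and the observe rule is the max+1 form given for Path 2.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SCN_LOCAL_MASK ((UINT64_C(1) << 56) - 1)

/* Hypothetical stand-in for the shared-memory counter. */
typedef struct SketchScnState
{
    _Atomic uint64_t local_scn;
} SketchScnState;

/* Path 1: atomic +1 on commit; returns the new local_scn. */
static uint64_t
sketch_advance_for_commit(SketchScnState *st)
{
    return atomic_fetch_add(&st->local_scn, 1) + 1;
}

/* Path 2: local_scn = max(local_scn, remote_local) + 1 via a CAS retry loop. */
static uint64_t
sketch_observe(SketchScnState *st, uint64_t remote_local)
{
    uint64_t cur = atomic_load(&st->local_scn);
    uint64_t next;

    assert(remote_local < SCN_LOCAL_MASK);  /* leave room for the +1 */
    do
    {
        next = (remote_local > cur ? remote_local : cur) + 1;
        /* On failure, cur is reloaded with the value another thread stored. */
    } while (!atomic_compare_exchange_weak(&st->local_scn, &cur, next));
    return next;
}

/* Path 3: read-only snapshot for stamping xl_scn; never advances. */
static uint64_t
sketch_current(SketchScnState *st)
{
    return atomic_load(&st->local_scn);
}
```

With a counter at 12, observing 43 yields 44, and a counter at 8 observing 44 yields 45, matching the values in the diagram.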
The Lamport rule provides causal consistency, not total order. Two concurrent transactions (transactions with no causal relationship) may receive SCNs whose relative magnitude says nothing about which one happened first, but this is acceptable for MVCC visibility decisions: snapshot isolation correctness does not depend on a total order of concurrent transactions; it only requires that causally related events be ordered.
Cross-node SCN comparison has three semantics, each with its own dedicated API: temporal comparison (the visibility path) compares only the low 56 bits of local_scn, using scn_time_cmp(); total-order comparison (ITL slot ordering, deadlock detection) includes the high 8-bit node_id, using scn_total_cmp(); and recovery merge comparison adds a secondary tie-break of LSN + node_id when local_scn values are equal, using scn_recovery_cmp(). Application code must never compare SCN values as bare uint64 magnitudes.
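A minimal sketch of these comparison flavours follows. The signatures are assumptions: the source names the functions but not their parameters, so the sketch_ prefix marks them as illustrative. The choice to compare local_scn first and use node_id only as a tie-break in the total order, the LSN-then-node_id tie-break order in the recovery comparison, and the "commit_scn ≤ snapshot SCN" visibility test are likewise assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Scn;
#define SCN_LOCAL_MASK ((UINT64_C(1) << 56) - 1)

/* Three-way compare of two unsigned values: -1, 0, or 1. */
static int
cmp_u64(uint64_t a, uint64_t b)
{
    return (a > b) - (a < b);
}

/* Temporal comparison: low 56 bits only; node_id is ignored. */
static int
sketch_scn_time_cmp(Scn a, Scn b)
{
    return cmp_u64(a & SCN_LOCAL_MASK, b & SCN_LOCAL_MASK);
}

/* Total order: local_scn first, node_id breaks ties (assumed ordering). */
static int
sketch_scn_total_cmp(Scn a, Scn b)
{
    int c = sketch_scn_time_cmp(a, b);

    return c != 0 ? c : cmp_u64(a >> 56, b >> 56);
}

/* Recovery merge: local_scn, then LSN, then node_id (assumed tie-break order). */
static int
sketch_scn_recovery_cmp(Scn a, uint64_t lsn_a, Scn b, uint64_t lsn_b)
{
    int c = sketch_scn_time_cmp(a, b);

    if (c == 0)
        c = cmp_u64(lsn_a, lsn_b);
    return c != 0 ? c : cmp_u64(a >> 56, b >> 56);
}

/* Visibility sketch: a commit is visible if it is not later than the snapshot. */
static bool
sketch_commit_visible(Scn commit_scn, Scn snapshot_scn)
{
    assert(snapshot_scn != 0);      /* a snapshot always carries a real SCN */
    return commit_scn != 0 && sketch_scn_time_cmp(commit_scn, snapshot_scn) <= 0;
}
```

An SCN from node 2 with local_scn 100 is numerically larger as a bare uint64 than one from node 1 with local_scn 101, yet it is temporally earlier, which is why raw magnitude comparison is unsafe.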
SCN propagates through the cluster via two complementary paths. The current Stage 2 implementation differs from the early SCN design doc (2026-04-25 v1.0): what the design doc calls "piggyback in CF / GES messages" is, in Stage 2, carried at the IC envelope frame level (spec-2.4).
Envelope Piggyback (spec-2.4, active): every IC envelope carries an 8-byte scn field at offset 20. The sender populates it on build via cluster_scn_current(). After framing + epoch + CRC verification, the receiver calls cluster_ic_envelope_observe_scn() → cluster_scn_observe(), which performs a CAS-bump under the Lamport ≥ strict-boundary rule (spec-2.4 §2.7; spec-1.16 lock-free CAS retry loop). This path consumes no extra messages — any cross-node traffic (heartbeat / GES / SCN broadcast / sinval / reconfig) automatically drives SCN convergence.
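The piggyback round trip can be sketched on a raw byte buffer. Only the 8-byte field at offset 20 comes from the text; the 28-byte frame length, host byte order, the sketch_ helper names, and the non-atomic max+1 update are simplifying assumptions, and the framing, epoch, and CRC checks are reduced to a single boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ENVELOPE_SCN_OFFSET 20      /* from the text: 8-byte scn at offset 20 */
#define SKETCH_ENVELOPE_LEN 28      /* hypothetical: just enough to hold the scn */
#define SCN_LOCAL_MASK ((UINT64_C(1) << 56) - 1)

/* Sender side: stamp the current SCN into the frame header. */
static void
sketch_envelope_set_scn(uint8_t *frame, uint64_t scn)
{
    assert(frame != NULL);
    /* memcpy because offset 20 is not 8-byte aligned */
    memcpy(frame + ENVELOPE_SCN_OFFSET, &scn, sizeof(scn));
}

/* Read the SCN field back out of a frame. */
static uint64_t
sketch_envelope_get_scn(const uint8_t *frame)
{
    uint64_t scn;

    assert(frame != NULL);
    memcpy(&scn, frame + ENVELOPE_SCN_OFFSET, sizeof(scn));
    return scn;
}

/*
 * Receiver side: observe the SCN only when verification passed.
 * Returns the new local counter (unchanged if the frame was rejected).
 */
static uint64_t
sketch_envelope_observe(const uint8_t *frame, bool verified, uint64_t local_scn)
{
    uint64_t remote;

    if (!verified)
        return local_scn;
    remote = sketch_envelope_get_scn(frame) & SCN_LOCAL_MASK;
    return (remote > local_scn ? remote : local_scn) + 1;
}
```

Because the field rides in every frame, any verified message advances the receiver without a dedicated SCN message.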
BOC (Broadcast on Commit, spec-2.9 + spec-2.10): each walwriter tick calls cluster_scn_boc_tick() to bump boc_sweep_count (spec-1.17); the LMON main loop (default 1 s) calls cluster_scn_lmon_drain_boc_broadcast(), which coalesces accumulated sweeps into a single PGRAC_IC_MSG_BOC_BROADCAST=3 frame and fans it out to every alive peer (spec-2.9). Real wire cadence is bounded by LMON tick — roughly 1 fanout/s/peer.
GUC cluster.boc_sweep_interval_ms (PGC_SIGHUP, default 100 ms, range 1..1000, spec-2.10 D1) throttles the walwriter sweep cadence. Before the Path C 5-spec plan landed, the default was 1 ms (spec-1.17 v0.2); spec-2.10 chose a production-sane 100 ms, which reduces walwriter wakeups and shmem-atomic churn, not IC bandwidth (IC fanout is still capped at ~1/s by the LMON tick). The "100 μs flush" in SCN design doc §6.4 is the Stage 1 early default; it no longer reflects current Stage 2 production defaults.
| Mechanism | Advance method | Implementing spec | Observed wire frequency |
|---|---|---|---|
| Envelope Piggyback | Per-frame envelope.scn at offset 20 | spec-2.4 D4 | Tracks all IC traffic (heartbeat / GES / SCN broadcast / sinval / reconfig) |
| BOC fanout | walwriter sweep → LMON drain → IC fanout | spec-1.17 + spec-2.9 + spec-2.10 | ~1 fanout/s/peer (LMON tick cap) |
spec-2.10 §0 Q5 explicitly corrects an early misconception: "1 ms BOC sweep ≠ 1000 BOC/s on the wire." Real wire cadence equals MIN(walwriter sweep rate, LMON tick rate), which in practice is the LMON tick rate of ~1/s. BOC's value is not high frequency; it is the guarantee that even with no other cross-node traffic (an idle peer), SCN still converges. Envelope piggyback carries the primary load during active periods.
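The coalescing behaviour can be sketched as a counter that the walwriter bumps and the LMON loop drains. The struct, the drained_upto bookkeeping field, and the sketch_ function names are hypothetical; the assumption that a drain with no new sweeps sends nothing is also the sketch's own.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct SketchBocState
{
    _Atomic uint64_t boc_sweep_count;   /* bumped by each walwriter tick */
    uint64_t         drained_upto;      /* LMON-private: sweeps already broadcast */
} SketchBocState;

/* walwriter side: one sweep per tick, no network traffic. */
static void
sketch_boc_tick(SketchBocState *st)
{
    atomic_fetch_add(&st->boc_sweep_count, 1);
}

/*
 * LMON side: collapse every sweep accumulated since the last drain into one
 * broadcast. Returns the number of frames sent (one per alive peer, or 0).
 */
static int
sketch_lmon_drain(SketchBocState *st, int alive_peers)
{
    uint64_t seen = atomic_load(&st->boc_sweep_count);

    assert(alive_peers >= 0);
    if (seen == st->drained_upto)
        return 0;               /* nothing accumulated since the last drain */
    st->drained_upto = seen;    /* all pending sweeps become a single broadcast */
    return alive_peers;
}
```

Ten sweeps followed by one drain produce one frame per peer, not ten, which is the point of the "sweep rate is not wire rate" correction.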
Four-layer observability chain (spec-2.10 Q2.2): (1) walwriter sweep scn_boc_sweep_count → (2) LMON fanout scn_boc_broadcast_fanout_count → (3) receiver CAS-bump scn_observe_bump_count → (4) per-peer lamport_observe_advance_count (spec-2.4 D10). These four counters are exposed via pg_cluster_state's 'scn' category and the pg_cluster_ic_peers view.
xl_scn Optimization

Under high-concurrency write workloads, if every backend contended for the same global local_scn atomic counter at WAL insert time, a significant cache-line bouncing hot spot would emerge. pgrac avoids this with the per-thread xl_scn optimization.
spec-1.18 implements xl_scn as an optional field on commit / abort WAL records: when cluster.enabled=on and SCN_VALID(commit_scn), XactLogCommitRecord / XactLogAbortRecord set XACT_XINFO_HAS_SCN (bit 9) and append 8 bytes of xl_xact_scn immediately after xl_xact_origin (spec-1.18 D2-D4). On replay, ParseCommit/AbortRecord performs an unaligned memcpy (HC2) and calls cluster_scn_recovery_replay_observe(). XLOG_XACT_PREPARE does not carry xl_scn (PREPARE is not a durable commit point — spec-1.16 Q5); COMMIT_PREPARED / ABORT_PREPARED do (they are real durable points).
Bootstrap-mode commits and cluster.enabled=off paths return InvalidScn as commit_scn; XactLogCommitRecord sees InvalidScn and omits the HAS_SCN flag. This path is byte-identical to vanilla PG 16 WAL — preserving initdb / pg_upgrade compatibility.
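The optional-field mechanics can be sketched as follows. Only the flag position (bit 9), the 8-byte width, and the use of an unaligned memcpy come from the text; the buffer-based helpers, their names, and their parameters are hypothetical and do not reproduce XactLogCommitRecord or ParseCommitRecord.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_XINFO_HAS_SCN (UINT32_C(1) << 9)    /* bit 9, as in the text */
#define SKETCH_INVALID_SCN   UINT64_C(0)

/*
 * Append the optional 8-byte SCN to a record payload of length len.
 * Returns the new payload length; *xinfo gains the flag only for a real SCN.
 */
static size_t
sketch_append_scn(uint8_t *payload, size_t len, size_t cap,
                  uint32_t *xinfo, uint64_t commit_scn)
{
    if (commit_scn == SKETCH_INVALID_SCN)
        return len;                     /* bytes and flags stay untouched */
    assert(len + sizeof(commit_scn) <= cap);
    memcpy(payload + len, &commit_scn, sizeof(commit_scn));
    *xinfo |= SKETCH_XINFO_HAS_SCN;
    return len + sizeof(commit_scn);
}

/* Replay side: read the SCN at offset off, or return InvalidScn if absent. */
static uint64_t
sketch_parse_scn(const uint8_t *payload, size_t off, uint32_t xinfo)
{
    uint64_t scn = SKETCH_INVALID_SCN;

    if (xinfo & SKETCH_XINFO_HAS_SCN)
        memcpy(&scn, payload + off, sizeof(scn));   /* offset may be unaligned */
    return scn;
}
```

When the SCN is invalid the helper changes neither the payload nor the flags, which mirrors the byte-identical property described above.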
SCN design doc §5.5 describes "stamp xl_scn on every record header" as the AD-008 second extension's design goal (unconditional 32 B header). spec-1.18 chose a more conservative path: the on-disk record-header layout is unchanged, and xl_scn is inserted only when needed. The true advance of local_scn occurs at two serialization points: cluster_scn_advance_for_commit() (commit / abort hot path; spec-1.16 + spec-1.17 lock-free atomic fetch_add) and cluster_scn_observe() (envelope verify path + WAL replay observe path). Concurrent backends writing WAL each atomic-load current_local_scn with no contention.
local_scn lives in the node's shared memory and is volatile. If a node crashes with local_scn = N, restart must avoid recounting from 0 — that would produce SCN values smaller than those already assigned to committed transactions and violate monotonicity.
Stage 1 current implementation (spec-1.18 path): crash recovery rebuilds SCN through WAL replay. XLOG_XACT_COMMIT / XLOG_XACT_ABORT / XLOG_XACT_COMMIT_PREPARED / XLOG_XACT_ABORT_PREPARED WAL records carry the XACT_XINFO_HAS_SCN flag (bit 9) plus 8 bytes of xl_xact_scn when cluster.enabled=on and commit_scn was truly assigned (spec-1.18 D2-D4). During replay, xact_redo_commit/abort calls cluster_scn_observe(parsed.scn) when parsed.xinfo & XACT_XINFO_HAS_SCN is set, advancing current_local_scn by Lamport ≥. After complete WAL replay, current_local_scn is at least as large as any fsync'd commit SCN before the crash.
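The recovery invariant can be illustrated with a small replay loop. The parsed-record struct and function name are hypothetical, and the sketch only raises the counter to the largest SCN seen; whether the real observe call additionally adds one is not modeled here, and the invariant holds either way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_XINFO_HAS_SCN (UINT32_C(1) << 9)
#define SCN_LOCAL_MASK ((UINT64_C(1) << 56) - 1)

/* Hypothetical, already-parsed view of a commit/abort record. */
typedef struct SketchParsedXact
{
    uint32_t xinfo;
    uint64_t scn;       /* meaningful only when HAS_SCN is set */
} SketchParsedXact;

/*
 * Replay records in WAL order and return the rebuilt local_scn.
 * The counter starts at 0 because shared memory did not survive the crash.
 */
static uint64_t
sketch_rebuild_local_scn(const SketchParsedXact *recs, size_t n)
{
    uint64_t local = 0;

    assert(recs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
    {
        uint64_t seen;

        if (!(recs[i].xinfo & SKETCH_XINFO_HAS_SCN))
            continue;           /* record carries no SCN; ignore its scn field */
        seen = recs[i].scn & SCN_LOCAL_MASK;
        if (seen > local)
            local = seen;       /* never move backwards */
    }
    return local;
}
```

After the loop the counter is at least as large as every SCN found in the replayed records, regardless of the order in which they appear.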
SCN protocol design doc v1.0 §8 (2026-04-25) describes three independent mechanisms: (1) periodic 100 ms fsync of local_scn to pg_scn/instance_N.scn; (2) on restart, read persisted value P and start from P + safety_gap (default 1,000,000); (3) forced persistence at shutdown / checkpoint. None of these are implemented by any spec to date. spec-1.16 §1.3 explicitly defers persistence: "Persisting local_scn to pg_control / control file ⋯ is out of this spec's scope; spec-1.16+ (yet to be drafted) will do it. Current crash → restart resets local_scn = 0 and re-accumulates." spec-1.18 §3 chose the WAL replay path instead, replacing the standalone persistence file. safety_gap = 1,000,000 and the pg_scn/ directory are design goals, not Stage 1 implementation facts.
Stage 2 Phase 2.C (the "Path C 5-spec plan") closes the SCN subsystem with four sub-specs: spec-2.9 → 2.10 → 2.11 → 2.12. This section enumerates the observable surface and API additions.
spec-2.11 commit_scn cross-instance lookup skeleton: introduces the ClusterScnLookupResult enum (FOUND=0 / DEFER=1 / NOT_FOUND=2 / ERROR=3) and cluster_scn_lookup_commit_remote(xid, *out_commit_scn). The skeleton stub always returns DEFER and increments scn_commit_lookup_defer_count. Callers seeing DEFER MUST fall back to PG-native visibility — DEFER must never be interpreted as INVISIBLE. True activation is deferred to spec-2.26 (dual-dim visibility entry) / Stage 3. The AD-012 exception-9 invariant is preserved: heapam_visibility.c has 0 PGRAC modifications.
spec-2.12 SCN convergence boundary verification: adds GUC cluster.scn_max_propagation_lag_ms (PGC_SIGHUP, default 5000 ms, range 100..60000, GUC_UNIT_MS) plus two shmem fields (last_observe_at and observed_max_observe_gap_ms, atomic uint64, lock-free) and three SQL rows (scn_last_observe_at / scn_seconds_since_last_observe / scn_observed_max_observe_gap_ms). Skeleton stage is measure-only, no enforcement: exceeding the bound does not trigger WARNING / FATAL / overrun counter. TAP 102_scn_convergence_bound_2node.pl performs 3 rounds × bidirectional = 6 measurements to verify real propagation latency < the GUC's actual value.
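The measure-only bookkeeping can be sketched with two atomics. The field names follow the text; the struct, the sketch_ function names, millisecond timestamps, and the "0 means never observed" convention are assumptions of this illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct SketchConvergence
{
    _Atomic uint64_t last_observe_at;               /* ms timestamp; 0 = never */
    _Atomic uint64_t observed_max_observe_gap_ms;   /* widest gap seen so far */
} SketchConvergence;

/* Record an observe at now_ms and widen the maximum gap if this one is larger. */
static void
sketch_note_observe(SketchConvergence *c, uint64_t now_ms)
{
    uint64_t prev, gap, max;

    assert(now_ms > 0);
    prev = atomic_exchange(&c->last_observe_at, now_ms);
    if (prev == 0 || now_ms <= prev)
        return;     /* first observe, or clock did not advance: no gap to record */
    gap = now_ms - prev;
    max = atomic_load(&c->observed_max_observe_gap_ms);
    while (gap > max &&
           !atomic_compare_exchange_weak(&c->observed_max_observe_gap_ms, &max, gap))
        ;           /* failed CAS reloaded max; retry while gap is still larger */
}

/* Measure-only check: reports whether the bound was exceeded, enforces nothing. */
static bool
sketch_gap_exceeds_bound(SketchConvergence *c, uint64_t max_lag_ms)
{
    return atomic_load(&c->observed_max_observe_gap_ms) > max_lag_ms;
}
```

A monitoring query or test can compare the recorded maximum against cluster.scn_max_propagation_lag_ms without the server itself reacting to an overrun.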
This chapter has established the SCN conceptual framework: cross-node incomparability of LSN is the fundamental motivation for introducing a Lamport clock; the 8-byte encoding separates node_id (high 8 bits) from local_scn (low 56 bits), with the former guaranteeing global uniqueness and the latter carrying causal order; the three advance paths (commit atomic +1, observe CAS max+1, WAL stamp read-only) cover all SCN write scenarios; envelope piggyback (spec-2.4, every frame at offset 20) carries the primary load while LMON-mediated BOC fanout (spec-2.9 + 2.10) provides the idle safety net; per-thread xl_scn reduces the WAL hot path to an atomic load; and crash recovery rebuilds SCN through the spec-1.18 WAL replay path — no dedicated persistence file. The safety_gap and pg_scn/ in SCN design doc §8 remain design goals, not implementation facts.
For deep protocol details, see the following resources:
- SCN functions (scn_encode / scn_time_cmp / scn_recovery_cmp), BOC batch flush timing, persistence file layout, SCN handling during Reconfig freeze periods, complete algorithms for the visibility path and WAL k-way merge
- msg_scn piggyback field in Cache Fusion block transfer message headers; field definitions for block pd_block_scn and ITL commit_scn
- Chapter 5: Reconfiguration covers cluster state reconstruction when the node topology changes: how SCN maintains monotonicity during the freeze / unfreeze phases when nodes leave or join, how a failed node's local_scn is reconstructed from WAL recovery's xl_scn, and the complete flow by which a joining node achieves SCN convergence via KEEPALIVE + piggyback.