SCN (System Change Number) is the Lamport clock pgrac uses for cross-node transaction ordering. Each node independently maintains a monotonically increasing local counter; the counter is incremented and broadcast to other nodes on transaction commit, and on receiving a message a node advances its counter by taking the max. This simple Lamport rule serves three roles inside a pgrac cluster: it provides a causal baseline for MVCC visibility decisions, stamps WAL records with a global timestamp, and causally orders Cache Fusion and GES messages. Without SCN there is no happened-before ordering that holds across nodes; with SCN, every node can determine locally, from its own data, which transactions committed before its snapshot.
This chapter builds the conceptual framework needed to understand SCN: why PG's native LSN is insufficient, how the SCN 8-byte encoding is organized, the three paths of the Lamport advance rule, the two propagation mechanisms (BOC and Piggyback), the design motivation behind the per-thread xl_scn optimization, and the persistence and crash anti-regression mechanism. Protocol details (message formats, field definitions, the CAS algorithm, persistence file layout) are left to deep pages; this chapter establishes only the conceptual vocabulary.
PG single-instance uses LSN (Log Sequence Number) to identify WAL positions: an LSN is the byte offset within WAL files, monotonically increasing within a single instance, with clear semantics. But LSN carries no "transaction ordering" information — PG's transaction visibility is entirely determined by xmin / xmax plus CLOG; LSN is used only to identify the WAL persistence position and has no relation to "which transaction committed first."
In a pgrac cluster, this mechanism exposes two fundamental shortcomings:
First, LSN is not comparable across nodes. WAL offset 0/4A3F8B0 on node 1 and 0/4A3F8B0 on node 2 are completely independent values; they reflect no causal relationship. If a read transaction on node 2 needs to determine whether a particular commit on node 1 occurred before its own snapshot, LSN alone provides no answer.
Second, xmin / xmax visibility is local to each node. In a cluster, a transaction's XID is assigned on the commit node; other nodes cannot determine its full visibility from CLOG alone — this requires additional cross-node coordination at non-trivial cost.
pgrac follows Oracle RAC's reference design and introduces SCN as the cluster-wide Lamport clock. Each commit is assigned an SCN, written into the WAL record, the block's pd_block_scn, and the ITL slot's commit_scn; a read transaction's snapshot uses the SCN at commit time as its visibility baseline. As a result, a read transaction on any node only needs to compare the tuple's commit_scn against its own snapshot SCN to complete a visibility decision locally, without querying the commit node.
LSN and SCN coexist with non-overlapping responsibilities in pgrac: LSN continues to identify WAL physical positions (used for recovery, replication slots, and checkpoint management); SCN handles cross-node transaction causal ordering and MVCC visibility decisions. Both are written into the WAL record header, but their semantics are entirely independent.
The pgrac SCN is a 64-bit (8-byte) unsigned integer divided into two fields:
- node_id (high 8 bits): values 0–255, identifying the node that produced this SCN. The cluster supports at most 256 nodes.
- local_scn (low 56 bits): the node's local Lamport counter value.

This encoding has two important properties. First, global uniqueness is guaranteed by construction: different nodes have different node_id values, so their high 8 bits differ and identical 64-bit SCN values cannot be produced. Second, ordering comparisons use only the low 56 bits: visibility decisions compare the local_scn portion and discard the node_id high bits, because the high bits reflect the source node rather than causal order; including them in magnitude comparisons would corrupt the happens-before semantics of the Lamport clock.
+-------+--------------------------+
| node | local SCN |
| 8 bit | 56 bit |
+-------+--------------------------+
↑ ↑
 256 nodes max            ~23K years @ 100K events/sec
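Concretely, the layout reduces to shift-and-mask arithmetic. The following C sketch is illustrative only: scn_encode is a name the chapter itself uses, while the mask constants and accessor helpers are assumptions introduced here.

```c
#include <stdint.h>

typedef uint64_t Scn;

#define InvalidScn      ((Scn) 0)       /* protocol sentinel: "not yet set" */
#define SCN_LOCAL_BITS  56
#define SCN_LOCAL_MASK  ((UINT64_C(1) << SCN_LOCAL_BITS) - 1)

/* Pack an 8-bit node_id and a 56-bit local counter into one 64-bit SCN. */
static inline Scn
scn_encode(uint8_t node_id, uint64_t local_scn)
{
    return ((Scn) node_id << SCN_LOCAL_BITS) | (local_scn & SCN_LOCAL_MASK);
}

/* High 8 bits: which node produced the SCN (uniqueness, not ordering). */
static inline uint8_t
scn_node_id(Scn scn)
{
    return (uint8_t) (scn >> SCN_LOCAL_BITS);
}

/* Low 56 bits: the Lamport counter that carries happens-before order. */
static inline uint64_t
scn_local(Scn scn)
{
    return scn & SCN_LOCAL_MASK;
}
```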
The 56-bit local counter can represent approximately 72 quadrillion values (2^56 ≈ 7.2 × 10^16). In an OLTP cluster, roughly 100K commits plus 100K piggyback advances occur per second, approximately 200K SCNs per second, giving a theoretical wrap-around time of over 11,000 years (7.2 × 10^16 / 2 × 10^5 ≈ 3.6 × 10^11 seconds). This longevity is not a tuned design margin; it falls directly out of the encoding choice: the 8-bit node_id supports up to 256 nodes, which is ample for a single cluster, and all remaining 56 bits serve as the counter.
InvalidScn = 0 is a protocol-reserved sentinel meaning "not yet set," aligned with PG's InvalidTransactionId = 0 convention. All real SCN values are ≥ 1, so zero can safely be used for zero-initialized struct members.
SCN advances according to the classic Lamport clock rules. pgrac implements three advance paths:
Path 1 — Local commit: on transaction commit, the node atomically increments local_scn by one to obtain the new commit SCN, which is written into the WAL commit record, the TT slot, and the ITL slot. The commit SCN encodes the current node's node_id to form a complete 64-bit SCN.
Path 2 — Receiving an external SCN (BOC or Piggyback): upon receiving an SCN from a remote node, the node executes local_scn = max(local_scn, remote.local_scn) + 1. This max+1 operation ensures that all SCNs produced by this node from that point forward are causally later than the received message — this is precisely the happens-before guarantee of the Lamport clock.
Path 3 — Stamping xl_scn at WAL write time: when inserting a WAL record, the current local_scn is read and written into the record header's xl_scn field. This operation does not advance local_scn; it only snapshots the current value. Advances occur only on commit and piggyback.
Node 1: ─●────●─────────────●───── (commit → SCN 43)
↘ BOC(43)
Node 2: ───────●●──────────●────── (recv → max(12,43)+1 = 44)
↘ BOC(44)
Node 3: ────────────●─────●─────── (recv → max(8,44)+1 = 45)
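All three paths reduce to a handful of atomic operations on the shared counter. Below is a minimal C11-style sketch under that assumption; the function names are illustrative (the chapter itself names only receive_piggyback, rendered here as advance_on_receive).

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t local_scn;      /* node-local 56-bit Lamport counter */

/* Path 1 -- local commit: atomic +1; the incremented value is the commit SCN. */
static uint64_t
advance_on_commit(void)
{
    return atomic_fetch_add(&local_scn, 1) + 1;
}

/* Path 2 -- receiving a remote SCN (BOC or piggyback):
 * local_scn = max(local_scn, remote) + 1, done as a CAS loop so that
 * concurrent receivers can never move the counter backwards. */
static void
advance_on_receive(uint64_t remote_local_scn)
{
    uint64_t cur = atomic_load(&local_scn);
    for (;;)
    {
        uint64_t next = (cur > remote_local_scn ? cur : remote_local_scn) + 1;
        if (atomic_compare_exchange_weak(&local_scn, &cur, next))
            break;              /* on failure, cur is reloaded automatically */
    }
}

/* Path 3 -- WAL stamping: a plain atomic load; does NOT advance the clock. */
static uint64_t
stamp_xl_scn(void)
{
    return atomic_load(&local_scn);
}
```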
The Lamport rule provides causal consistency, not total real-time order. Two concurrent transactions (transactions with no causal relationship) may receive SCNs whose numeric order says nothing about their real-time order, but this is acceptable for MVCC visibility decisions: snapshot isolation correctness does not depend on ordering concurrent transactions; it only requires that causally related events be ordered.
Cross-node SCN comparison has three distinct semantics, each with its own dedicated API: temporal comparison (the visibility path) compares only the low 56 bits of local_scn, using scn_time_cmp(); total-order comparison (ITL slot ordering, deadlock detection) additionally involves the high 8-bit node_id, using scn_total_cmp(); recovery merge comparison adds a secondary tie-break of LSN + node_id when local_scn values are equal, using scn_recovery_cmp(). Application code must never compare SCN values as bare uint64 magnitudes.
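A sketch of what these comparators might look like, built on the accessors from the encoding sketch above. The semantics follow the text, but the exact signatures, in particular how scn_recovery_cmp obtains the LSNs, are assumptions.

```c
/* Temporal comparison (visibility path): only the 56-bit local_scn carries
 * causal order, so the node_id high bits are masked off via scn_local(). */
static inline int
scn_time_cmp(Scn a, Scn b)
{
    uint64_t la = scn_local(a), lb = scn_local(b);
    return (la < lb) ? -1 : (la > lb) ? 1 : 0;
}

/* Total-order comparison (ITL slot ordering, deadlock detection): breaks
 * local_scn ties with node_id so any two distinct SCNs are strictly ordered. */
static inline int
scn_total_cmp(Scn a, Scn b)
{
    int c = scn_time_cmp(a, b);
    return (c != 0) ? c : (int) scn_node_id(a) - (int) scn_node_id(b);
}

/* Recovery merge comparison: like scn_time_cmp, but ties break on the
 * record's LSN, then node_id, giving the k-way merge a deterministic order. */
static inline int
scn_recovery_cmp(Scn a, uint64_t lsn_a, Scn b, uint64_t lsn_b)
{
    int c = scn_time_cmp(a, b);
    if (c != 0)
        return c;
    if (lsn_a != lsn_b)
        return (lsn_a < lsn_b) ? -1 : 1;
    return (int) scn_node_id(a) - (int) scn_node_id(b);
}
```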
SCN propagates through the cluster via two complementary mechanisms:
BOC (Broadcast on Commit) is active propagation: after a transaction commits, the node immediately broadcasts a lightweight message carrying the commit_scn to all other nodes. BOC ensures that even when the cluster has no other cross-node traffic at that moment, a newly committed SCN promptly reaches the other nodes, keeping SCN synchronization lag bounded. To avoid message storms under high TPS, BOC uses a batch flush strategy: every 100 μs, all commit SCNs accumulated during that interval are merged into a single outgoing message, reducing per-node BOC message rates from 100K/s to at most approximately 10K/s.
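The batching logic can be simple because commit SCNs on one node are strictly increasing: keeping only the newest pending SCN subsumes every earlier one. A sketch under that assumption, before turning to piggyback; boc_note_commit, boc_flush_tick, and broadcast_boc_message are hypothetical names, not pgrac's actual routines.

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t boc_pending;    /* newest unflushed commit SCN; 0 = none */

extern void broadcast_boc_message(uint64_t scn);   /* hypothetical send routine */

/* Called on each commit: remember the newest commit SCN of this interval. */
static void
boc_note_commit(uint64_t commit_scn)
{
    uint64_t cur = atomic_load(&boc_pending);
    while (cur < commit_scn &&
           !atomic_compare_exchange_weak(&boc_pending, &cur, commit_scn))
        ;   /* CAS-max: never let an older SCN overwrite a newer one */
}

/* Called every 100 us by the flusher: at most one message per interval. */
static void
boc_flush_tick(void)
{
    uint64_t scn = atomic_exchange(&boc_pending, 0);
    if (scn != 0)
        broadcast_boc_message(scn);
}
```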
Piggyback is passive propagation: all Cache Fusion and GES message headers embed a msg_scn field carrying the sender's current local_scn. The receiver executes receive_piggyback(msg_scn) while processing the message, completing the Lamport advance at no additional message cost. In a busy OLTP cluster the cross-node message density is very high (approximately 600K messages per second at 100K TPS), so the piggyback advance frequency far exceeds BOC, typically keeping per-node SCN lag under 1 ms.
| Mechanism | Advance method | Advantage | Applicable scenario |
|---|---|---|---|
| BOC | Active broadcast | Guarantees eventual propagation; no idle window | Low-activity periods; after critical commit events |
| Piggyback | Rides existing messages | Zero additional messages; automatic high-frequency sync | Busy OLTP periods; dense CF / GES paths |
Both mechanisms run simultaneously: BOC provides the safety net, Piggyback carries the primary load. With Piggyback alone and no BOC, when two nodes in the cluster have no direct message exchange (for example, during low-activity periods), SCN synchronization can lag by seconds; the presence of BOC constrains the lag upper bound to the 100 μs batch flush interval.
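On the receive side the two mechanisms converge on the same advance path. A sketch of the piggyback hookup, assuming a simplified header layout; only the msg_scn field name comes from the text, the rest of the struct is illustrative.

```c
#include <stdint.h>

/* Simplified CF / GES message header; real layouts are defined elsewhere. */
typedef struct MsgHeader
{
    uint32_t    msg_type;
    uint32_t    msg_len;
    uint64_t    msg_scn;        /* sender's local_scn at send time */
} MsgHeader;

extern void advance_on_receive(uint64_t remote_local_scn);  /* Path 2 sketch */

static void
on_cluster_message(const MsgHeader *hdr)
{
    advance_on_receive(hdr->msg_scn);   /* Lamport max+1, zero extra messages */
    /* ... then dispatch the message payload itself ... */
}
```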
xl_scn Optimization

Under high-concurrency write workloads, if every backend contended for the same global local_scn atomic counter at WAL insert time, a significant cache-line bouncing hot spot would emerge. pgrac avoids this with the per-thread xl_scn optimization.
The xl_scn field (8 bytes) in the WAL record header is filled inside the WAL insert critical section by reading the current local_scn snapshot value — no atomic increment is performed. Multiple backends can read local_scn and write their own WAL records simultaneously without interfering with each other, because the read requires only an atomic load, not a CAS or fetch-and-add.
The true advance of local_scn occurs only at two serialization points: the commit path (atomic +1 on transaction commit) and the piggyback receive path (CAS loop advance). Both paths are already serialized and introduce no additional contention.
The result of WAL stamping is that multiple concurrent transactions at the same moment can hold identical xl_scn values (all reading the same local_scn snapshot), but the commit_scn values of different committing transactions are always distinct (the commit path's atomic +1 guarantees strict monotonicity). This has no effect on protocol correctness, because visibility decisions are based on commit_scn; the WAL record's xl_scn is used only for WAL k-way merge and SCN reconstruction during crash recovery, where the scn_recovery_cmp() function performs a deterministic tie-break ordering.
This optimization reduces the SCN-related overhead in the WAL write hot path from an atomic CAS (~30 ns) to an atomic load (~10 ns). At a scale of 100K WAL records per second, this saves approximately 2 ms of single-core overhead per second.
local_scn lives in the node's shared memory and is volatile data. If a node crashes with local_scn = N, after restart there is no direct way to know the last SCN before the crash — restarting from 0 would produce SCN values smaller than those already assigned to committed transactions, violating monotonicity and breaking visibility decisions.
pgrac guarantees crash safety through the following mechanisms:
Periodic persistence: a background worker fsyncs the current local_scn to a dedicated SCN persistence file (one per node, located in the pg_scn/ directory) every 100 ms. This ensures that at most 100 ms of SCN advance history is lost in a crash.
safety_gap compensation: on restart, the node reads the persisted SCN value P and sets its initial value to P + safety_gap (default 1,000,000). This gap covers SCN values that may have been assigned but not yet persisted between the last fsync and the crash, ensuring that all new SCNs after restart do not overlap with values already assigned before the crash.
The rationale for safety_gap = 1,000,000: even if the node ran at 100K commits/s for the full 100 ms window, at most approximately 10,000 new SCNs were produced; 1M is 100× that peak, and relative to the 72 quadrillion upper bound of the 56-bit counter, the extra "wasted" SCN values are negligible.
Equivalent standing to pg_control: the SCN persistence file is treated with the same rigor as pg_control (PostgreSQL's control file). Any event that could affect SCN monotonicity (normal shutdown, checkpoint) triggers a forced persistence rather than waiting for the next 100 ms cycle.
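Put together, the restart path is a one-liner. A sketch, where read_persisted_scn() is a hypothetical stand-in for reading the node's pg_scn/ file:

```c
#include <stdint.h>

#define SCN_SAFETY_GAP  UINT64_C(1000000)      /* default, per the text */

extern uint64_t read_persisted_scn(void);      /* hypothetical pg_scn/ reader */

/* Never resume at the persisted value itself: jump past any SCNs that may
 * have been handed out between the last fsync and the crash. */
static uint64_t
scn_startup_value(void)
{
    return read_persisted_scn() + SCN_SAFETY_GAP;
}
```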
After a node restarts, it announces its safety_gap-adjusted starting SCN to other nodes via KEEPALIVE messages; other nodes receive this and advance their own local_scn through piggyback. This process requires no dedicated "SCN negotiation" protocol — the Lamport max+1 rule naturally achieves convergence.
This chapter has established the SCN conceptual framework: cross-node incomparability of LSN is the fundamental motivation for introducing a Lamport clock; the 8-byte encoding separates node_id (high 8 bits) from local_scn (low 56 bits), with the former guaranteeing global uniqueness and the latter carrying causal order; the three advance paths (commit +1, piggyback max+1, WAL stamp read-only) cover all SCN write scenarios; BOC and Piggyback are complementary, providing propagation safety during low-activity periods and zero-overhead synchronization during high-activity periods respectively; per-thread xl_scn reduces WAL hot-path CAS contention to an atomic load; and safety_gap plus periodic fsync ensure SCN does not regress after a crash.
For deep protocol details, see the following resources:
- The SCN API (scn_encode / scn_time_cmp / scn_recovery_cmp), BOC batch flush timing, persistence file layout, SCN handling during Reconfig freeze periods, and the complete algorithms for the visibility path and WAL k-way merge.
- The msg_scn piggyback field in Cache Fusion block transfer message headers; field definitions for the block's pd_block_scn and the ITL slot's commit_scn.
- Chapter 5 — Reconfiguration covers cluster state reconstruction when the node topology changes: how SCN maintains monotonicity during the freeze / unfreeze phases when nodes leave or join, how a failed node's local_scn is reconstructed from WAL recovery's xl_scn, and the complete flow by which a joining node achieves SCN convergence via KEEPALIVE + piggyback.