This chapter describes pgrac's WAL upgrade direction on top of PG's native WAL and annotates each item with its Stage status. The three design lines come from AD-008 (Lamport SCN embedded as optional field on commit/abort WAL records — xl_scn), AD-009 (per-instance redo thread, shared-storage WAL streams, k-way SCN merged replay), and AD-006 (MVCC introduces ITL / TT / Undo records into WAL). Together these target semantic alignment with Oracle Redo Records: commit/abort records carry SCN, every thread is monotonic within itself, and multiple threads across instances are merged and replayed in SCN order.
Stage 1 as shipped (spec-1.18 + 1.19) delivers two concrete WAL-layer changes: an optional +8 B xl_xact_scn on commit/abort WAL records (gated by XACT_XINFO_HAS_SCN, bit 9), and a 4-byte WAL page header placeholder (xlp_thread_id + xlp_cluster_flags, reusing MAXALIGN padding so the on-disk page header stays 24 B and catversion is not bumped). Stage 4 (planned: spec-4.1 through 4.13) is where per-instance WAL streams (pg_wal_threads/thread_NN/), k-way SCN merge replay, the new RMGRs (ITL / TT / Undo / PCM / BOC / Reconfig / Generic), the Cache Fusion Write-Ahead Rule, and cluster crash recovery are actually activated. This chapter respects that boundary: Stage 1 implementation facts and Stage 4 design goals are annotated separately.
PG's native WAL is a single-stream, single-LSN-sequence design: all backends insert into one shared WAL stream, and recovery replays serially in a single process. AD-009 extends this model to per-instance independent streams: each node has its own pg_wal_threads/thread_NN/ directory (WAL design doc §4.2 / spec-1.19 §235; mapping rule thread_id = node_id + 1, keeping 0 as legacy sentinel), writes on different nodes do not block each other, and recovery merges the streams globally in SCN order. Stage 1's current implementation still uses a single pg_wal/ directory and a single stream; per-instance WAL routing is deferred to Stage 4.1 (development-roadmap §4.3).
| Dimension | PG native | pgrac Stage 1 current | pgrac Stage 4 design target |
|---|---|---|---|
| WAL directory | pg_wal/ | pg_wal/ (unchanged) | pg_wal_threads/thread_NN/ |
| Number of streams | 1 | 1 | Number of nodes (typically 2–8) |
| LSN space | Single LSN sequence | Single LSN sequence | Independent LSN per stream; global order by xl_xact_scn |
| WAL file visibility | Local only | Local only | Shared storage, readable from any node |
| WAL record header (on-disk) | 24 B | 24 B (commit/abort may carry optional +8 B xl_xact_scn, gated by bit 9 flag) | 32 B unconditional (AD-008 second extension target) |
| WAL page header (on-disk) | 24 B | 24 B (spec-1.19 reuses padding to add xlp_thread_id 2 B + xlp_cluster_flags 2 B; no catversion bump) | same as Stage 1 (spec-1.19 permanent placeholder) |
| Record types | heap / btree / xact, etc. | Native types only (no new RMGR shipped) | + ITL / TT / Undo / PCM / BOC / Reconfig / Generic (7–8 new RMGRs; WAL design doc §6.1) |
| Recovery mode | Single-process serial replay | Single-process serial replay + xl_xact_scn observe to rebuild SCN (spec-1.18 D7) | K-way SCN merge replay (Stage 4.5) |
| WAL volume (DML-heavy) | baseline | baseline + 8 B per commit/abort (negligible) | ~5.5× projection (WAL design doc §9.3; includes ITL + Undo + new RMGR) |
PG's native XLogRecord is a fixed 24 B: 4 B length, 4 B xid, 8 B prev LSN, 1 B info, 1 B rmid, 2 B padding, 4 B CRC. These 24 B carry no SCN or global ordering information — a single monotonic LSN is sufficient for a single stream, but cannot establish global order across multiple streams.
Stage 1 implementation (spec-1.18): the record header on-disk layout is unchanged. xl_scn is implemented as an optional 8-byte field on commit / abort records: when cluster.enabled=on and a commit_scn was truly assigned, XactLogCommitRecord / XactLogAbortRecord set XACT_XINFO_HAS_SCN (bit 9) and append 8 B of xl_xact_scn immediately after xl_xact_origin. XLOG_XACT_PREPARE does not carry xl_scn (PREPARE is not a durable commit point — spec-1.16 Q5); XLOG_XACT_COMMIT_PREPARED / XLOG_XACT_ABORT_PREPARED do (they are real durable points). Bootstrap mode and cluster.enabled=off paths omit the HAS_SCN flag — making this path byte-identical to vanilla PG 16 and preserving initdb / pg_upgrade compatibility.
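The optional-field rule above can be sketched in a few lines of C. This is a minimal illustration, not pgrac's code: the helper names `demo_append_scn` and `demo_read_scn` are hypothetical, the macro value is simply "bit 9" as stated in the text, and the byte buffer stands in for the record tail that follows xl_xact_origin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bit 9 of xinfo, as stated in the text; spelled out here for illustration,
 * not copied from pgrac headers. */
#define XACT_XINFO_HAS_SCN (1U << 9)

/* Hypothetical helper: append the optional 8-byte SCN to a record tail.
 * Returns the number of bytes written (0 when the flag stays clear). */
static size_t
demo_append_scn(uint8_t *tail, uint32_t *xinfo, bool cluster_enabled,
                bool scn_assigned, uint64_t commit_scn)
{
    if (!cluster_enabled || !scn_assigned)
        return 0;                       /* layout stays identical to vanilla */
    *xinfo |= XACT_XINFO_HAS_SCN;
    memcpy(tail, &commit_scn, sizeof(commit_scn));
    return sizeof(commit_scn);
}

/* Hypothetical helper: read the SCN back only when the flag says it exists. */
static bool
demo_read_scn(const uint8_t *tail, uint32_t xinfo, uint64_t *scn_out)
{
    if ((xinfo & XACT_XINFO_HAS_SCN) == 0)
        return false;                   /* record carries no SCN */
    memcpy(scn_out, tail, sizeof(*scn_out));
    return true;
}
```

The point of the sketch is that a reader never has to guess: the flag alone decides whether the extra 8 bytes are present, which is what lets the cluster-off path stay unchanged.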
The ClusterXLogRecord described in WAL design doc §5 (unconditional 32 B header, xl_scn on every record) is the AD-008 second extension target state. Stage 1 does not implement this; spec-1.18 chose the more conservative optional-flag path so that vanilla PG compatibility and SCN-aware crash recovery can coexist.
Stage 1 implementation (spec-1.19, WAL Page Header): reuses the existing 4 B MAXALIGN padding (offset 20) in XLogPageHeaderData to add two fields: xlp_thread_id (uint16, offset 20) + xlp_cluster_flags (uint16, offset 22). The page header on-disk size remains 24 B, with no catversion bump. During Stage 1, xlp_thread_id is permanently hard-coded to XLP_THREAD_ID_LEGACY = 0 (legacy sentinel); when per-instance routing activates in Stage 4 (spec-4.1), the mapping is thread_id = node_id + 1 (reserving 0 for legacy). xlogreader.c adds a validator hook to enforce the Stage 1 invariant (thread_id must be LEGACY, flags must be RESERVED).
Shared PG / pgrac 24 B record header (on-disk unchanged):
+--------+--------+--------+--------+--------+--------+
| xl_tot_len (4) | xl_xid (4) | xl_prev (8) |
+--------+--------+--------+--------+--------+--------+
| xl_info | xl_rmid | (padding) | xl_crc (4) |
+--------+--------+--------+--------+--------+--------+
spec-1.18 optional commit/abort extension (XACT_XINFO_HAS_SCN bit 9):
+--------+--------+--------+--------+
| xl_xact_scn (8 B) | ← immediately after xl_xact_origin
+--------+--------+--------+--------+
spec-1.19 WAL Page Header reuses MAXALIGN padding (on-disk still 24 B):
+----+----+----+----+----+----+----+----+
| magic | info | tli | bytes 0-7
+----+----+----+----+----+----+----+----+
| pageaddr (8) | bytes 8-15
+----+----+----+----+----+----+----+----+
| rem_len (4) | tid (2) | flags (2) | bytes 16-23
+----+----+----+----+----+----+----+----+
tid = xlp_thread_id, flags = xlp_cluster_flags
(Stage 1: xlp_thread_id = 0, the legacy sentinel)
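The page header layout drawn above can be checked at compile time. The struct below is an illustrative mirror of the described fields rather than the real XLogPageHeaderData, and both `DEMO_CLUSTER_FLAGS_RESERVED` (assumed to be 0) and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define XLP_THREAD_ID_LEGACY        0
#define DEMO_CLUSTER_FLAGS_RESERVED 0   /* assumption: the reserved value is 0 */

/* Illustrative mirror of the 24-byte page header described in the text. */
typedef struct DemoPageHeader {
    uint16_t xlp_magic;          /* bytes 0-1   */
    uint16_t xlp_info;           /* bytes 2-3   */
    uint32_t xlp_tli;            /* bytes 4-7   */
    uint64_t xlp_pageaddr;       /* bytes 8-15  */
    uint32_t xlp_rem_len;        /* bytes 16-19 */
    uint16_t xlp_thread_id;      /* bytes 20-21, formerly padding */
    uint16_t xlp_cluster_flags;  /* bytes 22-23, formerly padding */
} DemoPageHeader;

/* The two new fields fit in the old padding, so the size does not change. */
static_assert(offsetof(DemoPageHeader, xlp_thread_id) == 20, "thread_id offset");
static_assert(offsetof(DemoPageHeader, xlp_cluster_flags) == 22, "flags offset");
static_assert(sizeof(DemoPageHeader) == 24, "page header stays 24 B");

/* Mapping rule from the text: 0 is reserved for legacy pages. */
static uint16_t
demo_thread_id_for_node(uint16_t node_id)
{
    return (uint16_t) (node_id + 1);
}

/* Stage 1 invariant: thread_id must be LEGACY and flags must be RESERVED. */
static bool
demo_stage1_header_ok(const DemoPageHeader *h)
{
    return h->xlp_thread_id == XLP_THREAD_ID_LEGACY &&
           h->xlp_cluster_flags == DEMO_CLUSTER_FLAGS_RESERVED;
}
```

A validator of this shape rejects any page that claims a non-legacy thread before per-instance routing exists, so a stray non-zero value is caught at read time.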
WAL insert critical section (spec-1.16 + 1.17): the commit path's cluster_scn_advance_for_commit() obtains commit_scn via a lock-free pg_atomic_fetch_add_u64 (spec-1.17 Q1 removes LWLock from the hot path) and passes it to XactLogCommitRecord. The cluster_scn_observe() envelope-verify path and the WAL replay observe path both use a lock-free CAS retry loop (spec-1.16 + spec-2.12 P1.1 internalization) — introducing LWLock into these hot paths is forbidden.
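The two lock-free patterns named above, a single fetch-add on the commit path and a CAS retry loop on the observe path, can be sketched with C11 atomics. This is an assumption-laden illustration: it uses `<stdatomic.h>` instead of PG's pg_atomic_* wrappers, the `demo_` names are hypothetical, and whether pgrac hands out the pre-increment or post-increment value as commit_scn is not stated in this chapter (the sketch returns the post-increment value).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for the node-local SCN counter. */
static _Atomic uint64_t demo_local_scn;

/* Commit path: one atomic add, no lock. Returns the newly assigned SCN. */
static uint64_t
demo_scn_advance_for_commit(void)
{
    return atomic_fetch_add(&demo_local_scn, 1) + 1;
}

/* Observe path (Lamport rule): raise the local SCN to at least `remote`.
 * A failed CAS reloads `cur`, so the loop re-checks against the fresh value
 * and exits as soon as another thread has already advanced far enough. */
static void
demo_scn_observe(uint64_t remote)
{
    uint64_t cur = atomic_load(&demo_local_scn);

    while (cur < remote &&
           !atomic_compare_exchange_weak(&demo_local_scn, &cur, remote))
        ;   /* retry with the value written back into cur */
}
```

Neither function can move the counter backwards, which is the property the replay-time observe path relies on when it rebuilds the SCN from WAL.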
The 7–8 new RMGRs listed below come from WAL design doc §6.1 (2026-04-25 v1.0) and are intended to carry write-ahead for cluster-specific operations: ITL slot, TT slot, Undo record, PCM transition, BOC record, Reconfig event, and generic cluster events. As of Stage 2 closure, none of these RMGRs is implemented in linkdb. spec-1.20-1.22 (TT slot typedef + undo segment header placeholder) only ship the data-structure typedefs without integrating into the commit path or introducing the corresponding RMGRs. The true activation of new RMGRs is deferred to the corresponding Stage 3-4 specs. This section presents the design goal, not what the current binary implements.
AD-006 design goal: ITL slot, TT slot, and undo record (three cluster-specific operation classes) must be written to WAL — this is the foundation of write-ahead undo correctness. WAL design doc plans 7-8 RMGRs that extend the PG native RMGR framework while maintaining binary compatibility:
| RMGR | Purpose |
|---|---|
| RM_CLUSTER_ITL_ID | ITL slot write / cleanout / commit_scn writeback |
| RM_CLUSTER_TT_ID | TT slot alloc / commit / abort / wrap reuse |
| RM_CLUSTER_UNDO_ID | Undo record write / segment state change |
| RM_CLUSTER_PCM_ID | PCM lock transitions (off by default; enabled only for debug / reconfig) |
| RM_CLUSTER_BOC_ID | Broadcast on Commit SCN broadcast record |
| RM_CLUSTER_RECONFIG_ID | Reconfig phase switch / Coordinator election |
| RM_CLUSTER_GEN_ID | General cluster event extension point |
Typical OLTP transaction (5 DML statements) WAL volume projection (WAL design doc §9.3): PG native ~230 B / tx, pgrac design target ~1,280–1,300 B / tx (~5.5×). The increment comes mainly from ITL write records (~54 B × 5) and undo write records (~112 B × 5); header overhead is a small fraction. This is a design-goal projection, not measured data. Stage 1's current measurement (spec-1.23 pgbench TPC-B baseline, 27 combos) shows that the cluster-ON path costs about 12% TPS at scale=100 c8. The candidate root causes are the spec-1.17 BOC tick, the spec-1.16 commit hot path, and the spec-1.18 xl_scn write; optimization is deferred to Stage 6 perf hardening.
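The projection's arithmetic can be spelled out as a short calculation. It uses only the figures quoted above; the function names are hypothetical, and the "unitemized" remainder is simply what is left after subtracting the two terms this chapter does itemize.

```c
#include <assert.h>

/* Figures quoted in the paragraph above (all approximate projections). */
enum {
    DEMO_PG_NATIVE_BYTES  = 230,  /* PG native WAL per transaction */
    DEMO_DML_PER_TX       = 5,
    DEMO_ITL_WRITE_BYTES  = 54,   /* per ITL write record  */
    DEMO_UNDO_WRITE_BYTES = 112   /* per undo write record */
};

/* Bytes added per transaction by ITL and undo write records alone. */
static int
demo_itl_undo_bytes(void)
{
    return DEMO_DML_PER_TX * (DEMO_ITL_WRITE_BYTES + DEMO_UNDO_WRITE_BYTES);
}

/* Part of a projected total that the two itemized terms do not explain. */
static int
demo_unitemized_bytes(int projected_total)
{
    return projected_total - DEMO_PG_NATIVE_BYTES - demo_itl_undo_bytes();
}
```

With these inputs the ITL and undo terms contribute 830 B per transaction, which leaves roughly 220 to 240 B of the 1,280–1,300 B target for record types that this chapter does not break down.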
Every ITL (Interested Transaction List) slot write, cleanout, and commit_scn writeback must produce a corresponding WAL record, enabling recovery to precisely reconstruct the per-block transaction state:
/* ITL slot write (produced alongside the heap record during DML) */
typedef struct ItlWriteRecord {
    BlockNumber         target_block;  /* 4 B   target block address */
    uint8               itl_slot_idx;  /* 1 B   slot index (0-based) */
    uint8               info;          /* 1 B   CREATE / UPDATE / CLEANUP */
    ClusterItlSlotData  slot_data;     /* 48 B  complete ITL slot contents */
    /* total: ~54 B */
} ItlWriteRecord;

/* ITL cleanout (reader-triggered deferred cleanup; must also go through WAL) */
typedef struct ItlCleanoutRecord {
    BlockNumber         target_block;  /* 4 B */
    uint8               itl_slot_idx;  /* 1 B */
    SCN                 commit_scn;    /* 8 B   commit_scn written back to the ITL slot */
    /* total: ~16 B */
} ItlCleanoutRecord;
ITL cleanout is a reader-triggered dirty write (writing commit_scn back into the block header) and must go through WAL; otherwise recovery cannot distinguish block state before and after cleanout. TT slot alloc / commit / abort / reuse each have their own corresponding records, together forming the complete transaction state chain.
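The approximate totals in the struct comments can be related to field sizes and alignment with a small check. The types here are stand-ins: `DemoItlSlotData` is an opaque 48-byte placeholder because the real ClusterItlSlotData layout is not given in this chapter, and `SCN` is assumed to be a 64-bit integer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t DemoBlockNumber;
typedef uint64_t DemoSCN;                       /* assumption: SCN is 64-bit */
typedef struct DemoItlSlotData {
    uint8_t bytes[48];                          /* opaque 48 B placeholder */
} DemoItlSlotData;

typedef struct DemoItlWriteRecord {
    DemoBlockNumber target_block;
    uint8_t         itl_slot_idx;
    uint8_t         info;
    DemoItlSlotData slot_data;
} DemoItlWriteRecord;

typedef struct DemoItlCleanoutRecord {
    DemoBlockNumber target_block;
    uint8_t         itl_slot_idx;
    DemoSCN         commit_scn;
} DemoItlCleanoutRecord;

/* Sum of the field sizes, ignoring compiler padding. */
#define DEMO_ITL_WRITE_PAYLOAD \
    (sizeof(DemoBlockNumber) + 2 * sizeof(uint8_t) + sizeof(DemoItlSlotData))
#define DEMO_ITL_CLEANOUT_PAYLOAD \
    (sizeof(DemoBlockNumber) + sizeof(uint8_t) + sizeof(DemoSCN))

static_assert(DEMO_ITL_WRITE_PAYLOAD == 54, "4 + 1 + 1 + 48");
static_assert(DEMO_ITL_CLEANOUT_PAYLOAD == 13, "4 + 1 + 8");
/* Padding before the 8-byte SCN rounds the cleanout struct up to 16 B. */
static_assert(sizeof(DemoItlCleanoutRecord) == 16, "padded cleanout record");
```

Under these assumptions the ~54 B figure matches the raw field sum of the write record, while the ~16 B figure for the cleanout record matches its padded in-memory size rather than its 13 B field sum.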
The Cache Fusion block transfer protocol (feature-019 / cache-fusion-protocol-design) is planned for true activation in Stage 5; the gcs_wa.c send path, the synchronous XLogFlush(pi_lsn) call, and the batch fsync optimization described here are design targets. What Stage 2 has shipped is the IC envelope frame + epoch enforce + Lamport piggyback + GES daemon skeleton (spec-2.4 / 2.13 / 2.18-2.23); the 3-way Cache Fusion protocol, the PI chain, and the Write-Ahead Rule are deferred to Stage 5.
Before Cache Fusion transfers a dirty block, the corresponding WAL must already be fsync'd — this is the core constraint specified by feature-019 (Write-ahead global version) and the foundation of all Cache Fusion durability guarantees:
PG single-node rule:
Before dirty block is flushed to disk → corresponding WAL must be fsync'd
pgrac / Cache Fusion extended rule:
Before dirty block is transferred cross-node → corresponding WAL must be fsync'd
(including the case where the receiver continues to modify the block)
In the implementation, the Cache Fusion send path (gcs_wa.c) — after packing the block data and before calling the Interconnect to send — synchronously calls XLogFlush(pi_lsn), waiting for local WAL to be durably persisted. A batch optimization (group commit-like) allows multiple pending blocks to share a single fsync, reducing I/O count.
Write-ahead is a synchronous blocking operation: the Cache Fusion send path suspends until XLogFlush completes. If LGWR stalls and times out, reconfiguration is triggered; WAL synchronization cannot be downgraded or skipped (correctness takes priority over availability).
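The flush-before-send ordering and the batch optimization can be sketched as follows. Everything here is a simulation under stated assumptions: `demo_xlog_flush` is a stand-in that only records the durable position (it is not PG's XLogFlush), the counters exist purely so the behavior can be observed, and the batching policy (flush once to the highest pi_lsn in the batch) is one plausible reading of "share a single fsync".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t DemoLsn;

static DemoLsn demo_flushed_lsn;  /* simulated durable WAL position */
static int     demo_fsync_count;  /* how many simulated fsyncs happened */
static int     demo_sent_count;   /* how many blocks were "sent" */

/* Stand-in for a synchronous WAL flush up to `lsn`. */
static void
demo_xlog_flush(DemoLsn lsn)
{
    if (lsn <= demo_flushed_lsn)
        return;                   /* already durable, no I/O needed */
    demo_flushed_lsn = lsn;
    demo_fsync_count++;
}

typedef struct DemoPendingBlock {
    uint32_t block_no;
    DemoLsn  pi_lsn;              /* WAL position the block image depends on */
} DemoPendingBlock;

/* Write-Ahead Rule: a block may leave the node only after its WAL is durable. */
static void
demo_send_block(const DemoPendingBlock *b)
{
    assert(b->pi_lsn <= demo_flushed_lsn);
    demo_sent_count++;
}

/* Batch variant: one flush to the highest pi_lsn covers every pending block. */
static void
demo_send_batch(const DemoPendingBlock *blocks, size_t n)
{
    DemoLsn max_lsn = 0;

    for (size_t i = 0; i < n; i++)
        if (blocks[i].pi_lsn > max_lsn)
            max_lsn = blocks[i].pi_lsn;

    demo_xlog_flush(max_lsn);     /* blocks until durable in a real system */

    for (size_t i = 0; i < n; i++)
        demo_send_block(&blocks[i]);
}
```

Because WAL is durable in LSN order, flushing to the maximum pi_lsn makes every smaller pi_lsn durable too, which is why three pending blocks can be covered by a single flush in this model.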
WAL fsync latency is one of the primary components of Cache Fusion end-to-end latency. On NVMe shared storage, a typical WAL fsync completes in under 200 μs, which is still well above the ~5 μs RDMA round-trip of the Interconnect, so one flush costs more than one message hop. Even so, the design expects the overall latency bottleneck to be the number of GRD master coordination rounds rather than Write-Ahead.
Per-instance WAL routing (spec-4.1), k-way SCN merge replay (spec-4.5), Recovery Coordinator-MRP (spec-4.4), the 5-subsystem crash recovery, online block / thread recovery, and split-brain guard belong to the 13 Stage 4 specs (development-roadmap §4.3) and are not yet implemented. Stage 1's current crash recovery still uses the PG native single-process serial replay; only on commit/abort records does spec-1.18 D7 (the cluster_scn_recovery_replay_observe wrapper) rebuild current_local_scn. This section describes the Stage 4 design goal.
Each pgrac node maintains a thread-local WAL writer (LGWR) that independently manages that node's pg_wal_threads/thread_NN/ directory. The recommended WAL buffer size is 32 MB (PG's auto-tuned wal_buffers default tops out at 16 MB, the size of one WAL segment) to accommodate concurrent writes of more record types. The xl_scn monotonicity invariant guarantees that, within this thread, a record with a larger LSN always has an xl_scn no smaller than that of a record with a smaller LSN.
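The per-thread invariant stated above is easy to express as a checker over one stream. This is a hypothetical diagnostic sketch, not a pgrac function: `DemoRec` holds only the two fields the invariant mentions, and the input is assumed to contain just the records that carry an SCN, already in LSN order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct DemoRec {
    uint64_t lsn;      /* position within this thread's stream */
    uint64_t xl_scn;   /* SCN carried by the record */
} DemoRec;

/* Returns the index of the first record whose xl_scn is smaller than that of
 * its predecessor in LSN order, or -1 when the invariant holds throughout. */
static long
demo_first_scn_regression(const DemoRec *recs, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (recs[i].lsn > recs[i - 1].lsn &&
            recs[i].xl_scn < recs[i - 1].xl_scn)
            return (long) i;
    return -1;
}
```

Equal SCNs on consecutive records are accepted, matching the "no smaller than" wording of the invariant.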
After a crash, the surviving node (or the newly elected recovery coordinator) reads the WAL streams of all failed nodes and reconstructs a consistent state via K-way SCN merge replay:
/* K-way merge (pseudocode): priority queue sorted ascending by xl_scn */
PriorityQueue pq;                /* min-heap, ordered by head_record.xl_scn */
for (each thread t in merge_set) {
    t.head = read_next(t);
    if (t.head) pq.push(t);
}
while (!pq.empty()) {
    t = pq.pop_min();            /* stream whose head has the globally smallest xl_scn */
    dispatch_record(t.head);     /* dispatch by block hash + apply redo */
    t.head = read_next(t);       /* advance that stream */
    if (t.head) pq.push(t);      /* re-queue it unless it is exhausted */
}
Correctness depends on two guarantees: per-thread xl_scn monotonicity (WAL insert lock protocol, §12.2); and Cache Fusion serialization (cross-thread SCN monotonicity for the same block, Write-Ahead Rule, §12.4). Together these ensure that the merged order is consistent with the original write order.
Per-instance redo streams (each node writes independently):
Node 1: ●─●─●─●──────●── (thread A: SCN 42, 43; thread B: 44, 50; thread C: 61)
Node 2: ●─●─●────●─●───── (thread D: SCN 12; thread E: 44, 45; thread F: 55, 60)
Node 3: ●─●────────────── (thread G: SCN 8, 47)
↓ sort and merge by SCN ↓
Merged: ●─●─●─●─●─●─●─●─●─●─●─● (replay order; 12 records)
SCN: 8 12 42 43 44 44 45 47 50 55 60 61 (SCN 44 occurs once on Node 1 and once on Node 2)
Apply: redo each record in order. Each record is applied only to its corresponding block;
per-thread order is preserved within each stream (guaranteed by thread_id).
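The merge illustrated above can be written as a small runnable C routine. It is a sketch under assumptions, not the Stage 4 implementation: each stream is reduced to an array of xl_scn values, "dispatch" just appends to an output array, the heap holds at most `DEMO_MAX_STREAMS` streams, and ties between equal SCNs are broken by thread_id, which is a choice made here for determinism rather than something this chapter specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_STREAMS 8

typedef struct DemoStream {
    uint16_t        thread_id;
    const uint64_t *scns;    /* xl_scn of each record, ascending */
    size_t          len;
    size_t          pos;     /* index of the current head record */
} DemoStream;

/* Heap order: smaller head SCN first; thread_id breaks ties (assumption). */
static int
demo_less(const DemoStream *a, const DemoStream *b)
{
    uint64_t sa = a->scns[a->pos], sb = b->scns[b->pos];

    if (sa != sb)
        return sa < sb;
    return a->thread_id < b->thread_id;
}

static void
demo_sift_down(DemoStream **heap, size_t n, size_t i)
{
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, m = i;

        if (l < n && demo_less(heap[l], heap[m])) m = l;
        if (r < n && demo_less(heap[r], heap[m])) m = r;
        if (m == i)
            return;
        DemoStream *tmp = heap[i]; heap[i] = heap[m]; heap[m] = tmp;
        i = m;
    }
}

/* Writes SCNs in replay order into out[]; returns how many were produced. */
static size_t
demo_kway_merge(DemoStream *streams, size_t nstreams,
                uint64_t *out, size_t out_cap)
{
    DemoStream *heap[DEMO_MAX_STREAMS];
    size_t n = 0, produced = 0;

    for (size_t i = 0; i < nstreams && n < DEMO_MAX_STREAMS; i++) {
        streams[i].pos = 0;
        if (streams[i].len > 0)
            heap[n++] = &streams[i];
    }
    for (size_t i = n / 2; i-- > 0;)            /* build the min-heap */
        demo_sift_down(heap, n, i);

    while (n > 0 && produced < out_cap) {
        DemoStream *t = heap[0];                /* globally smallest head */

        out[produced++] = t->scns[t->pos++];    /* stand-in for dispatch */
        if (t->pos == t->len)
            heap[0] = heap[--n];                /* stream exhausted */
        demo_sift_down(heap, n, 0);             /* restore heap order */
    }
    return produced;
}
```

Fed the three streams from the diagram (42, 43, 44, 50, 61; 12, 44, 45, 55, 60; 8, 47), the routine emits twelve records, with SCN 44 appearing twice because it occurs in two streams. Each pop costs O(log K) for K streams, so replay ordering adds little beyond reading the records themselves.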
Merged recovery uses K-way SCN merge in both cluster instance failure (partial node crash) and PITR scenarios. Single-machine crash recovery (only one stream) degrades to PG's native single-threaded replay; the only overhead is the new RMGR redo handlers (~+5% time). This algorithm is not implemented in Stage 1; when Stage 4.5 truly activates it, the input data format is provided by the spec-1.18 xl_xact_scn + spec-1.19 xlp_thread_id placeholder described in §12.2 of this chapter.
For deeper design details and related features:
- ClusterXLogRecord / ClusterXLogPageHeader C structs, 7–8 new RMGR redo handler implementations, WAL insert critical section pseudocode, pg_cluster_wal_stats / pg_cluster_wal_scn_check view fields, WAL compression extension (lz4 2.1× ratio)
- xl_scn semantics (commit_scn vs local_scn), Lamport advance protocol, BOC piggyback update path, formal proof of per-thread monotonicity invariant
- pi_lsn and flushed_lsn, batch fsync optimization parameters
- RM_CLUSTER_RECONFIG_ID records written during phase switches, Coordinator state restored after recovery replay