The previous chapter (Ch 11) described the buffer pool three-copy model (XCUR / SCUR / PI) and PCM lock coordination: when dirty buffers are produced and when they are evicted. This chapter goes deeper into the WAL layer — the durability threshold that every dirty block must cross before reaching disk — and how pgrac implements per-instance redo streams on top of PG's native WAL, extends the record header, introduces cluster-specific record types, and safely replays across nodes via merged recovery after a crash.
The core motivation for pgrac's WAL upgrade comes from three directions: AD-009 (per-instance redo thread — each node has its own independent WAL stream), AD-008 (Lamport SCN embedded in the WAL record header — xl_scn 8 B extension), and AD-006 (MVCC introduces ITL / TT / Undo records). Together these align pgrac WAL semantically with Oracle Redo Records: every record carries an SCN, every thread is monotonically increasing within itself, and multiple threads are merged and replayed in SCN order.
PG's native WAL is a single-stream + single-thread LSN design: the entire cluster (effectively a single machine) writes to consecutive segment files under pg_wal/, all backends share a single WAL insert lock, and recovery replays serially in a single process. pgrac extends this model to per-instance independent streams: each node has its own pg_wal_node_N/ directory, writes do not block each other, and on recovery the streams are merged globally in SCN order.
| Dimension | PG native | pgrac |
|---|---|---|
| WAL directory | pg_wal/ (single directory) | pg_wal_node_1/ … pg_wal_node_N/ (one directory per node) |
| Number of streams | 1 | Number of nodes (typically 2–8) |
| LSN space | Single LSN sequence shared across the cluster | Independent LSN per stream; global order determined by xl_scn |
| WAL file visibility | Local node only | All files on shared storage, readable from any node |
| WAL record header | 24 B (no SCN) | 32 B (includes xl_scn 8 B; page header includes thread_id) |
| Record types | heap / btree / xact, etc. | Native types + ITL / TT / Undo / PCM / BOC / Reconfig |
| Recovery mode | Single-process serial replay | K-way SCN merge replay (cross-stream, sorted by SCN) |
| WAL volume (DML-heavy) | baseline | ~5.5× (header overhead + ITL + Undo records) |
PG's native XLogRecord header is a fixed 24 B: 4 B xl_tot_len, 4 B xl_xid, 8 B xl_prev, 1 B xl_info, 1 B xl_rmid, 2 B padding, 4 B xl_crc. These 24 B carry no timestamp or global ordering information; recovery can only advance linearly by LSN, which is fine for a single stream but provides no way to establish global order across multiple streams.
pgrac appends 8 B of xl_scn after the existing 24 B in ClusterXLogRecord, forming a 32 B header. In a commit record xl_scn holds commit_scn; in all other records it holds the local_scn at the time of writing, atomically read within the WAL insert lock to guarantee the per-thread monotonicity invariant (AD-008 second extension): within the same thread, if R1.LSN < R2.LSN then R1.xl_scn ≤ R2.xl_scn.
thread_id is not placed in every record header (to avoid redundancy); instead it lives in the xlp_thread_id field (2 B) of the WAL page header: all pages in the same stream share the same thread_id, so recovery can read it once from the page header rather than per-record.
PG native 24 B (retained for compatibility):
+-----------------+-----------------+-----------------+
| xl_tot_len (4)  |   xl_xid (4)    |   xl_prev (8)   |
+-------------+-------------+-----------+-------------+
| xl_info (1) | xl_rmid (1) |  pad (2)  |  xl_crc (4) |
+-------------+-------------+-----------+-------------+
pgrac extension 8 B:
+-----------------------------------------------------+
|                     xl_scn (8)                      |
+-----------------------------------------------------+
Total = 32 B (thread_id is carried in the page header, not per record)
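The layout above can be sketched as a C struct. The native xl_* field names match PG's XLogRecord; the padding field name and the SCN typedef are assumptions for illustration, not pgrac's actual definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef uint64_t SCN;          /* assumed 64-bit Lamport SCN type */

typedef struct ClusterXLogRecord {
    /* native 24 B prefix, unchanged for binary compatibility */
    uint32_t    xl_tot_len;    /* total record length incl. header   */
    uint32_t    xl_xid;        /* transaction id                     */
    XLogRecPtr  xl_prev;       /* LSN of previous record, same stream */
    uint8_t     xl_info;       /* flag bits                          */
    uint8_t     xl_rmid;       /* resource manager id                */
    uint16_t    xl_pad;        /* 2 B padding, keeps xl_crc aligned  */
    uint32_t    xl_crc;        /* CRC of the record                  */
    /* pgrac extension: 8 B, bringing the header to 32 B */
    SCN         xl_scn;        /* commit_scn in commit records,
                                  local_scn in all other records    */
} ClusterXLogRecord;
```

With natural alignment on a 64-bit ABI the extension lands at offset 24 and the whole header is exactly 32 B, matching the diagram above.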
Within the WAL insert critical section, the three steps — "read local_scn + allocate LSN slot + write xl_scn" — are executed atomically, ensuring xl_scn monotonicity is not broken by concurrent writes. When BOC (Broadcast on Commit) piggybacks an update to local_scn, the WAL insert lock must also be held, preventing external SCN advances from interleaving with local writes.
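The critical-section protocol can be sketched as follows. This is a minimal single-stream model under a plain mutex; all function and variable names (wal_insert, boc_advance, insert_lsn) are illustrative, not pgrac's actual symbols:

```c
#include <pthread.h>
#include <stdint.h>

typedef uint64_t SCN;
typedef uint64_t XLogRecPtr;

static pthread_mutex_t wal_insert_lock = PTHREAD_MUTEX_INITIALIZER;
static SCN        local_scn  = 100;   /* node-local Lamport clock      */
static XLogRecPtr insert_lsn = 0;     /* next free byte in this stream */

typedef struct { XLogRecPtr lsn; SCN xl_scn; } StampedRecord;

/* The three steps -- read local_scn, allocate the LSN slot, stamp
 * xl_scn -- execute under one lock, so LSN order implies SCN order
 * within the thread (the AD-008 monotonicity invariant). */
StampedRecord wal_insert(uint32_t rec_len, int is_commit, SCN commit_scn)
{
    StampedRecord r;
    pthread_mutex_lock(&wal_insert_lock);
    if (is_commit) {
        if (commit_scn > local_scn)   /* commit also advances the clock */
            local_scn = commit_scn;
        r.xl_scn = commit_scn;        /* commit record carries commit_scn */
    } else {
        r.xl_scn = local_scn;         /* other records carry local_scn  */
    }
    r.lsn = insert_lsn;
    insert_lsn += rec_len;            /* reserve the LSN slot */
    pthread_mutex_unlock(&wal_insert_lock);
    return r;
}

/* BOC piggyback: an externally received SCN advances local_scn under
 * the same lock, so it cannot interleave with a local stamp. */
void boc_advance(SCN remote_scn)
{
    pthread_mutex_lock(&wal_insert_lock);
    if (remote_scn > local_scn)
        local_scn = remote_scn;
    pthread_mutex_unlock(&wal_insert_lock);
}
```

Because boc_advance takes the same lock, an external SCN advance either lands entirely before or entirely after any local stamp, so no record can observe a clock value that later moves backwards relative to its LSN.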
AD-006 introduced three cluster-specific operation classes — ITL slots, TT slots, and undo records — whose changes must be written to WAL; this is the foundation of write-ahead undo correctness. pgrac registers seven new RMGRs (Resource Managers) in the PG native RMGR framework while maintaining binary compatibility with the existing ones:
| RMGR | Purpose |
|---|---|
| RM_CLUSTER_ITL_ID | ITL slot write / cleanout / commit_scn writeback |
| RM_CLUSTER_TT_ID | TT slot alloc / commit / abort / wrap reuse |
| RM_CLUSTER_UNDO_ID | Undo record write / segment state change |
| RM_CLUSTER_PCM_ID | PCM lock transitions (off by default; enabled only for debug / reconfig) |
| RM_CLUSTER_BOC_ID | Broadcast on Commit SCN broadcast record |
| RM_CLUSTER_RECONFIG_ID | Reconfig phase switch / Coordinator election |
| RM_CLUSTER_GEN_ID | General cluster event extension point |
Typical OLTP transaction (5 DML statements) WAL volume comparison: PG native ~230 B / tx, pgrac ~1,280–1,300 B / tx (5.5×). The increment comes mainly from ITL write records (~54 B × 5) and undo write records (~112 B × 5); header overhead (+8 B / record) is a small fraction. On NVMe Tier 1 deployments, OLTP TPS impact is < 5%.
Every ITL (Interested Transaction List) slot write, cleanout, and commit_scn writeback must produce a corresponding WAL record, enabling recovery to precisely reconstruct the per-block transaction state:
/* ITL slot write (produced alongside the heap record during DML) */
typedef struct ItlWriteRecord {
BlockNumber target_block; /* 4 B target block address */
uint8 itl_slot_idx; /* 1 B slot index (0-based) */
uint8 info; /* 1 B CREATE / UPDATE / CLEANUP */
ClusterItlSlotData slot_data; /* 48 B complete ITL slot contents */
/* total: ~54 B */
} ItlWriteRecord;
/* ITL cleanout (reader-triggered deferred cleanup — must also go through WAL) */
typedef struct ItlCleanoutRecord {
BlockNumber target_block; /* 4 B */
uint8 itl_slot_idx; /* 1 B */
SCN commit_scn; /* 8 B commit_scn written back to the ITL slot */
/* total: ~16 B */
} ItlCleanoutRecord;
ITL cleanout is a reader-triggered dirty write (writing commit_scn back into the block header) and must go through WAL; otherwise recovery cannot distinguish block state before and after cleanout. TT slot alloc / commit / abort / reuse each have their own corresponding records, together forming the complete transaction state chain.
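The TT-side records are not spelled out in this chapter. As a hedged illustration of the "one record per state change" pattern, a TT slot record might look like the following; the struct name, field names, and sizes are all assumptions, not pgrac's actual layout:

```c
#include <stdint.h>

typedef uint64_t SCN;
typedef uint32_t TransactionId;

/* Hypothetical kinds, one per TT slot state change named in the text */
typedef enum {
    TT_REC_ALLOC,    /* slot allocated to a new transaction   */
    TT_REC_COMMIT,   /* commit_scn written into the slot      */
    TT_REC_ABORT,    /* slot marked aborted                   */
    TT_REC_REUSE     /* wrapped slot reclaimed for a new xid  */
} TtRecordKind;

typedef struct TtSlotRecord {
    uint32_t      tt_slot_idx;  /* which TT slot changed         */
    uint8_t       kind;         /* TtRecordKind                  */
    TransactionId xid;          /* owning transaction            */
    SCN           commit_scn;   /* meaningful for TT_REC_COMMIT  */
} TtSlotRecord;
```

Replaying the alloc / commit / abort / reuse records of a slot in LSN order reconstructs exactly the slot state the crashed node held, which is what "complete transaction state chain" means here.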
Before Cache Fusion transfers a dirty block, the corresponding WAL must already be fsync'd — this is the core constraint specified by feature-019 (Write-ahead global version) and the foundation of all Cache Fusion durability guarantees:
PG single-node rule:
Before dirty block is flushed to disk → corresponding WAL must be fsync'd
pgrac / Cache Fusion extended rule:
Before dirty block is transferred cross-node → corresponding WAL must be fsync'd
(including the case where the receiver continues to modify the block)
In the implementation, the Cache Fusion send path (gcs_wa.c) — after packing the block data and before calling the Interconnect to send — synchronously calls XLogFlush(pi_lsn), waiting for local WAL to be durably persisted. A batch optimization (group commit-like) allows multiple pending blocks to share a single fsync, reducing I/O count.
Write-ahead is a synchronous blocking operation: the Cache Fusion send path suspends until XLogFlush completes. If LGWR stalls and times out, reconfiguration is triggered; WAL synchronization cannot be downgraded or skipped (correctness takes priority over availability).
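The ordering constraint of the send path can be sketched in a few lines. The function names (xlog_flush, interconnect_send, send_dirty_block) are stand-ins for the real gcs_wa.c logic, and the flush is stubbed; only the flush-before-send shape is the point:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

static XLogRecPtr flushed_lsn = 0;   /* highest durably fsync'd LSN */

/* Synchronous stand-in for XLogFlush: returns only once WAL through
 * `upto` is durable. */
static void xlog_flush(XLogRecPtr upto)
{
    if (upto > flushed_lsn)
        flushed_lsn = upto;
}

static int interconnect_send(const void *block, XLogRecPtr pi_lsn)
{
    /* Write-Ahead Rule: the block's redo must already be durable
     * before the block may leave this node. */
    assert(pi_lsn <= flushed_lsn);
    (void) block;
    return 0;   /* pretend the RDMA send succeeded */
}

/* Cache Fusion send path: flush first, send second -- never the
 * other way around, and never skipped. */
int send_dirty_block(const void *block, XLogRecPtr pi_lsn)
{
    xlog_flush(pi_lsn);                  /* blocks until fsync done */
    return interconnect_send(block, pi_lsn);
}
```

The batch optimization mentioned above changes only how many blocks share one xlog_flush call, not the ordering: no block is handed to the Interconnect with pi_lsn ahead of flushed_lsn.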
WAL fsync latency is one of the primary components of Cache Fusion end-to-end latency. On NVMe shared storage a typical WAL fsync completes in under 200 μs, which still dwarfs the ~5 μs RDMA round-trip of the Interconnect. With the batch fsync optimization amortizing that cost across pending blocks, however, the overall latency bottleneck is typically not Write-Ahead but the number of GRD master coordination rounds.
Each pgrac node runs its own WAL writer (LGWR) that independently manages that node's pg_wal_node_N/ directory. A WAL buffer of 32 MB is recommended (vs the PG native default of 16 MB) to accommodate concurrent writes of more record types. The xl_scn monotonicity invariant guarantees that, within a thread, a record with a larger LSN always carries an xl_scn no smaller than any record with a smaller LSN.
After a crash, the surviving node (or the newly elected recovery coordinator) reads the WAL streams of all failed nodes and reconstructs a consistent state via K-way SCN merge replay:
/* K-way merge — min-heap ordered ascending by head_record.xl_scn */
PriorityQueue pq;
for each thread t in merge_set:
    t.head = read_next(t);
    if (t.head) pq.push(t);

while (!pq.empty()) {
    t = pq.pop_min();        /* pop the record with the globally smallest xl_scn */
    dispatch_record(t.head); /* dispatch by block hash + apply redo */
    t.head = read_next(t);
    if (t.head) pq.push(t);
}
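The pseudocode above can be made concrete with a small binary min-heap. The following self-contained C sketch merges in-memory SCN arrays standing in for the real record streams; all type and function names are illustrative, and emitting an SCN stands in for dispatch_record:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t SCN;

/* One redo stream: xl_scn values, non-decreasing within the stream
 * (the AD-008 per-thread invariant). */
typedef struct {
    const SCN *recs;
    size_t     len, pos;
} Stream;

/* Fixed-capacity min-heap of streams, keyed by the head record's xl_scn */
typedef struct {
    Stream *s[16];
    size_t  n;
} Heap;

static SCN head_scn(const Stream *t) { return t->recs[t->pos]; }

static void heap_push(Heap *h, Stream *t)
{
    size_t i = h->n++;
    while (i > 0 && head_scn(t) < head_scn(h->s[(i - 1) / 2])) {
        h->s[i] = h->s[(i - 1) / 2];     /* sift parent down */
        i = (i - 1) / 2;
    }
    h->s[i] = t;
}

static Stream *heap_pop_min(Heap *h)
{
    Stream *min = h->s[0], *last = h->s[--h->n];
    size_t i = 0;
    for (;;) {                           /* sift `last` down from the root */
        size_t c = 2 * i + 1;
        if (c >= h->n) break;
        if (c + 1 < h->n && head_scn(h->s[c + 1]) < head_scn(h->s[c])) c++;
        if (head_scn(last) <= head_scn(h->s[c])) break;
        h->s[i] = h->s[c];
        i = c;
    }
    if (h->n > 0) h->s[i] = last;
    return min;
}

/* K-way SCN merge: emit all records across streams in ascending xl_scn
 * order; returns the number of records emitted into `out`. */
size_t merge_replay(Stream *streams, size_t k, SCN *out)
{
    Heap h = { .n = 0 };
    size_t emitted = 0;
    for (size_t i = 0; i < k; i++)
        if (streams[i].pos < streams[i].len)
            heap_push(&h, &streams[i]);
    while (h.n > 0) {
        Stream *t = heap_pop_min(&h);    /* globally smallest xl_scn */
        out[emitted++] = head_scn(t);    /* stands in for dispatch + redo */
        if (++t->pos < t->len)
            heap_push(&h, t);            /* re-insert with the next record */
    }
    return emitted;
}
```

Each pop plus re-push costs O(log K), so merging N records across K streams is O(N log K); the per-thread monotonicity invariant is what lets the heap look only at each stream's head.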
Correctness depends on two guarantees: per-thread xl_scn monotonicity (WAL insert lock protocol, §12.2); and Cache Fusion serialization (cross-thread SCN monotonicity for the same block, Write-Ahead Rule, §12.4). Together these ensure that the merged order is consistent with the original write order.
Per-instance redo streams (each node = one redo thread, writing independently):
Node 1 (thread 1): ●─●─●────●────●   SCN 42, 43, 44, 50, 61
Node 2 (thread 2): ●─●─●──●─●       SCN 12, 44, 45, 55, 60
Node 3 (thread 3): ●─●              SCN  8, 47
              ↓ sort and merge by SCN ↓
Merged: ●─●─●─●─●─●─●─●─●─●─●─● (replay order)
SCN:    8 12 42 43 44 44 45 47 50 55 60 61
        (the two SCN-44 records come from different threads and touch
         different blocks, so they may replay in either order)
Apply: redo each record in order. Each record is applied only to its corresponding block;
per-thread order is preserved within each stream (guaranteed by thread_id).
Merged recovery uses K-way SCN merge in both cluster instance failure (partial node crash) and PITR scenarios. Single-machine crash recovery (only one stream) degrades to PG's native single-threaded replay; the only overhead is the new RMGR redo handlers (~+5% time).
For deeper design details and related features:
- ClusterXLogRecord / ClusterXLogPageHeader C structs; the seven new RMGR redo handler implementations; WAL insert critical section pseudocode; pg_cluster_wal_stats / pg_cluster_wal_scn_check view fields; WAL compression extension (lz4, 2.1× ratio)
- xl_scn semantics (commit_scn vs local_scn); Lamport advance protocol; BOC piggyback update path; formal proof of the per-thread monotonicity invariant
- pi_lsn and flushed_lsn; batch fsync optimization parameters
- RM_CLUSTER_RECONFIG_ID records written during phase switches; Coordinator state restored after recovery replay