The previous chapter (Ch 10) described the physical layout of per-instance undo tablespaces and the cross-node visibility path: undo records live in each instance's independent segment, and CR block construction traverses the undo chain via UBA in reverse application. This chapter goes deeper into the buffer pool layer — where undo and heap blocks ultimately reside — and how pgrac extends PG's native single-machine buffer pool with cross-node buffer coordination.
The core challenge for the pgrac buffer pool is extending PG's single-machine single-copy model into a cluster three-copy model (XCUR / SCUR / PI) without breaking PG's native hot-path performance, while maintaining global buffer coherency through the PCM lock state machine (AD-002) and Cache Fusion protocol (AD-005). AD-006 PIVOT B introduced one important simplification: CR blocks no longer occupy dedicated buffer slots — instead they are constructed on demand via the undo chain at row granularity, keeping the BufTable hash in a single BufferTag dimension.
PG's native buffer pool is a single-machine + single-version design: at most one copy (current) of each block exists in memory, all intra-node concurrency is serialized by LWLock (content_lock), there is no cross-instance coherency protocol, and no CR / PI concept. pgrac adds cross-node copy semantics on top of this, while preserving PG's BufTable hash path and pin/unpin mechanics — minimal invasiveness.
| Dimension | PG native | pgrac |
|---|---|---|
| In-memory copies per block | 1 (current) | Up to 3 (XCUR / SCUR / PI); CR is constructed on demand and takes no slot |
| Cross-node coherency | ❌ None | PCM lock state machine (N/S/X) + Cache Fusion |
| Visibility copies | Heap dead tuples + CLOG | XCUR/SCUR current + undo chain construction |
| BufferTag | RelFileLocator + ForkNumber + BlockNumber (20 B) | Unchanged (CR/PI associated via chain, not in BufTable) |
| BufferDesc size | 64 B (1 cache line) | 128 B (2 cache lines; hot fields all in first 64 B) |
| Eviction policy | Clock-sweep (single priority) | Three-pool differentiated: PI > SCUR > XCUR eviction priority (descending) |
| Cross-node block access | ❌ Not supported | Cache Fusion RDMA transfer (~5 μs Tier 1) |
| CR block | ❌ Not supported | Under AD-006 takes no buffer slot; row-level view constructed via undo chain |
Every pgrac buffer slot belongs to exactly one copy type at any moment, derived from the pcm_state and pi_flags fields; the buffer_type byte in the descriptor is only a cached snapshot of this derivation, never an independently maintained source of truth:
| Type | Meaning | Cluster uniqueness | Mapping |
|---|---|---|---|
| XCUR (Exclusive Current) | Exclusive write; unique across the cluster | At most 1 node holds it cluster-wide | pcm_state = X, has_pi = false |
| SCUR (Shared Current) | Shared read; multiple nodes may hold simultaneously | Multiple nodes coexist | pcm_state = S, has_pi = false |
| PI (Past Image) | Stale dirty copy retained after X lock relinquishment | Each node holds its own independent PI | has_pi = true, pcm_state = any |
| CR (Consistent Read) | Constructed on demand; takes no buffer slot | — | Under AD-006, constructed via undo chain (#121) |
AD-006 PIVOT B is an important simplification: CR blocks no longer occupy dedicated buffer slots the way Oracle does. pgrac's CR construction performs row-level replay against the undo chain in cluster_visibility.c (#121); the buffer pool always stores only the current version of each block. This keeps the BufTable hash dimension identical to PG native, requiring no additional hash keys for historical versions.
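A row-granularity sketch of that construction path (the function and field names here are illustrative assumptions, not the actual cluster_visibility.c API):

/* Sketch: build a row-level consistent view without a CR buffer slot. */
static HeapTuple
get_consistent_tuple(HeapTuple current, Snapshot snap)
{
    /* fast path: the current version is already visible at the snapshot SCN */
    if (tuple_scn(current) <= snap->read_scn)
        return current;
    /* slow path: apply undo records in reverse (Ch 10) until the newest
     * version with SCN <= read_scn has been reconstructed */
    return undo_rollback_to_scn(current, snap->read_scn);
}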
cluster-wide buffer state
Node 1 Node 2 Node 3
┌────────┐ ┌────────┐ ┌────────┐
│ pool │ │ pool │ │ pool │
│ │ │ │ │ │
block A: │ XCUR │ ──── X ──── │ · │ ─── X ──── │ · │ exclusive write
│ │ │ │ │ │
block B: │ SCUR │ ──── S ──── │ SCUR │ ─── S ──── │ SCUR │ shared read
│ │ │ │ │ │
block C: │ CR │ │ · │ │ CR │ constructed on demand
│ @SCN 99│ │ │ │ @SCN 99│ (no dedicated slot)
│ │ │ │ │ │
block D: │ PI │ │ XCUR │ │ PI │ stale page retained
│ @SCN 75│ │ @SCN 80│ │ @SCN 75│ (ordered by SCN)
└────────┘ └────────┘ └────────┘
C macro derivation of copy type:
typedef enum {
BCT_FREE, /* empty / freelist */
BCT_INVALID, /* has tag but content invalid, awaiting Cache Fusion fetch */
BCT_XCUR, /* pcm_state=X, has_pi=false */
BCT_SCUR, /* pcm_state=S, has_pi=false */
BCT_PI, /* has_pi=true, pcm_state=any (usually N or S) */
} BufferCopyType;
#define PI_FLAG_HAS_PI 0x01   /* has_pi bit within the uint8 pi_flags field */
#define BUFFER_COPY_TYPE(bd) \
    (((bd)->pi_flags & PI_FLAG_HAS_PI) ? BCT_PI : \
     ((bd)->pcm_state == PCM_MODE_X ? BCT_XCUR : \
      ((bd)->pcm_state == PCM_MODE_S ? BCT_SCUR : BCT_INVALID)))
Copy type is a derived view — it can be computed unambiguously from pcm_state and pi_flags at any point in time, introducing no additional field maintenance burden.
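For example, a per-pool statistics sweep could classify every slot through the macro alone; the loop below is an illustrative sketch (GetBufferDescriptor and NBuffers are PG's own, the counting itself is hypothetical):

int counts[BCT_PI + 1] = {0};

for (int i = 0; i < NBuffers; i++)
{
    BufferDesc *bd = GetBufferDescriptor(i);
    counts[BUFFER_COPY_TYPE(bd)]++;   /* no extra per-buffer state consulted */
}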
pgrac extends PG's native BufferDesc (64 B) to 128 B, appending cluster fields guarded by the USE_PGRAC_CLUSTER compile guard. This follows the same pattern as Ch 9's PageHeaderData extension and Ch 10's undo segment header extension: extend existing PG structs rather than introduce parallel structures.
/* BufferDesc — PG 16.13 measured layout (USE_PGRAC_CLUSTER mode, 128 B)
* Conceptually named ClusterBufferDesc; code retains PG's original name BufferDesc
* with compile-guard-appended fields.
*/
typedef struct BufferDesc {
/* === Cache line 1 first half: PG original fields [0, 52), HOT, compatible with PG vanilla === */
BufferTag tag; /* 20 B: RelFileLocator(12) + ForkNumber(4) + BlockNumber(4) */
int buf_id; /* 4 B */
pg_atomic_uint32 state; /* 4 B: refcount + usage_count + flags */
int wait_backend_pgprocno; /* 4 B */
int freeNext; /* 4 B */
LWLock content_lock; /* 16 B; ends at offset 52 */
/* === Cache line 1 cluster hot tail [52, 64), 12 B; hot path access === */
uint8 buffer_type; /* offset 52: BUF_TYPE_CURRENT / CR / PI (derived; redundant snapshot) */
uint8 pcm_state; /* offset 53: N / S / X */
uint8 pi_flags; /* offset 54: has_pi and related bits */
uint8 _pad; /* offset 55: 1 B padding for 8 B alignment of block_scn */
SCN block_scn; /* offset 56: 8 B; ends at 64 = cache line 1 boundary */
/* === Cache line 2 cold body [64, 128), 64 B; cluster-specific paths only === */
int cr_chain_head; /* offset 64: PIVOT B — moved here (CR construction is cold path) */
int cr_chain_next; /* offset 68 */
SCN cr_scn; /* offset 72: CR buffers only (not used for dedicated slots under AD-006) */
int pi_buf_id; /* offset 80 */
int _pad2; /* offset 84: 4 B alignment padding before the 8 B pi_lsn */
XLogRecPtr pi_lsn; /* offset 88: PI buffers only */
uint16 grd_master_node; /* offset 96 */
uint16 grd_master_seq; /* offset 98 */
uint8 cf_state; /* offset 100: Cache Fusion protocol state */
uint8 cf_owner_node; /* offset 101 */
uint16 cf_request_count; /* offset 102 */
LWLock pcm_lock; /* offset 104: accessed only during lock transition */
TimestampTz pi_created_at; /* offset 120: ends at 128 */
/* total: 128 B (BUFFERDESC_PAD_TO_SIZE = 128 in USE_PGRAC_CLUSTER mode) */
} BufferDesc;
Implementation work in v1.2 (2026-05-02) produced a critical measured finding: PG 16.13's sizeof(BufferTag) is 20 B (RelFileLocator 12 B + ForkNumber 4 B + BlockNumber 4 B), not the 16 B assumed in early design documents. PG's original fields therefore occupy offsets [0, 52), leaving the cluster hot tail only 12 B, which is not enough to hold both cr_chain_head (4 B) and block_scn (8 B) with block_scn still inside cache line 1.
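For orientation, PG 16 flattens RelFileLocator's three OIDs directly into the tag struct, which is why the 20 B hide no internal padding; reproduced here from memory, so verify against buf_internals.h:

/* PG 16 BufferTag: five 4 B fields, 20 B total, no padding. */
typedef struct buftag
{
    Oid           spcOid;      /* tablespace */
    Oid           dbOid;       /* database */
    RelFileNumber relNumber;   /* relation file number */
    ForkNumber    forkNum;     /* main / FSM / VM / init fork */
    BlockNumber   blockNum;    /* block within the fork */
} BufferTag;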
PIVOT B trade-off: block_scn is the critical field on the Stage 2–3 visibility hot path (every buffer access must compare block_scn against snapshot.read_scn) and must reside in cache line 1. cr_chain_head is only accessed during CR construction (a cold path) — it is sacrificed to free space in cache line 1, moved to the start of cache line 2 (offset 64).
hot path access pattern (cache line 1 only = first 64 B):
BufTableLookup → IncreaseRefcount → read pcm_state → read block_scn → LWLockAcquire(content_lock)
cache line 2 is never touched; identical overhead to PG native hot path (1 cache miss)
cold path (cache line 2, triggered only in new scenarios):
CR construction → access cr_chain_head / cr_chain_next / cr_scn
PI creation → access pi_buf_id / pi_lsn / pi_created_at
Cache Fusion → access cf_state / cf_owner_node / pcm_lock
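Rendered as code, the hot path might look like the sketch below; BufTableLookup, GetBufferDescriptor, and LWLockAcquire are PG's own, while pin_buffer, cache_fusion_fetch, and build_cr_tuple are assumed names:

/* Sketch: every field read on this path lives in cache line 1 ([0, 64)). */
int         buf_id = BufTableLookup(&tag, hash);   /* PG hash probe */
BufferDesc *bd = GetBufferDescriptor(buf_id);

pin_buffer(bd);                         /* refcount++ inside bd->state */
if (bd->pcm_state == PCM_MODE_N)
    cache_fusion_fetch(bd);             /* invalid copy: leaves the hot path */
if (bd->block_scn > snap->read_scn)
    build_cr_tuple(bd, snap);           /* block too new: leaves the hot path */
LWLockAcquire(&bd->content_lock, LW_SHARED);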
At compile time, five StaticAssertDecl statements lock layout invariants via semantic constraints — for example, offsetof(block_scn) + sizeof(SCN) <= 64 (block_scn within cache line 1) and offsetof(cr_chain_head) >= 64 (cr_chain_head at the start of cache line 2) — rather than hardcoded magic offset numbers. If a future PG version expands BufferTag again, the assertions fire at compile time rather than silently miscalculating.
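A sketch of what those five assertions might look like (paraphrased from the constraints named above, not copied from source):

StaticAssertDecl(sizeof(BufferTag) == 20,
                 "BufferTag must stay 20 B (PG 16.13 layout)");
StaticAssertDecl(offsetof(BufferDesc, pcm_state) < 64,
                 "pcm_state must be readable from cache line 1");
StaticAssertDecl(offsetof(BufferDesc, block_scn) + sizeof(SCN) <= 64,
                 "block_scn must end within cache line 1");
StaticAssertDecl(offsetof(BufferDesc, cr_chain_head) >= 64,
                 "CR chain fields belong to cache line 2 (cold path)");
StaticAssertDecl(sizeof(BufferDesc) == 128,
                 "BufferDesc must be exactly 2 cache lines under USE_PGRAC_CLUSTER");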
pgrac buffer pool concurrency safety is jointly guaranteed by two orthogonal and independent dimensions that cannot be merged:
Dimension 1: Pin (refcount), where refcount > 0 prevents a buffer from being evicted.
Dimension 2: PCM Lock (N/S/X), tracked by the pcm_state field in the ClusterBufferDesc hot tail (offset 53).
/* Valid combinations of the two dimensions */
/* Pin + S: backend holds buffer reference, node holds shared PCM lock, can read locally */
/* Pin + X: backend holds buffer reference, node holds exclusive PCM lock, can write locally */
/* Unpinned + X: no backend reference but node still holds X lock → cannot evict immediately (see below) */
/* Pin + N: intermediate state during PCM lock transition → rare but valid */
Critical constraint for eviction and PCM X lock: a buffer holding a PCM X lock cannot be evicted directly even when refcount = 0 (unpinned). The reason is that the PCM X lock signals to the GRD that "the master of this block is on this node" — direct eviction would desynchronize GRD state from local buffer state. The correct path is to first notify the GRD to release the X lock (pcm_release_x_lock), transition the node's pcm_state → N, flush the dirty block, then remove it from BufTable and return the buffer slot.
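A minimal sketch of that path, reusing the document's own helper names (pcm_release_x_lock, flush_to_disk); refcount_of, buffer_is_dirty, and remove_from_buftable are assumptions:

/* Sketch: safe release of an unpinned buffer that still holds a PCM X lock. */
static void
release_xlocked_victim(BufferDesc *bd)
{
    Assert(refcount_of(bd) == 0);    /* unpinned, otherwise not evictable */
    pcm_release_x_lock(bd);          /* 1. GRD gives up mastership first */
    bd->pcm_state = PCM_MODE_N;      /* 2. local state follows the GRD */
    if (buffer_is_dirty(bd))
        flush_to_disk(bd);           /* 3. persist before the slot is reused */
    remove_from_buftable(bd);        /* 4. only now drop the hash entry */
}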
Acquisition order: PCM lock and content_lock are always acquired strictly in "PCM first, then content" order to prevent deadlock (§5 of AD-002 design document provides a complete formal proof).
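As a sketch of the ordering rule (pcm_lock_acquire / pcm_lock_release are assumed wrappers around the pcm_lock LWLock at offset 104):

/* Correct order: PCM first, then content_lock; release in reverse. */
pcm_lock_acquire(bd, PCM_MODE_S);               /* cross-node coherency lock */
LWLockAcquire(&bd->content_lock, LW_SHARED);    /* then local serialization */
/* ... read the block ... */
LWLockRelease(&bd->content_lock);
pcm_lock_release(bd);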
The 9 valid PCM state transitions (from AD-002):
| Transition | Triggering scenario |
|---|---|
| N → S | Node's first read of the block |
| N → X | Node's first write to the block |
| S → X | Node upgrade: a block held in S needs to be written |
| X → S | Another node requests read; this node downgrades (retaining PI) |
| X → N | Invalidate received (no PI retained) |
| S → N | Invalidate received |
| X → S (retain PI) | has_pi = true orthogonal flag takes effect with downgrade |
| N → S (with GRD PI fast-path) | Skip disk read; fetch from PI holder |
| ITL cleanout triggers S → X | Reader performs delayed cleanout, requires brief upgrade to write |
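As an illustration of the most intricate row, a sketch of the X → S downgrade with PI retention; every helper name here is an assumption:

static void
pcm_downgrade_x_to_s(BufferDesc *bd)
{
    Assert(bd->pcm_state == PCM_MODE_X);
    if (buffer_is_dirty(bd))
    {
        /* retain the stale dirty page as a PI before another node writes */
        create_past_image(bd);       /* copy into PIPool; stamps pi_buf_id,
                                      * pi_lsn, pi_created_at */
        bd->pi_flags |= PI_FLAG_HAS_PI;
    }
    bd->pcm_state = PCM_MODE_S;      /* this node keeps read access */
    grd_notify_downgrade(bd);        /* GRD: requester may now take S or X */
}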
pgrac adapts PG's clock-sweep eviction with a three-pool differentiated scheme, giving XCUR (write-hot data) the highest retention priority, allowing PI to be evicted when necessary (with TTL protection), and placing SCUR in between.
Three-pool static partition (default; GUC-tunable):
| Pool | Default share | Size (shared_buffers = 16 GB) | Eviction priority |
|---|---|---|---|
| CurrentPool (XCUR / SCUR) | 80% (60% + the former CRPool's 20%; see note below) | 12.8 GB | Lowest (most precious) |
| PIPool | 10% | 1.6 GB | Medium (evictable after TTL 5 min) |
| Reserve | 10% | 1.6 GB | Dynamically adjusted |
CR blocks take no dedicated buffer slots (AD-006), so the CRPool (20%) from the original design no longer requires a physical pool partition after PIVOT B; that space is merged into CurrentPool for OLTP hot data.
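A postgresql.conf sketch using the GUC names this chapter mentions; the defaults mirror the table above, and the exact value syntax is an assumption:

# three-pool partition (percent of shared_buffers); Reserve absorbs the rest
cluster_pi_pool_pct = 10
# TTL before a PI becomes an eviction candidate (see below)
cluster_pi_ttl_sec = 300
# presumably a no-op after AD-006 PIVOT B (no physical CRPool remains)
cluster_cr_pool_pct = 0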
Adapted StrategyGetBuffer three-stage flow:
StrategyGetBuffer():
    /* 1. Prefer PI copies in PIPool that have exceeded their TTL (> 5 min) */
    victim = sweep_pi_pool_expired();
    if (victim)
        return victim;

    /* 2. Look in CurrentPool for an unpinned SCUR already downgraded to
     *    pcm_state = N (no cross-node coordination needed to reuse it) */
    victim = sweep_current_pool_shared();
    if (victim)
        return victim;

    /* 3. Classic clock-sweep fallback; may pick an XCUR, which first needs
     *    its PCM X lock released and its dirty content flushed */
    victim = sweep_current_pool_clock();
    if (victim->pcm_state == PCM_MODE_X)
        pcm_release_x_lock(victim);      /* GRD handoff, pcm_state -> N */
    if (victim->dirty)                   /* shorthand for BM_DIRTY in state */
        flush_to_disk(victim);
    return victim;
PI TTL and eviction: a PI buffer's pi_created_at field (offset 120, cache line 2) records the creation timestamp; after the default 5 minutes (cluster_pi_ttl_sec = 300) it is marked as an eviction candidate. PI is also cleaned up early in the following cases: the node re-acquires an X lock on the same block (the PI is then meaningless); after Phase 4 master reconstruction completes during Reconfig; or when the cluster_undo_retention_sec window closes and the associated undo data becomes invalid.
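A sketch of the TTL test that sweep_pi_pool_expired() might apply per buffer, assuming cluster_pi_ttl_sec is exposed as an int GUC in seconds (PG's TimestampDifferenceExceeds takes milliseconds):

static inline bool
pi_buffer_expired(BufferDesc *bd, TimestampTz now)
{
    return BUFFER_COPY_TYPE(bd) == BCT_PI &&
           TimestampDifferenceExceeds(bd->pi_created_at, now,
                                      cluster_pi_ttl_sec * 1000);
}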
OLTP impact: after PIVOT B the hot path reads only cache line 1 (first 64 B), adding just 1 byte of pcm_state read + branch overhead (~5 ns) over PG native. The three-pool structure gives XCUR hot data priority retention, preventing full-table scans from polluting the working set. Overall OLTP TPS impact is < 1% (design analysis conclusion; Stage 1.6 empirical validation in progress).
For deeper design details and related features:
- ClusterBufferDesc C struct, the 5 StaticAssertDecl semantic constraints, three-pool GUC parameters (cluster_cr_pool_pct / cluster_pi_pool_pct), pg_cluster_buffer_pool_stats view field definitions, and the memory budget (BufferDesc array 128 MB increment = +0.8% of shared_buffers)
- pcm_lock (the LWLock at offset 104)
- BCT_INVALID buffers triggering Cache Fusion transfer, the RDMA zero-copy path, and the cf_state field lifecycle
- The FlushBuffer path, the relationship between pi_lsn and the WAL truncation point, and how checkpoint coordinates dirty-buffer flush order across the three pools