GES (Global Enqueue Service) is pgrac's second cross-node coordination protocol, sitting alongside Cache Fusion. PCM locks govern concurrency on 8 KB blocks inside the buffer cache; GES governs every cross-node lock outside the buffer cache — row, table, transaction, object, advisory. The two share the GRD (Global Resource Directory) as their authoritative state store, but their keyspaces are fully isolated: PCM keys on BufferTag, GES keys on ClusterResId.
This chapter describes GES as it stands today in Stage 2: which lock classes go cluster-wide (only four: RELATION / TRANSACTION / OBJECT / ADVISORY — spec-2.14 L261-264), the internal layout of a GRD entry (spec-2.15), the static shard-to-master map (spec-2.14), the basic Grant / Convert / Release operations (spec-2.16), and the BAST cooperative release plus Tarjan SCC deadlock detection (spec-2.17 / 2.22). The lock classes actually enabled in Stage 2 are TX / TM / OBJECT (per spec 2.25 in the stage2-6-roadmap); the full 8-class GES matrix (adding SEQ / CF / UL / TT / IS / CI / XR / PG-specific) is Stage 5 scope (spec 5.1, Full GES 8-mode lock matrix).
This chapter uses pgrac naming (write-ahead invariant, spec-2.29; GRD entry, spec-2.15; ClusterResId, spec-2.14). Oracle-style GCS/GES bifurcation, LC Lock, and RC Lock do not exist in pgrac — see AD-011 (PG has no SGA shared pool; LC / RC Lock are not migrated; architecture-impact.md L386-387). pgrac reuses PG's own LOCKMODE 1..8 enum (NOT Oracle's NL/IS/IX/SS/SX/EX); conflict resolution calls DoLockModesConflict() from PG's lmgr/lock.c — see AD-012 (Cluster Visibility Path, architecture-impact.md L1141-1240).
PCM locks operate on BufferTag — physical blocks. But a large class of database coordination has nothing to do with the buffer cache:
- LOCKTAG_TRANSACTION) must cross nodes.
- ALTER TABLE → DML on all other nodes must block (LOCKTAG_RELATION).
- pg_advisory_lock(key) → cluster-wide mutual exclusion, not single-node (LOCKTAG_ADVISORY).
- LOCKTAG_OBJECT.

The common trait is that the lock object is a logical resource. Cramming these into PCM would dissolve PCM's invariant on physical block identity: for example, PCM's "X→S downgrade" would be misread as "row-lock downgrade," collapsing transaction isolation. The GCS / GES split is not an Oracle convention; it is a semantic necessity.
At the protocol level, pgrac GES marks only four LockTag classes as cluster-aware (spec-2.14 L261-264); all others return false from cluster_grd_locktag_is_cluster_aware() and follow the single-instance path.
| PG LockTag type | Cluster-aware? | Enabled in Stage 2? | Notes |
|---|---|---|---|
| LOCKTAG_RELATION | Yes | Yes (TM) | Table / index locks; DDL × DML coordination |
| LOCKTAG_TRANSACTION | Yes | Yes (TX) | Cross-node transaction visibility + row-lock wait |
| LOCKTAG_OBJECT | Yes | Yes (OBJECT) | Extension-defined global objects |
| LOCKTAG_ADVISORY | Yes | No (Stage 5) | pg_advisory_lock |
| LOCKTAG_TUPLE / SPECULATIVE / VIRTUAL / OBJECT-CLUSTER_* | No | n/a | Local or non-conflict path |
The Stage 2 enabled subset is TX / TM / OBJECT, locked by spec 2.25 in the stage2-6-roadmap. LOCKTAG_ADVISORY is cluster-aware at the protocol layer but its actual enablement is deferred to Stage 5 (spec 5.1, Full GES 8-mode lock matrix). Other traditional Oracle concepts (LC / RC Lock / SEQ / CF / UL / TT / IS / CI / XR / PG-specific) do not exist in pgrac at the current stage — see AD-011.
Cross-node messages cannot just memcpy a PG LOCKTAG — LOCKTAG carries padding, node-specific fields, and no cross-compiler layout guarantee. spec-2.14 introduces ClusterResId: 16 bytes, fixed field order, wire-safe canonical identity.
typedef struct ClusterResId {
    uint32 field1;        // usually = LOCKTAG.locktag_field1 (dboid / relid / xid)
    uint32 field2;        // usually = LOCKTAG.locktag_field2
    uint32 field3;        // usually = LOCKTAG.locktag_field3
    uint16 field4;        // tuple offset etc.; excluded from shard hash
    uint8  type;          // one of PG LockTagType
    uint8  lockmethodid;  // PG lockmethod index
} ClusterResId;           // exactly 16 bytes (spec-2.14 L199-217)
Key point: ClusterResId is not a memcpy of LOCKTAG. It is an independent structure derived canonically from LOCKTAG by cluster_grd_resid_from_locktag() at the cross-node boundary, so the wire layer never depends on PG's internal layout.
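The derivation can be sketched as a field-by-field copy followed by an explicit byte encoding. This is a minimal illustration under stated assumptions, not pgrac's code: `SketchLockTag`, `SketchResId`, `sketch_resid_from_locktag()` and `sketch_resid_encode()` are hypothetical names, and the little-endian wire order is an assumption made here only so that the output is deterministic.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical simplified mirror of the LOCKTAG fields named in the text. */
typedef struct SketchLockTag {
    uint32_t locktag_field1;
    uint32_t locktag_field2;
    uint32_t locktag_field3;
    uint16_t locktag_field4;
    uint8_t  locktag_type;
    uint8_t  locktag_lockmethodid;
} SketchLockTag;

/* Same field order as the ClusterResId layout shown above. */
typedef struct SketchResId {
    uint32_t field1;
    uint32_t field2;
    uint32_t field3;
    uint16_t field4;
    uint8_t  type;
    uint8_t  lockmethodid;
} SketchResId;

_Static_assert(sizeof(SketchResId) == 16, "resid must be exactly 16 bytes");

/* Derive the resid one field at a time; the source struct is never memcpy'd. */
SketchResId sketch_resid_from_locktag(const SketchLockTag *tag)
{
    SketchResId r;

    assert(tag != NULL);
    memset(&r, 0, sizeof r);
    r.field1 = tag->locktag_field1;
    r.field2 = tag->locktag_field2;
    r.field3 = tag->locktag_field3;
    r.field4 = tag->locktag_field4;
    r.type = tag->locktag_type;
    r.lockmethodid = tag->locktag_lockmethodid;
    return r;
}

static void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
    p[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Assumed wire form: fixed field order, little-endian integers. */
void sketch_resid_encode(const SketchResId *r, uint8_t out[16])
{
    assert(r != NULL && out != NULL);
    put_le32(out + 0, r->field1);
    put_le32(out + 4, r->field2);
    put_le32(out + 8, r->field3);
    out[12] = (uint8_t)(r->field4 & 0xff);
    out[13] = (uint8_t)(r->field4 >> 8);
    out[14] = r->type;
    out[15] = r->lockmethodid;
}
```

Encoding byte by byte means the receiver never depends on the sender's compiler, padding, or endianness, which is the property the text attributes to ClusterResId.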
GRD uses 4096 fixed shards (PGRAC_GRD_SHARD_COUNT = 4096, spec-2.14 L196). Each ClusterResId hashes to a single shard, and the shard maps statically to one master node:
// shard_id (spec-2.14 L290-296, Q7 L70)
// hash input is the **first 14 bytes** of ClusterResId — field4 (tuple offset) is skipped
// so different tuples of the same row land in the same shard, aggregating row contention
shard_id = hash_bytes_extended(&resid, 14) % 4096;
// shard → master (spec-2.14 L307-316)
// declared_list = sorted(node_ids declared in cluster.conf)
// note: modulo is taken against len(declared_list), NOT cluster_node_id,
// because cluster.conf permits sparse node_ids (e.g. 1, 3, 7)
master[shard_id] = declared_list[shard_id % len(declared_list)];
Implementation: master[4096] is an array of pg_atomic_uint32 in the shmem region "pgrac cluster grd" (spec-2.14 L238). Routing lookups are lock-free — a single atomic load yields the master node_id.
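The two-step routing above can be sketched as follows. This is an illustration only: FNV-1a stands in for PG's `hash_bytes_extended()`, the `Route*` and `route_*` names are hypothetical, and the sketch feeds the 14 hashed bytes field by field (everything except field4) rather than assuming a particular in-memory byte layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROUTE_SHARD_COUNT 4096u

typedef struct RouteResId {
    uint32_t field1;
    uint32_t field2;
    uint32_t field3;
    uint16_t field4;   /* never fed to the hash */
    uint8_t  type;
    uint8_t  lockmethodid;
} RouteResId;

/* FNV-1a over the low nbytes of v (stand-in hash, not PG's). */
static uint64_t fnv1a_step(uint64_t h, uint32_t v, int nbytes)
{
    for (int i = 0; i < nbytes; i++) {
        h ^= (uint8_t)(v >> (8 * i));
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Hash 4 + 4 + 4 + 1 + 1 = 14 bytes; field4 is skipped on purpose. */
uint32_t route_shard_id(const RouteResId *r)
{
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV-1a offset basis */

    assert(r != NULL);
    h = fnv1a_step(h, r->field1, 4);
    h = fnv1a_step(h, r->field2, 4);
    h = fnv1a_step(h, r->field3, 4);
    h = fnv1a_step(h, r->type, 1);
    h = fnv1a_step(h, r->lockmethodid, 1);
    return (uint32_t)(h % ROUTE_SHARD_COUNT);
}

/* Index into the sorted declared list, so sparse node_ids still work. */
uint32_t route_master_for_shard(uint32_t shard_id,
                                const uint32_t *declared_sorted, size_t n)
{
    assert(declared_sorted != NULL && n > 0);
    assert(shard_id < ROUTE_SHARD_COUNT);
    return declared_sorted[shard_id % n];
}
```

With a declared list of {1, 3, 7}, shards 0, 1, 2 map to nodes 1, 3, 7 and shard 3 wraps back to node 1. Taking the modulo against node_id values instead would route to nodes that do not exist.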
The Stage 2 master map is statically declared — initialized once by cluster_grd_master_map_init() at postmaster startup and never modified. There is no advertise_master or transfer_ownership operation at runtime. Dynamic remastering (DRM) is Stage 6 scope (spec-2.14 L126); it requires explicit protocol hand-off and epoch coordination, which Stage 2 deliberately omits.
pgrac GES introduces no new lock-mode enum. It reuses PG's LOCKMODE 1..8 (AccessShareLock .. AccessExclusiveLock) directly, and conflict resolution calls PG's own DoLockModesConflict() from lmgr/lock.c (spec-2.16 Q2). This decision is AD-012 Cluster Visibility Path (architecture-impact.md L1141-1240): the single-instance PG lock-compatibility matrix is extended unchanged to cluster scope, so existing PG applications experience no surprising behavior differences when migrating to pgrac.
| LOCKMODE | Mode name | Typical trigger | Main conflicts |
|---|---|---|---|
| 1 | AccessShareLock | SELECT | AccessExclusive |
| 2 | RowShareLock | SELECT FOR UPDATE | Exclusive + |
| 3 | RowExclusiveLock | INSERT / UPDATE / DELETE | Share + |
| 4 | ShareUpdateExclusiveLock | VACUUM / ANALYZE | same level + |
| 5 | ShareLock | CREATE INDEX | RowExclusive + (except ShareLock itself) |
| 6 | ShareRowExclusiveLock | CREATE TRIGGER | RowExclusive + |
| 7 | ExclusiveLock | REFRESH MATERIALIZED VIEW CONCURRENTLY | RowShare + |
| 8 | AccessExclusiveLock | ALTER TABLE / DROP | All modes |
The local fast path still uses PG's native LOCALLOCK cache: when the node already holds a compatible mode, no network message is produced and latency is identical to single-instance PG. Only when there is a LOCALLOCK miss and the LockTag belongs to one of the four cluster-aware classes does GES enter the cross-node protocol.
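The compatibility matrix that GES inherits can be written out as a bitmask table. The sketch below is a standalone illustration of PG's standard lock-conflict rules, not a copy of lmgr/lock.c; `sk_conflicts[]` and `sk_modes_conflict()` are hypothetical names standing in for PG's conflict table and DoLockModesConflict().

```c
#include <assert.h>
#include <stdbool.h>

#define SK_BIT(m) (1u << (m))

/* sk_conflicts[m] = set of LOCKMODEs (1..8) that conflict with mode m. */
static const unsigned sk_conflicts[9] = {
    0,
    /* 1 AccessShare          */ SK_BIT(8),
    /* 2 RowShare             */ SK_BIT(7) | SK_BIT(8),
    /* 3 RowExclusive         */ SK_BIT(5) | SK_BIT(6) | SK_BIT(7) | SK_BIT(8),
    /* 4 ShareUpdateExclusive */ SK_BIT(4) | SK_BIT(5) | SK_BIT(6) | SK_BIT(7) |
                                 SK_BIT(8),
    /* 5 Share                */ SK_BIT(3) | SK_BIT(4) | SK_BIT(6) | SK_BIT(7) |
                                 SK_BIT(8),
    /* 6 ShareRowExclusive    */ SK_BIT(3) | SK_BIT(4) | SK_BIT(5) | SK_BIT(6) |
                                 SK_BIT(7) | SK_BIT(8),
    /* 7 Exclusive            */ SK_BIT(2) | SK_BIT(3) | SK_BIT(4) | SK_BIT(5) |
                                 SK_BIT(6) | SK_BIT(7) | SK_BIT(8),
    /* 8 AccessExclusive      */ SK_BIT(1) | SK_BIT(2) | SK_BIT(3) | SK_BIT(4) |
                                 SK_BIT(5) | SK_BIT(6) | SK_BIT(7) | SK_BIT(8)
};

/* True when a holder in mode `held` blocks a request for mode `wanted`. */
bool sk_modes_conflict(int held, int wanted)
{
    assert(held >= 1 && held <= 8);
    assert(wanted >= 1 && wanted <= 8);
    return (sk_conflicts[held] & SK_BIT(wanted)) != 0;
}
```

Two details are easy to miss in the summary table: ShareLock (5) is compatible with itself, while ShareUpdateExclusiveLock (4) and ShareRowExclusiveLock (6) are self-conflicting. The relation is symmetric, so a master can test a request against each holder in either order.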
The GRD is the authoritative state store for GES. Each shard is a dshash bucket, and each record within is a GrdEntry (spec-2.15 L332-344) — file-static, opaque, and not exposed by field; all access goes through cluster_grd_* APIs.
typedef struct GrdEntry {              /* spec-2.15 L332-344 */
    ClusterResId      resid;           // 16-byte key
    slock_t           lock;            // entry spinlock
    int               ngranted;        // 0 .. MAX_HOLDERS
    ClusterGrdHolder  holders[16];     // {node_id, mode, xid}
    int               nwaiters;        // 0 .. MAX_WAITERS
    ClusterGrdWaiter  waiters[16];     // {node_id, mode, wait_start}
    int               nconverts;       // 0 .. MAX_CONVERTS
    ClusterGrdConvert converts[8];     // {node_id, current_mode, requested_mode}
    uint64            last_modified_scn;
    uint32            state_flags;
} GrdEntry;
| Constant | Value | Meaning |
|---|---|---|
| PGRAC_GRD_MAX_HOLDERS | 16 | Max concurrent holders per resource |
| PGRAC_GRD_MAX_WAITERS | 16 | Max wait-queue depth |
| PGRAC_GRD_MAX_CONVERTS | 8 | Max in-flight conversion requests |
These caps are hard limits in Stage 2 (spec-2.15 L306-308); overflow returns FULL, and callers must retry or escalate.
The central entry point is cluster_grd_entry_lookup_or_create(resid, create, out), returning a 5-value enum (spec-2.15 L209-216, L417-498):
| Return | Value | Meaning |
|---|---|---|
| OK | 0 | Hit or newly created; out holds entry reference |
| NOT_READY | 1 | Shard not yet initialized (during reconfig) |
| NOT_FOUND | 2 | create=false and entry absent |
| FULL | 3 | Shard capacity exhausted |
| ERROR | 4 | Invariant violation / internal error |
Every cross-node grant / convert / release / BAST path funnels through this entry point, so resid → entry resolution is idempotent and atomically protected (the entry's slock_t lock guards field reads / writes).
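The five return values can be exercised with a toy shard. This is a sketch under stated assumptions: the real shard is a dshash bucket guarded by locks, whereas `ToyShard` is a hypothetical fixed array with a deliberately tiny capacity, and `toy_lookup_or_create()` only mirrors the result codes described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef enum ToyLookupResult {
    TOY_OK = 0,
    TOY_NOT_READY = 1,
    TOY_NOT_FOUND = 2,
    TOY_FULL = 3,
    TOY_ERROR = 4
} ToyLookupResult;

#define TOY_SHARD_CAPACITY 4   /* hypothetical; tiny for illustration */

typedef struct ToyEntry {
    uint8_t key[16];           /* encoded resid */
    bool    used;
} ToyEntry;

typedef struct ToyShard {
    bool     ready;            /* false while the shard is being rebuilt */
    ToyEntry slots[TOY_SHARD_CAPACITY];
} ToyShard;

ToyLookupResult toy_lookup_or_create(ToyShard *s, const uint8_t key[16],
                                     bool create, ToyEntry **out)
{
    ToyEntry *free_slot = NULL;

    assert(out != NULL);
    *out = NULL;
    if (s == NULL || key == NULL)
        return TOY_ERROR;
    if (!s->ready)
        return TOY_NOT_READY;

    for (int i = 0; i < TOY_SHARD_CAPACITY; i++) {
        ToyEntry *e = &s->slots[i];

        if (e->used && memcmp(e->key, key, 16) == 0) {
            *out = e;          /* hit: repeated lookups return the same entry */
            return TOY_OK;
        }
        if (!e->used && free_slot == NULL)
            free_slot = e;
    }
    if (!create)
        return TOY_NOT_FOUND;
    if (free_slot == NULL)
        return TOY_FULL;       /* caller must retry or escalate */

    memcpy(free_slot->key, key, 16);
    free_slot->used = true;
    *out = free_slot;
    return TOY_OK;
}
```

Calling the function twice with the same key and create=true yields the same entry both times, which is the idempotence the text relies on.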
BAST stands for Blocking AST (AST = Asynchronous System Trap, ges-lock-protocol-design.md L91). BAST is the key design that distinguishes GES from traditional blocking locks — but the more important point is that BAST is a notification, not an authorization (spec-2.17 Invariant I63, L304): BAST informs the holder that "someone wants this lock," yet the holder keeps it until its transaction ends naturally. The master must not grant the lock to the requester just because it sent a BAST; the grant waits for an actual RELEASE.
- GES_REQUEST (opcode 1) or GES_CONVERT (opcode 2) conflicts with an existing entry in holders[]. The master sends GES_BAST (opcode 4) to that holder.
- CLUSTER_GRD_PENDING_BAST_RECEIVED and calls SetProcSignal(PROCSIG_CLUSTER_GES_BAST).
- cluster_grd_bast_handler(), which only sets a bast_pending flag; it does not downgrade or release immediately.
- LockReleaseAll, the natural GES_RELEASE envelope (opcode 3) carries a logical BAST_ACK flag.
- holders[], re-evaluates waiters[] / converts[], and sends GES_GRANT to the next compatible waiter.

Node 1 (requester)   Node 2 (master)      Node 3 (current holder)
     |                    |                    |
     |--- 1. REQUEST ---->|                    |
     |                    |--- 2. BAST ------->| ProcessInterrupts
     |                    |                    | sets bast_pending
     |                    |                    | (continues txn)
     |                    |                    |
     |                    |<-- 3. RELEASE -----| commit / rollback
     |                    |    (BAST_ACK)      |
     |<-- 4. GRANT -------|                    |
A BAST timeout does NOT kill a healthy holder (Invariant I64, spec-2.17). If the holder's transaction is genuinely long-running, the master will not force a release even after a BAST has gone unanswered for a long time — preserving transaction semantics on a healthy node against accidental protocol-layer disruption. "BAST unanswered for too long" feeds the LMD (Lock Manager Daemon) deadlock detector as a signal, not as a preemption trigger.
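The "notification, not authorization" rule can be made concrete with a toy master-side state machine. This is a sketch under heavy simplification: one holder, one waiter, every pair of modes treated as conflicting, and all `BastToy*` / `bast_toy_*` names are hypothetical. It only demonstrates that a grant follows a RELEASE and never a BAST or a BAST timeout.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum BastAction {
    ACT_NONE,
    ACT_SEND_GRANT,
    ACT_SEND_BAST
} BastAction;

typedef struct BastToyEntry {
    int  holder_node;   /* -1 = resource is free */
    int  waiter_node;   /* -1 = nobody waiting (toy: at most one waiter) */
    bool bast_sent;
} BastToyEntry;

void bast_toy_init(BastToyEntry *e)
{
    assert(e);
    e->holder_node = -1;
    e->waiter_node = -1;
    e->bast_sent = false;
}

/* REQUEST: grant if free; otherwise queue the requester and notify the holder. */
BastAction bast_toy_on_request(BastToyEntry *e, int node, int *target)
{
    assert(e && target);
    if (e->holder_node < 0) {
        e->holder_node = node;
        *target = node;
        return ACT_SEND_GRANT;
    }
    assert(e->waiter_node < 0);
    e->waiter_node = node;
    e->bast_sent = true;
    *target = e->holder_node;      /* the BAST goes to the holder */
    return ACT_SEND_BAST;          /* holders[] is left unchanged */
}

/* BAST timeout: a healthy holder keeps the lock; nothing is granted. */
BastAction bast_toy_on_bast_timeout(BastToyEntry *e, int *target)
{
    assert(e && target);
    (void) e;
    *target = -1;
    return ACT_NONE;
}

/* RELEASE: only now may the queued waiter be promoted and granted. */
BastAction bast_toy_on_release(BastToyEntry *e, int node, int *target)
{
    assert(e && target);
    assert(e->holder_node == node);
    e->holder_node = -1;
    e->bast_sent = false;
    if (e->waiter_node >= 0) {
        e->holder_node = e->waiter_node;
        e->waiter_node = -1;
        *target = e->holder_node;
        return ACT_SEND_GRANT;
    }
    *target = -1;
    return ACT_NONE;
}
```

In the diagram's terms, node 3 holds the resource, node 1's REQUEST produces a BAST to node 3, and node 1 receives its GRANT only after node 3's RELEASE is processed.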
Cross-node ProcSignals must guard against delivering a signal to an already-exited backend, because a procno may have been reused by a fresh backend. spec-2.17 L170 Q7 defines a seven-field identity tuple for BAST delivery; every field is required:
{ target_node_id, target_procno, target_generation,
request_seq, resid, mode, epoch }
target_generation is the procno-reuse counter, and epoch is the cluster epoch (spec-2.29, monotonically advanced across reconfiguration). Any mismatch → the BAST is discarded at the IC handler stage with a log entry and never reaches ProcessInterrupts.
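The discard rule can be sketched as a pure predicate evaluated before any signal is raised. The field names follow the tuple above; the field widths, the `LocalSlot` view of the target backend, and the function name `bast_should_deliver()` are assumptions made for illustration. The sketch checks only the fields for which it has local state to compare against.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Field widths are assumptions; only the field names come from the text. */
typedef struct BastIdentity {
    uint32_t target_node_id;
    uint32_t target_procno;
    uint64_t target_generation;  /* procno-reuse counter */
    uint64_t request_seq;
    uint8_t  resid[16];
    uint8_t  mode;               /* PG LOCKMODE, 1..8 */
    uint64_t epoch;              /* cluster epoch */
} BastIdentity;

/* Hypothetical view of the local backend slot the BAST is addressed to. */
typedef struct LocalSlot {
    uint32_t node_id;
    uint32_t procno;
    uint64_t generation;
    bool     in_use;
} LocalSlot;

/* Returns false when the BAST must be dropped (and logged) by the IC handler. */
bool bast_should_deliver(const BastIdentity *msg, const LocalSlot *slot,
                         uint64_t current_epoch)
{
    assert(msg != NULL && slot != NULL);
    if (!slot->in_use)
        return false;            /* backend already exited */
    if (msg->target_node_id != slot->node_id)
        return false;
    if (msg->target_procno != slot->procno)
        return false;
    if (msg->target_generation != slot->generation)
        return false;            /* procno was reused by a newer backend */
    if (msg->epoch != current_epoch)
        return false;            /* message predates a reconfiguration */
    if (msg->mode < 1 || msg->mode > 8)
        return false;
    return true;
}
```

Checking generation and epoch at this stage means a stale BAST never sets a pending flag on an unrelated backend.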
| Path | Target latency | Notes |
|---|---|---|
| Conflict-free grant (Tier 1 RDMA) | ~5 μs | REQUEST → immediate GRANT, one round-trip |
| Conflict, holder about to commit | 10–20 μs | REQUEST → BAST → commit-time RELEASE → GRANT |
| Conflict, holder running long | Holder-bound | BAST is non-preemptive; possibly seconds |
Because BAST is cooperative, GES must independently detect circular waits. spec-2.22 assigns this to the LMD (Lock Manager Daemon) — every tick, it runs Tarjan's SCC algorithm (strongly connected components decomposition). It is a textbook algorithm, but pgrac's implementation departs from the textbook in two ways: an iterative version (non-recursive, avoiding stack overflow) and a snapshot decoupling (Tarjan never runs while holding the graph lock).
The cross-node wait-for graph lives in its own shmem region "pgrac cluster lmd graph" (spec-2.22) — deliberately separate from LMD daemon process-local state:
- cluster_lmd_graph_add_edge() meaning "I (procno X on node A) am waiting on holder (procno Y on node B)."

The price of this snapshot-based design is that the graph may be slightly stale, but the Tarjan output is always a valid SCC for some moment of the wait-for graph: if the deadlock actually resolves, the next tick observes it naturally.
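The two departures from the textbook, an explicit DFS stack and a private snapshot, can be sketched together. This is an illustrative implementation of iterative Tarjan over an edge list, not pgrac's LMD code: `TjGraph`, `tj_scc()` and the size limits are hypothetical, and the struct copy stands in for "copy under the graph lock, then release it". Scanning the edge list per node costs O(V·E), which is acceptable for a sketch but not for a large graph.

```c
#include <assert.h>

#define TJ_MAX_NODES 64
#define TJ_MAX_EDGES 256

typedef struct TjGraph {
    int n;                      /* nodes are 0 .. n-1 */
    int nedges;
    int from[TJ_MAX_EDGES];     /* waiter */
    int to[TJ_MAX_EDGES];       /* holder being waited on */
} TjGraph;

/* Writes each node's SCC id into comp[] and returns the number of SCCs. */
int tj_scc(const TjGraph *live, int comp[TJ_MAX_NODES])
{
    TjGraph g = *live;          /* snapshot: Tarjan never touches the live graph */
    int index[TJ_MAX_NODES], low[TJ_MAX_NODES], onstack[TJ_MAX_NODES];
    int stack[TJ_MAX_NODES], sp = 0;                       /* Tarjan stack */
    int callv[TJ_MAX_NODES], calle[TJ_MAX_NODES], cp = 0;  /* explicit DFS frames */
    int next_index = 0, ncomp = 0;

    assert(g.n >= 0 && g.n <= TJ_MAX_NODES);
    assert(g.nedges >= 0 && g.nedges <= TJ_MAX_EDGES);
    for (int e = 0; e < g.nedges; e++)
        assert(g.from[e] >= 0 && g.from[e] < g.n && g.to[e] >= 0 && g.to[e] < g.n);
    for (int v = 0; v < g.n; v++) {
        index[v] = -1;
        onstack[v] = 0;
        comp[v] = -1;
    }

    for (int root = 0; root < g.n; root++) {
        if (index[root] != -1)
            continue;
        index[root] = low[root] = next_index++;
        stack[sp++] = root;
        onstack[root] = 1;
        callv[cp] = root;
        calle[cp] = 0;
        cp++;

        while (cp > 0) {
            int v = callv[cp - 1];
            int descended = 0;

            /* Resume scanning edges where this frame left off. */
            while (calle[cp - 1] < g.nedges) {
                int e = calle[cp - 1]++;
                int w;

                if (g.from[e] != v)
                    continue;
                w = g.to[e];
                if (index[w] == -1) {           /* tree edge: push a new frame */
                    index[w] = low[w] = next_index++;
                    stack[sp++] = w;
                    onstack[w] = 1;
                    callv[cp] = w;
                    calle[cp] = 0;
                    cp++;
                    descended = 1;
                    break;
                }
                if (onstack[w] && index[w] < low[v])
                    low[v] = index[w];          /* back edge into the current SCC */
            }
            if (descended)
                continue;

            if (low[v] == index[v]) {           /* v is the root of an SCC */
                int w;
                do {
                    w = stack[--sp];
                    onstack[w] = 0;
                    comp[w] = ncomp;
                } while (w != v);
                ncomp++;
            }
            cp--;                               /* "return" to the parent frame */
            if (cp > 0 && low[v] < low[callv[cp - 1]])
                low[callv[cp - 1]] = low[v];
        }
    }
    return ncomp;
}
```

For the wait-for edges 0→1, 1→2, 2→0, 2→3, the sketch reports two components: {0, 1, 2} is a cycle and therefore a deadlock candidate, while node 3 is a singleton and is not waiting on anyone in the cycle.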
Once Tarjan identifies an SCC (a cycle), one backend must be selected as the victim to break the cycle. The selection key follows spec-2.17 L186 Q16 / spec-2.20 Q6:
victim_key = ( cluster_epoch,           // high bit: oldest epoch first (stale cycles)
               local_start_ts_ms DESC,  // primary: youngest backend is chosen (lowest rollback cost)
               node_id,                 // tie-break 1
               xid )                    // tie-break 2
"Youngest" reflects the heuristic that young transactions have done the least work and are cheapest to roll back. PG's single-instance deadlock detector has no such ordering: it aborts the backend whose own deadlock_timeout check finds the cycle. A cluster-wide detector that picks victims on behalf of other backends needs an explicit, deterministic key instead.
When the victim is local to the LMD's node, LMD delivers a ProcSignal PROCSIG_CLUSTER_GES_CANCEL (spec-2.17 Q9) to the target backend. The backend observes the pending signal in ProcessInterrupts and raises ereport(ERROR, "deadlock detected") with SQLSTATE 40P01 (ERRCODE_T_R_DEADLOCK_DETECTED, PG's standard deadlock SQLSTATE).
Cross-node victim cancel forwarding (when the victim is on a remote node) belongs to spec-2.24 scope; Stage 2 today supports only local-victim direct cancellation. Multi-node victim delivery needs reliable cross-node cancel forwarding plus acknowledgement, which spec-2.24, a later Stage 2 spec, introduces.
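The victim selection key described earlier in this section can be sketched as a comparator over the members of one SCC. The struct, the field widths, and the ascending direction of the two tie-breaks are assumptions; only the ordering of the key components comes from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct VictimCandidate {
    uint64_t cluster_epoch;
    int64_t  local_start_ts_ms;
    uint32_t node_id;
    uint32_t xid;
} VictimCandidate;

/* Returns < 0 when a should be chosen as victim before b. */
int victim_cmp(const VictimCandidate *a, const VictimCandidate *b)
{
    assert(a != NULL && b != NULL);
    if (a->cluster_epoch != b->cluster_epoch)       /* oldest epoch first */
        return a->cluster_epoch < b->cluster_epoch ? -1 : 1;
    if (a->local_start_ts_ms != b->local_start_ts_ms)   /* youngest first (DESC) */
        return a->local_start_ts_ms > b->local_start_ts_ms ? -1 : 1;
    if (a->node_id != b->node_id)                   /* tie-break 1, assumed ascending */
        return a->node_id < b->node_id ? -1 : 1;
    if (a->xid != b->xid)                           /* tie-break 2, assumed ascending */
        return a->xid < b->xid ? -1 : 1;
    return 0;
}

/* Picks the single victim for an SCC with n >= 1 members. */
const VictimCandidate *victim_pick(const VictimCandidate *scc, size_t n)
{
    const VictimCandidate *best;

    assert(scc != NULL && n > 0);
    best = &scc[0];
    for (size_t i = 1; i < n; i++)
        if (victim_cmp(&scc[i], best) < 0)
            best = &scc[i];
    return best;
}
```

Because the comparison is a total order on the key, any node evaluating the same SCC membership arrives at the same victim.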
The core opcodes in GesRequestOpcode are listed below. The first three are the Stage 2 MVP basics; the rest are auxiliary paths.
| Opcode | Name | Spec | Purpose |
|---|---|---|---|
| 1 | REQUEST | spec-2.16 | First-time acquire (backend currently holds nothing on this resource) |
| 2 | CONVERT | spec-2.16 | Upgrade lock mode (already holds current_mode, asks for requested_mode) |
| 3 | RELEASE | spec-2.16 | Release on commit / rollback / explicit LockRelease |
| 4 | BAST | spec-2.17 | Master notifies holder "someone needs this lock" |
| 5 | BAST_ACK | spec-2.17 | Logical flag piggy-backed on RELEASE |
| 6 | DEADLOCK_PROBE | spec-2.22 | LMD probes cross-node wait-for edges |
| 7 | CANCEL_PENDING | spec-2.22 | Cancel pending request (local victim path) |
| 8 | DEADLOCK_REPORT | spec-2.22 | LMD reports the detected SCC |
REQUEST and CONVERT both eventually place the backend into holders[], but their starting points differ:
- waiters[]; if not, it goes directly into holders[].
- converts[] queue (up to 8 entries); once the conflicting holders release, the master rewrites the mode field of that entry in holders[].

This distinction is particularly important for deadlock detection: convert-wait and first-time-grant-wait are different edge types in the wait-for graph, and LMD must handle them separately.
spec-2.16 Q3 / spec-2.18 lock the rule: only the shard's master node may rewrite holders[] / waiters[] / converts[]. Non-master nodes forward every request to the master via IC; the master serializes them through the LMS (Lock Manager Server) inbound work_queue. Combined with the static master map from §3.3.1, this rule guarantees a single arbiter per resource cluster-wide.
pgrac's global correctness rests on the write-ahead invariant (spec-2.29): any node writing to shared storage must already hold the corresponding GES X lock, and the master must already have that holder recorded in the GRD. This is the joint write-correctness foundation for Cache Fusion and GES — violating it causes lost updates or dirty reads. Stage 2 enforces it through two paths:
- cluster_qvotec_in_quorum() and validates holder state before writing WAL (see Chapter 5).
- PROCSIG_CLUSTER_FREEZE_WRITES (see Chapter 5 §5.5.1).

This chapter covers only the acquire / release semantics of GES locks themselves; the end-to-end enforcement of the write-ahead invariant is covered in Chapter 5.
For deeper protocol detail, see:
Chapter 4 — SCN describes how every cross-node message (including GES envelopes) piggy-backs the current SCN to preserve causal ordering and consistent reads. Chapter 5 — Reconfiguration describes how GRD shards / the master[] map are rebuilt, holders re-published, and orphan locks reclaimed when cluster topology changes.