pgrac layers a complete cluster resident-process system on top of PostgreSQL's process model. The postmaster fork mechanism remains unchanged; the BackendType enum gains 16 new values appended to the existing set. At steady state, each node of a 4-node primary cluster runs roughly 25–26 processes: about 10 PG-native processes (several with targeted adjustments) plus 15 pgrac daemon processes (with the default of 4 LMS workers; the on-demand recovery processes and MRP are not counted).
Understanding the process tree is the first step in diagnosing cluster problems: the backend_type column in pg_stat_activity maps directly to the BackendType enum, and the names visible in ps aux output — postgres: lms 0, postgres: lmd, postgres: cssd, and so on — correspond one-to-one with the design documents. This chapter builds a comprehensive mental model of the process tree: what each process is, why it exists as a separate process, the order in which processes start and shut down, and the mechanisms they use to communicate. Internal protocol details (GES message encoding, PCM state-machine transitions, RDMA QP management) are left to deep-dive pages; this chapter establishes only the structural vocabulary.
The process roster, aux numbering, SQLSTATEs, and GUC defaults cited in this chapter come from Stage 2 spec text (spec-2.5 CSSD / spec-2.6 qvotec / spec-2.18 LMS / spec-2.19 LMD / spec-2.20 LockAcquire S1–S7, etc.). The terms LMHB (Lock Manager HeartBeat) and GRD0 daemon that appeared in earlier drafts are not pgrac concepts and are not used here — heartbeat duties live in CSSD (aux #5), and GRD state is co-maintained by LMS / LMD / LMON without a dedicated GRD daemon.
postmaster is the root of the entire process tree. After shared memory and lock structures are initialized, it forks all child processes in five sequential phases (Phase 0 through Phase 4). Child processes communicate with each other via shared memory, signals, and the inter-node Interconnect; there is no direct parent–child dependency chain between processes (postmaster is a supervisor, not an intermediate router).
postmaster
│
├── PG natives
│   ├─ walwriter (+BOC)
│   ├─ bgwriter
│   ├─ autovacuum
│   ├─ checkpointer
│   ├─ archiver
│   ├─ logical_repl
│   └─ ...
│
└── pgrac new
    ├─ LMS × N (default 4)
    ├─ LMD
    ├─ LCK
    ├─ LMON
    ├─ CSSD
    ├─ qvotec
    ├─ Interconnect Listener
    ├─ Undo Cleaner / TT GC
    ├─ DIAG / Cluster Stats
    ├─ Sinval Broadcaster
    ├─ Recovery Coord / Worker
    └─ MRP
Several processes on the "PG natives" side carry pgrac-specific adjustments (see §8.2) — they do not retain entirely stock behavior. Every process on the "pgrac new" side is wholly new, appended to the BackendType enum beyond the PG-native values, without disturbing any existing ABI.
Steady-state process count estimate: 4 LMS workers + 1 LMD + 1 LCK + 1 LMON + 1 CSSD + 1 qvotec + 1 Interconnect Listener + 1 Undo Cleaner + 1 TT GC + 1 DIAG + 1 Cluster Stats + 1 Sinval Broadcaster = 15 mandatory pgrac processes; plus approximately 10 PG-native processes, for a total of roughly 25–26 processes per node (primary, steady state).
Recovery Coordinator, Recovery Worker (dynamic count), and MRP are not counted in the steady-state figure. Recovery Coordinator / Worker are forked on demand during reconfiguration only and exit when finished; MRP starts only in standby + ADG mode. CSSD (aux #5) replaces what earlier drafts called a separate Heartbeat process: CSSD both declares per-peer tri-state (ALIVE / SUSPECTED / DEAD) and broadcasts the application-level heartbeat envelope — see §8.3 and Chapter 5 §5.3.1.
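The steady-state estimate can be reproduced with a few lines of arithmetic. This is a minimal sketch: the dictionary and function names are illustrative, the PG-native figure is the approximate value of 10 quoted above, and the 1–16 range for LMS workers follows the cluster.lms_workers GUC described in this chapter.

```python
# Steady-state pgrac process counts per node, taken from the estimate above.
PGRAC_STEADY_STATE = {
    "LMS": 4,  # default cluster.lms_workers
    "LMD": 1, "LCK": 1, "LMON": 1, "CSSD": 1, "qvotec": 1,
    "Interconnect Listener": 1, "Undo Cleaner": 1, "TT GC": 1,
    "DIAG": 1, "Cluster Stats": 1, "Sinval Broadcaster": 1,
}
PG_NATIVE_APPROX = 10  # approximate count of PG-native processes


def steady_state_processes(lms_workers: int = 4) -> int:
    """Estimate processes per node; recovery processes and MRP are excluded."""
    if not 1 <= lms_workers <= 16:
        raise ValueError("cluster.lms_workers is tunable in the range 1-16")
    counts = dict(PGRAC_STEADY_STATE, LMS=lms_workers)
    return sum(counts.values()) + PG_NATIVE_APPROX


print(steady_state_processes())    # 25
print(steady_state_processes(16))  # 37
```

With the default of four LMS workers the pgrac side contributes 15 processes, which is where the 25–26 total comes from once the approximate PG-native count is added.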
All PG-native processes are retained; several carry targeted extensions. The extension principle: if additional logic can be embedded in an existing process, no new process is created (BOC embedded in walwriter is the canonical example). Extensions take effect only in cluster mode — the single-instance code path is unchanged.
| PG-native process | Adjustment | Notes |
|---|---|---|
| postmaster | Initializes GES client at startup; registers with GRD; supervises all pgrac processes; decides restart vs. instance crash on child death | Core supervisor |
| walwriter | Embeds BOC: 100 μs flush cycle; SCN piggyback maintenance; per-thread WAL stream handling | Most significant adjustment; see note below |
| bgwriter | PCM state check before writing dirty blocks: flush only in X mode; skip if not X mode (let the PCM master coordinate) | Prevents cross-node buffer conflicts |
| checkpointer | Cross-node barrier checkpoint; triggers cluster checkpoint barrier | Related to #18 |
| archiver | Per-thread WAL archiving; thread_id-isolated archive paths; completion reported to GRD | Related to AD-009 |
| autovacuum launcher | XID wraparound calculation is aware of per-instance XID segmentation (AD-012 exception 10) | Logic unchanged; boundary-aware |
| startup | One-shot recovery only: crash recovery entry point; detects merged-recovery requirement; triggers Recovery Coordinator; exits after completion. No longer responsible for continuous standby apply | Continuous apply delegated to MRP |
| walsender | Retained, no changes | — |
| walreceiver | Per-thread receive under ADG | — |
| logger | Retained, no changes | — |
| logical rep launcher / worker | Retained, no changes | — |
walwriter-embedded BOC is the implementation host for the "BOC flush" described in the SCN chapter (§3.2). BOC fires frequently (every 100 μs) but each invocation does minimal work — a separate process would cost more than it saves. BOC is also tightly coupled to WAL flush timing (after commit, BOC advances the SCN), so embedding preserves temporal consistency. Oracle's BOC is likewise an embedded responsibility of LGWR; the design stays aligned.
Important change to the startup process: The PG-native startup process, in standby mode, continuously applies WAL until promote. That behavior is delegated to MRP in pgrac. pgrac's startup performs only the startup-time crash recovery and exits afterward — a single, narrow responsibility that simplifies fault isolation.
pgrac adds 16 categories of background process, organized into 5 subsystem groups. The table below gives each process's name, aux number, steady-state count, and one-line responsibility. Detailed design is covered in §8.6 (IPC Model) and the per-feature deep-dive pages.
| # | Subsystem | Process (aux #) | Steady-state count | One-line responsibility |
|---|---|---|---|---|
| 1 | Lock & Cache | LMS (Lock Master Service, aux #7, spec-2.18) | N=4 (default; GUC cluster.lms_workers tunable 1–16) | Handles cross-node PCM/GES remote requests, responds to buffer-ship requests, executes lock grant / revoke decisions, carries SCN piggyback. |
| 2 | Lock & Cache | LMD (Lock Manager Daemon, aux #8, spec-2.19) | 1 | Consumes the local-node wait-for graph, runs Tarjan SCC deadlock detection, selects victims and cancels them via ProcSignal. |
| 3 | Lock & Cache | LCK (Lock Process) | 1 | Holds instance-level locks (dictionary lock, cluster catalog lock), preventing LMS workers from being long-blocked by instance locks. |
| 4 | Lock & Cache | LMON (Lock Monitor) | 1 | Monitors cluster node state, coordinates Reconfiguration (spec-2.29 coordinator tick), triggers GRD rebuild and fence decisions. |
| 5 | Cluster Comms | CSSD (Cluster Sync Service Daemon, aux #5, spec-2.5) | 1 | Declares each declared peer's ALIVE / SUSPECTED / DEAD state; broadcasts the application-level heartbeat envelope every 1 s; populates pg_cluster_cssd_peers. |
| 6 | Cluster Comms | qvotec (Quorum Voting Coordinator, aux #6, spec-2.6) | 1 | Arbitrates voting-disk quorum; polls disk slots periodically; maintains the cluster_qvotec_in_quorum() predicate and lease; populates pg_cluster_quorum_state / pg_cluster_voting_disks. |
| 7 | Cluster Comms | Interconnect Listener | 1 | Listens on RDMA QP / TCP fallback port, receives messages and dispatches them to LMS / LMD / LCK worker queues. |
| 8 | Undo / TT | Undo Cleaner | 1 | Scans local instance undo segments every 30 s, reclaims RECYCLABLE space, maintains the retention window, advances the WRAP counter. |
| 9 | Undo / TT | TT GC (Transaction Table GC) | 1 | Scans TT slots every 10 s, reclaims expired slots whose commit_scn has been surpassed by the cluster-wide oldest_active_scn for reuse by new transactions. |
| 10 | Observability | DIAG | 1 | Cross-node diagnostic snapshots: detects long-waits (default 60 s) and triggers hang dumps, receives diagnostic requests from other nodes, aggregates cluster logs. |
| 11 | Observability | Cluster Stats | 1 | Samples cluster metrics every 10 s; populates the sampling views pg_stat_cluster_wait_events / pg_stat_cluster_wait_events_history (default 7-day retention). |
| 12 | Observability | Sinval Broadcaster | 1 | Batch-broadcasts local-node catcache / relcache invalidation messages to all other nodes and injects them into the peer sinval queue, maintaining catalog consistency. |
| 13 | Cluster Recovery | Recovery Coordinator | 1 (reconfig only) | Collects WAL from the dead node, coordinates k-way SCN merge, allocates Recovery Workers, coordinates PCM lock-state restoration; exits when complete. |
| 14 | Cluster Recovery | Recovery Worker | M dynamic (reconfig only) | Receives WAL segments assigned by the Coordinator, executes redo / undo apply, reports progress; exits when complete. |
| 15 | Cluster Recovery | MRP (Managed Recovery Process) | 1 (standby + ADG only) | Continuously receives the per-thread WAL stream from walreceiver, applies it centrally (aligned with Oracle's MRP model), advances apply_scn; exits on promote. |
| 16 | — | (Reserved slot) | — | Reserved enum tail position for Stage 3+ daemons to be appended without breaking ABI. |
View naming convention: pg_cluster_* (no stat_) is reserved for state / registry views — pg_cluster_nodes / pg_cluster_cssd_peers / pg_cluster_quorum_state / pg_cluster_voting_disks / pg_cluster_fence_state / pg_cluster_reconfig_state; pg_stat_cluster_* is reserved for sampling / performance views — pg_stat_cluster_wait_events / pg_stat_cluster_wait_events_history / pg_stat_cluster_workers. The naming split reflects the semantic boundary between snapshot state and cumulative statistics; operational SQL should not conflate them.
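The naming split is mechanical enough to check in tooling. The helper below is a hypothetical sketch (the function name and the returned labels are not part of pgrac) that classifies a view name by the two prefixes described above.

```python
def classify_cluster_view(name: str) -> str:
    """Classify a view name by the pgrac prefix convention.

    The longer prefix is tested first, because "pg_stat_cluster_" and
    "pg_cluster_" are distinct namespaces.
    """
    if name.startswith("pg_stat_cluster_"):
        return "sampling"  # cumulative / performance statistics
    if name.startswith("pg_cluster_"):
        return "state"     # snapshot state / registry
    return "other"         # not a pgrac cluster view


for view in ("pg_cluster_cssd_peers", "pg_stat_cluster_wait_events", "pg_stat_activity"):
    print(view, "->", classify_cluster_view(view))
```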
Sinval Broadcaster is a critical safety process: after a crash, postmaster restarts it immediately; more than 3 restarts → instance crash. catcache / relcache inconsistency is a data-correctness issue, not a performance issue — it cannot be degraded. CSSD and qvotec are likewise critical processes and follow the same fail-closed escalation path on crash.
BackendType enum extension: The new processes correspond to enum values appended to miscadmin.h (B_CLUSTER_STATS / B_CSSD / B_DIAG / B_INTERCONNECT / B_LCK / B_LMD / B_LMON / B_LMS_WORKER / B_MRP / B_QVOTEC / B_RECOVERY_COORD / B_RECOVERY_WORKER / B_SINVAL_BCAST / B_TT_GC / B_UNDO_CLEANER), appended after the existing values without altering any existing value, maintaining PG 16.13 ABI compatibility. The backend_type column in pg_stat_activity displays them automatically; no changes to the view layer are required.
LMS (Lock Manager Server, aux #7, spec-2.18 §1.4) and LMD (Lock Manager Daemon, aux #8, spec-2.19) are the two postmaster-forked daemons of the pgrac global lock subsystem (spec-2.18 Q5). They move the old "caller synchronously waits + LMON tick moonlights as grant decider" implementation to a stable shape: "caller asynchronously enqueues + dedicated daemon serially decides." Ownership is single, behavior is fail-closed, and there is no runtime fallback path.
Each daemon takes over a critical path from a temporary host:
- LMS: once lms_state transitions to LMS_READY, the corresponding branch in the LMON tick returns early; from that tick onwards, all inbound GES requests are consumed by the work_queue owned by LMS, with only one owner at any moment (D1).
- LMD: ownership of the wait-for graph and Tarjan SCC deadlock detection transfers to LMD (spec-2.19 D1).

Both migrations are single-ownership + fail-closed: if cluster.lms_enabled = on but LMS is not ready (fork failed, heartbeat timed out, shmem not mapped), caller-side LockAcquireExtended raises ereport(ERROR, '53R80', 'cluster_lms_unavailable') directly in S1 (spec-2.18 Q12); it does not fall back to the LMON tick path. LMD is symmetric: when not ready, wait-for graph writers receive 53R81 cluster_lmd_unavailable (spec-2.19 Q12). These two SQLSTATEs pair with the two PGC_POSTMASTER GUCs cluster.lms_enabled (spec-2.18 Q10) / cluster.lmd_enabled (spec-2.19 Q10); changing them requires an instance restart.
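The fail-closed rule can be illustrated with a small model. This is a sketch only: `DaemonStatus`, `ClusterError`, and `require_daemon` are hypothetical names, and the behavior when a GUC is off is not specified in this chapter, so the sketch simply lets that case pass through.

```python
from dataclasses import dataclass


class ClusterError(Exception):
    """Carries the SQLSTATE and condition name of a fail-closed error."""

    def __init__(self, sqlstate: str, condition: str):
        super().__init__(f"{sqlstate}: {condition}")
        self.sqlstate = sqlstate
        self.condition = condition


@dataclass
class DaemonStatus:
    enabled: bool  # cluster.lms_enabled / cluster.lmd_enabled
    ready: bool    # daemon forked, heartbeating, shmem mapped


def require_daemon(status: DaemonStatus, sqlstate: str, condition: str) -> None:
    """Raise instead of falling back when an enabled daemon is not ready."""
    if status.enabled and not status.ready:
        raise ClusterError(sqlstate, condition)
    # Assumption: with the GUC off, the caller is not on this path at all.


def s1_gate(lms: DaemonStatus) -> None:
    require_daemon(lms, "53R80", "cluster_lms_unavailable")


def wait_graph_write(lmd: DaemonStatus) -> None:
    require_daemon(lmd, "53R81", "cluster_lmd_unavailable")


try:
    s1_gate(DaemonStatus(enabled=True, ready=False))
except ClusterError as err:
    print(err)  # 53R80: cluster_lms_unavailable
```

The point of the shape is that there is exactly one outcome for "enabled but not ready": an error, never a second code path.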
Once spec-2.20 is activated, caller-side LockAcquireExtended on the should-globalize path unfolds into a seven-step state machine. The seven steps are not internal daemon transitions but code regions the caller traverses while waiting for LMS to decide; each step corresponds to a specific partition LWLock acquire/release, PROCLOCK table mutation, and message round-trip with LMS.
| Step | Name | Key action |
|---|---|---|
| S1 | should_globalize gate | Entry check. If the resource class is not cross-node / LMS not ready / fast-path hit → early return or raise 53R80. |
| S2 | LOCALLOCK reentrant | Reentrant acquire via GrantLockLocal(locallock, owner) + early return. A bare ++nLocks is not permitted — it would corrupt owner accounting. |
| S3 | partition LWLock + PROCLOCK reservation | Acquire the lock partition LWLock; reserve a PROCLOCK entry (placeholder) without calling GrantLock; release the partition LWLock. |
| S4 | async enqueue + wait | Asynchronously enqueue GES_REQUEST onto the LMS work_queue; WaitLatch for the LMS callback, bounded by a timeout. |
| S5 | GRANT callback handling | Re-acquire the partition LWLock and recheck conflict. Success → GrantLock and set grd_registered = true. Still conflict → WaitOnLock wrapped in PG_TRY(). Failure → roll back via GES_RELEASE. |
| S6 | release | Normal release path; symmetrically tears down the remote holder via GES. |
| S7 | cleanup | Handles REJECT / TIMEOUT / cancel: drops the reservation, reclaims resources, issues GES_RELEASE if needed. |
Critical invariants (spec-2.16 L232 / L242):
- I45: the PROCLOCK entry is reserved in S3 and GrantLock is called only in S5; between those two steps the partition LWLock has already been released, and the GES wait loop never runs while holding the partition LWLock. Promoting GrantLock back into S3 would amount to initiating a cross-node wait under a held LWLock, violating the core constraint that an LWLock must not be held across CPU-preemption boundaries.
- I52: WaitOnLock in S5 must be wrapped in PG_TRY() / PG_CATCH(). A cancel or SIGTERM-driven longjmp skips the normal cleanup path; the catch block must explicitly GES_RELEASE any remote holder that was already granted, otherwise a ghost holder is left in the GRD. Together with I45 this defines the atomic-rollback semantics of the S3–S5 region.

spec-2.20 Q1 offers a second reading: the same S1–S7 numbering aligns with Oracle's classic 7 DLM lock states: STARTING / CONVERT / ACQUIRED / CONVERTING / RELEASED / CANCELED / COMPLETED. pgrac chooses external behavior aligned with Oracle (so Oracle-trained operators can transfer diagnostic intuition) while keeping internal names as S1–S7 numbers (so the spec can cross-reference them concisely). Both naming schemes describe the same state machine: caller-side S1–S7 is its code-region mapping inside a PostgreSQL backend, the Oracle 7-state vocabulary is its naming at the DLM protocol layer.
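The S3–S5 region and its rollback obligation can be modelled in a few lines. This is a simplified sketch under stated assumptions: `FakeLMS`, `Cancelled`, and `acquire_global` are hypothetical names, a `threading.Lock` stands in for the partition LWLock, a dict stands in for the PROCLOCK table, and Python's `try` / `except` plays the role of PG_TRY() / PG_CATCH(). It is not the pgrac implementation.

```python
import threading


class Cancelled(Exception):
    """Stands in for a cancel / SIGTERM longjmp out of WaitOnLock."""


class FakeLMS:
    """Hypothetical stand-in for the GES round-trip with LMS."""

    def __init__(self, verdict="GRANT"):
        self.verdict = verdict
        self.remote_holders = set()

    def request(self, resource):
        if self.verdict == "GRANT":
            self.remote_holders.add(resource)
        return self.verdict

    def release(self, resource):
        self.remote_holders.discard(resource)


def acquire_global(resource, lms, partition, proclocks,
                   still_conflicts=lambda: False, wait_on_lock=lambda: None):
    # S3: reserve a placeholder under the partition lock; no grant yet.
    with partition:
        proclocks[resource] = "RESERVED"

    # S4: the wait for LMS runs with the partition lock released.
    assert not partition.locked()
    verdict = lms.request(resource)
    if verdict != "GRANT":
        # S7: REJECT / TIMEOUT drops the reservation.
        with partition:
            proclocks.pop(resource, None)
        return False

    # S5: recheck under the partition lock, wait if needed, then grant.
    try:
        with partition:
            conflict = still_conflicts()
        if conflict:
            wait_on_lock()  # may raise Cancelled
        with partition:
            proclocks[resource] = "GRANTED"
        return True
    except Cancelled:
        # Catch-block duty: release the remote holder, drop the reservation.
        lms.release(resource)
        with partition:
            proclocks.pop(resource, None)
        raise
```

If the `except` branch were missing, a cancelled wait would leave `resource` in `lms.remote_holders`, which is the ghost-holder situation the second invariant rules out.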
The seven steps span three actors: the caller backend, the LMS daemon, and the LMD daemon.
| Owner | Steps | Notes |
|---|---|---|
| caller backend | S1 / S2 / S3 / S4 wait / S5 callback / S6 / S7 | All partition LWLock operations, PROCLOCK table mutations, and LOCALLOCK accounting run in caller context. |
| LMS daemon | S4 work_queue consumer + grant decision body | LMS owns the inbound work_queue consumer and the GES grant-decision body (spec-2.18 D1); the outcome wakes the caller via ProcSignal + latch. |
| LMD daemon | Cuts across S4–S5 | LMD owns the wait-for graph, Tarjan SCC, and victim selection (spec-2.19 D1, spec-2.22 D2–D3); when a cycle is found it cancels the chosen caller via ProcSignal, which longjmps out of S5 WaitOnLock into S7 cleanup. |
The reverse channel from LMS / LMD to the caller is ProcSignal. Two new signals were pre-reserved in procsignal.h well in advance:
- PROCSIG_CLUSTER_GES_BAST (spec-2.17 Q8): LMS notifies the caller that it must downgrade or release a remote holder it owns locally. Blocking-AST semantics; the caller handles it inside ProcessInterrupts.
- PROCSIG_CLUSTER_GES_CANCEL (spec-2.17 Q9): used by LMD when deadlock detection hits a cycle, or by the caller's own cancellation path; triggers a longjmp out of the S5 WaitOnLock.

Backend exit cleanup: spec-2.17 introduces a new single-point entry, cluster_grd_cleanup_on_backend_exit(procno), that covers all single-backend exit paths: client CANCEL, external SIGTERM, the on_proc_exit callback chain, and self-abort (ereport(FATAL)). This entry explicitly excludes BAST timeout (spec-2.17 Q21): BAST timeout takes the reconfig / fence path, not the single-backend exit path, and reusing the cleanup routine would incorrectly treat a node-wide holder as a single-backend holder and GES_RELEASE it.
Runtime fallback between LMS and LMD is prohibited — there is no "LMD covers grants while LMS is busy" or "caller synchronously detects deadlocks when LMD fails." Single-ownership + fail-closed is a core invariant of spec-2.18 / spec-2.19: any fallback would create two simultaneous owners for grant-decision or deadlock-detection, leading to GRD-state forks or duplicate victim cancellation. The only legitimate way to disable LMS / LMD is to change the PGC_POSTMASTER GUCs cluster.lms_enabled / cluster.lmd_enabled and restart the instance.
The startup sequence reflects the cluster dependency chain: networking and heartbeats must exist before the lock service; the lock service must exist before recovery; recovery must complete before client connections are accepted.
postmaster
│
├── Phase 0: Foundation
│ └─ logger (logging first)
│
├── Phase 1: Cluster foundation ← 60 s timeout; failure → instance crash
│ ├─ Interconnect Listener (network layer ready)
│ ├─ CSSD (peer tri-state + heartbeat broadcast)
│ ├─ qvotec (voting-disk quorum arbitration)
│ └─ LMON (join cluster / GRD sync)
│
├── Phase 2: Lock service ← 30 s timeout; failure → instance crash
│ ├─ LMS0..LMSn (parallel fork; LMS_READY → LMON tick early returns)
│ ├─ LMD (wait-for graph + Tarjan SCC ownership transfers in)
│ └─ LCK
│
├── Phase 3: Recovery (on demand) ← 600 s timeout (GUC cluster.recovery_timeout)
│ ├─ startup process (crash recovery entry point)
│ │ ├─ detect merged recovery → LMON launches Recovery Coordinator
│ │ ├─ Recovery Coordinator → spawn Recovery Workers
│ │ └─ startup / Coordinator / Workers all exit when complete
│ └─ [standby + ADG only] MRP starts
│
└── Phase 4: Normal service ← 30 s timeout; single-process failure restarts 3×
├─ checkpointer / bgwriter / walwriter (with embedded BOC)
├─ archiver / autovacuum launcher
├─ TT GC / Undo Cleaner / Sinval Broadcaster
├─ DIAG / Cluster Stats / logical rep launcher
└─ begin accepting client connections
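Three critical dependency arrows:

1. Phase 1 → Phase 2: networking and heartbeats must exist before the lock service starts.
2. Phase 2 → Phase 3: the lock service must exist before recovery runs.
3. Phase 3 → Phase 4: recovery must complete before client connections are accepted.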
Shutdown is the reverse of startup. The critical invariant is that global locks must be released first (Phase 2 processes shut down) before the network is torn down (Phase 1 processes shut down); reversing this order leaves other nodes unable to detect the lock release, causing cluster state inconsistency.
Shutdown order:
1. Reject new connections
2. Wait for client backends to exit (default 30 s)
3. Phase 4 processes (Cluster Stats / DIAG / Sinval Broadcaster / Undo Cleaner / TT GC / archiver / autovacuum)
4. walwriter / bgwriter / checkpointer (final checkpoint)
5. [standby] MRP
6. Phase 2 processes (LCK / LMD / LMS0..LMSn) ← release global locks
7. Phase 1 processes (LMON / qvotec / CSSD / Interconnect Listener) ← notify graceful leave
8. logger
9. postmaster exits
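The relationship between the startup phases and the shutdown order can be sketched as a reversal of a phase table. The structure and function names below are illustrative; the sketch models only phase-level ordering (Phase 4 is actually stopped in two steps, as listed above), and Phase 3 lists only MRP because the other recovery processes have already exited by the time shutdown begins.

```python
# (phase number, label, processes in fork order), following the startup sequence.
STARTUP_PHASES = [
    (0, "Foundation", ["logger"]),
    (1, "Cluster foundation", ["Interconnect Listener", "CSSD", "qvotec", "LMON"]),
    (2, "Lock service", ["LMS", "LMD", "LCK"]),
    (3, "Recovery (on demand)", ["MRP"]),
    (4, "Normal service", ["checkpointer", "bgwriter", "walwriter", "archiver",
                           "autovacuum launcher", "TT GC", "Undo Cleaner",
                           "Sinval Broadcaster", "DIAG", "Cluster Stats"]),
]


def shutdown_phases(phases=STARTUP_PHASES):
    """Reverse both the phase order and the order inside each phase."""
    return [(num, label, list(reversed(procs)))
            for num, label, procs in reversed(phases)]


def stops_before(a: str, b: str, phases=STARTUP_PHASES) -> bool:
    """True if process a is stopped before process b during shutdown."""
    order = [p for _, _, procs in shutdown_phases(phases) for p in procs]
    return order.index(a) < order.index(b)


# The lock service releases global locks before the network is torn down.
print(stops_before("LMS", "Interconnect Listener"))  # True
print([num for num, _, _ in shutdown_phases()])      # [4, 3, 2, 1, 0]
```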
pgrac inter-process communication operates in two layers: same-node processes rely on shared memory + signals + in-process queues; cross-node processes rely on the message queues dispatched by the Interconnect Listener. The boundary between the two layers is explicit — there is no design that "directly accesses shared memory across nodes." All cross-node data access goes through protocol messages (PCM block ship / GES lock grant).
Same-node IPC:
| Mechanism | Purpose |
|---|---|
| Shared memory (SysV / mmap) | Lock structures, buffer pool, TT slots, GRD cache |
| Signals (SIGTERM / SIGUSR1 / SIGUSR2) | postmaster → child process control (identical to PG-native behavior) |
| ProcSignal (PROCSIG_CLUSTER_GES_BAST / PROCSIG_CLUSTER_GES_CANCEL / PROCSIG_CLUSTER_FREEZE_WRITES, etc.) | Reverse channel from LMS / LMD / LMON to backends; consumed inside ProcessInterrupts |
| Latch (SetLatch) | Wake waiting backends / workers (reuses PG mechanism) |
| In-process queue (lock-free ring buffer) | LMS dispatcher → LMS worker (see §8.6.1) |
Cross-node IPC:
| Mechanism | Purpose |
|---|---|
| Interconnect (RDMA / TCP) | All cross-node protocol messages (PCM / GES / SCN / CSSD heartbeat) |
| Listener → worker queue | Interconnect Listener dispatches inbound messages by resource hash |
| Worker → Listener queue | Workers deliver outbound messages to the Listener for unified sending |
LMS is the highest-concurrency component in the process tree: N workers (default 4) share a single Interconnect Listener entry point, but each worker handles an independent subset of resources — there is no inter-worker lock contention.
Sharding strategy: worker_id = hash(resource_id) % N. For PCM, resource_id is the three-tuple (tablespace_oid, relfilenode, block_no); for GES it is the lock resource name. The same resource is always handled by the same worker, preventing concurrent races.
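The sharding rule can be sketched as follows. The text does not name the hash function, so CRC-32 from Python's zlib is used here purely as an assumption (the built-in `hash()` is randomized per process for strings and would not give stable placement). The function names and the byte layout of the PCM key are likewise illustrative.

```python
import struct
import zlib


def pcm_worker_id(tablespace_oid: int, relfilenode: int, block_no: int,
                  n_workers: int = 4) -> int:
    """Map a PCM resource (tablespace_oid, relfilenode, block_no) to a worker."""
    # Pack the three-tuple into 12 little-endian bytes so the hash input is stable.
    key = struct.pack("<III", tablespace_oid, relfilenode, block_no)
    return zlib.crc32(key) % n_workers


def ges_worker_id(resource_name: str, n_workers: int = 4) -> int:
    """Map a GES lock resource name to a worker."""
    return zlib.crc32(resource_name.encode("utf-8")) % n_workers


# The same resource always lands on the same worker.
first = pcm_worker_id(1663, 16384, 42)
second = pcm_worker_id(1663, 16384, 42)
print(first == second)  # True
```

Because placement depends on N, changing the worker count remaps resources; that fits the LMS worker count being fixed for the lifetime of a running Listener in this sketch.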
Message flow:
Cross-node message arrives
│
Interconnect Listener (single inbound point)
│
├─ read msg.resource_id
├─ compute worker_id = hash(resource_id) % N
└─ deliver to workers[worker_id].queue (lock-free ring buffer)
LMS worker inner loop:
while (running):
    msg = my_queue.recv()                 # blocking wait
    handle_pcm_or_ges_msg(msg)            # PCM state machine / GES grant
    update_local_scn(msg.piggyback_scn)   # Lamport advance
    if (reply needed):
        outbound_queue.send(reply)        # hand to Listener for sending
The key property of this design: the Listener is a single-threaded fan-out; each worker is a single-threaded serial processor of its own queue. There is no scenario in the system where multiple writers compete for the same GRD entry (all messages for a given resource are serialized to the same worker), so RDMA write operations require no additional per-resource locking.
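A runnable model of the single-consumer worker loop is shown below. It is a sketch under several assumptions: `queue.Queue` stands in for the lock-free ring buffer, a thread stands in for the LMS worker process, the message fields are hypothetical, and the SCN advance is modelled as taking the maximum of the local and piggybacked values.

```python
import queue
import threading

STOP = object()  # sentinel used by this sketch to end the loop


class LmsWorker(threading.Thread):
    """Serially consumes its own queue; no other thread touches its state."""

    def __init__(self, worker_id, outbound):
        super().__init__(daemon=True)
        self.worker_id = worker_id
        self.inbox = queue.Queue()  # stand-in for the per-worker ring buffer
        self.outbound = outbound    # shared queue drained by the Listener
        self.local_scn = 0
        self.handled = []

    def run(self):
        while True:
            msg = self.inbox.get()  # blocking wait
            if msg is STOP:
                break
            self.handled.append(msg["resource_id"])  # PCM / GES handling goes here
            self.local_scn = max(self.local_scn, msg["piggyback_scn"])
            if msg.get("needs_reply"):
                self.outbound.put({"resource_id": msg["resource_id"],
                                   "piggyback_scn": self.local_scn})


outbound = queue.Queue()
worker = LmsWorker(0, outbound)
worker.start()
worker.inbox.put({"resource_id": "blk-A", "piggyback_scn": 120, "needs_reply": True})
worker.inbox.put({"resource_id": "blk-B", "piggyback_scn": 90})
worker.inbox.put(STOP)
worker.join()
print(worker.handled, worker.local_scn, outbound.qsize())  # ['blk-A', 'blk-B'] 120 1
```

Messages are handled strictly in arrival order, and the worker's SCN never moves backwards even when a later message carries a smaller piggybacked value.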
Choosing N: The default N=4 corresponds to approximately 1.0–2.0 CPU cores for a 4-node OLTP cluster at 100K TPS. When N is too small, worker queues accumulate; when N is too large, LRU cache sharding loses efficiency (each worker caches fewer GRD entries). For production tuning, use the queue_depth field of the pg_stat_cluster_workers view as the primary signal.
Failure classification: LMS / LMD / LCK / LMON / CSSD / qvotec / Interconnect Listener / Sinval Broadcaster are critical processes — after a crash, postmaster restarts them; more than 3 restarts → instance crash (fenced by other nodes). Undo Cleaner / TT GC / DIAG / Cluster Stats are gracefully-degradable processes — after a crash, postmaster restarts them; more than 3 restarts → WARNING only, no instance crash (GC falls behind or monitoring degrades, but cluster correctness is unaffected).
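The two failure classes reduce to a small decision function. This is an illustrative sketch: the function name and returned action strings are hypothetical, and whether a gracefully-degradable process keeps being restarted after the warning is not specified here, so the sketch only reports the escalation.

```python
CRITICAL = {"LMS", "LMD", "LCK", "LMON", "CSSD", "qvotec",
            "Interconnect Listener", "Sinval Broadcaster"}
DEGRADABLE = {"Undo Cleaner", "TT GC", "DIAG", "Cluster Stats"}
MAX_RESTARTS = 3


def on_child_crash(name: str, restarts_so_far: int) -> str:
    """Decide what the supervisor does when a pgrac daemon crashes.

    restarts_so_far counts restarts already performed for this process.
    """
    if name not in CRITICAL and name not in DEGRADABLE:
        raise KeyError(f"unknown pgrac process: {name}")
    if restarts_so_far < MAX_RESTARTS:
        return "restart"
    # A further restart would exceed the limit of 3.
    return "instance_crash" if name in CRITICAL else "warning"


print(on_child_crash("CSSD", 0))   # restart
print(on_child_crash("CSSD", 3))   # instance_crash
print(on_child_crash("TT GC", 3))  # warning
```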
For deeper protocol details, refer to the following resources:
- BackendType enum definitions, pg_stat_cluster_workers view fields, per-process failure-decision table, full GUC parameter list (cluster.lms_workers / cluster.lms_enabled / cluster.lmd_enabled / cluster.recovery_timeout / cluster.cssd_heartbeat_interval_ms, etc.)
- background-process-design.md §2.1 / spec-2.5 / spec-2.6 / spec-2.16 / spec-2.17 / spec-2.18 / spec-2.19 / spec-2.20 / spec-2.22: full specification for 16 new process types + 7 adjusted PG-native processes, S1–S7 state-machine invariants I45 / I52, ProcSignal reverse channel, single-point backend exit cleanup entry