
This document is a deep-dive companion to Phase 4. It focuses on the internal mechanics that typically set the hard limits of systems under peak load.

It is intentionally technology-heavy. If you want the “why” first, start with:

4.D1 RPC and Service Framework Evolution (Why RPC becomes a platform)

At large scale, RPC is not only “a protocol.” It is an operating model:

  • Timeouts, retries, back-pressure, and circuit breakers are applied as consistent, framework-level policy rather than per-service choices.
  • Service discovery and routing become critical for locality and multi-active deployments.
  • Observability must be baked into the framework (trace context, structured logs, metrics).
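
As one illustration of framework-level policy, the following is a minimal circuit breaker sketch. The class name, thresholds, and the injected `clock` are assumptions for illustration, not part of any specific RPC framework, and the sketch is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    """Hypothetical breaker: closed -> open after N failures -> trial call after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout_s=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("dependency marked unhealthy, failing fast")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Putting this in the framework rather than in each service means every caller gets the same fail-fast behavior when a dependency degrades.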

What matters most:

  • Stability under overload (avoid retry storms).
  • Clear failure semantics (when to fail fast vs queue).
  • Governance (contracts, versioning, rollout policies).
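
To make "avoid retry storms" concrete, here is a hedged sketch of a retry budget combined with jittered exponential backoff and an overall deadline. `RetryBudget`, `TransientError`, `call_with_retries`, and all default values are hypothetical names and numbers chosen for illustration.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a retryable failure (timeout, connection reset, and so on)."""


class RetryBudget:
    """Token bucket: retries are allowed only as a fraction of first attempts."""

    def __init__(self, ratio=0.1, max_tokens=10.0):
        self.ratio = ratio            # tokens earned per first-attempt request
        self.max_tokens = max_tokens
        self.tokens = max_tokens

    def record_request(self):
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_spend(self):
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def call_with_retries(fn, budget, deadline_s=1.0, max_attempts=3,
                      base_delay_s=0.01, sleep=time.sleep, clock=time.monotonic):
    start = clock()
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            last_attempt = attempt == max_attempts - 1
            out_of_time = clock() - start >= deadline_s
            # Fail fast when attempts, time, or the shared budget are exhausted.
            if last_attempt or out_of_time or not budget.try_spend():
                raise
            # Full jitter spreads retries out so clients do not retry in lockstep.
            sleep(random.uniform(0, base_delay_s * 2 ** attempt))
```

Because the budget is shared across calls, a widespread outage drains it quickly and most callers then fail after a single attempt instead of multiplying load on an already overloaded dependency.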

4.D2 Messaging at Peak Scale (Why MQ is a control plane)

Message queues become the pressure valve for peaks:

  • Buffer burst traffic.
  • Decouple slow consumers.
  • Move non-critical work off the critical path.

Design concerns at scale:

  • Topic/partition strategy and hot partitions.
  • Consumer group lag monitoring and recovery playbooks.
  • Idempotency and exactly-once effects: assume at-least-once delivery and make handlers safe to run twice, rather than relying on exactly-once delivery.
  • DLQs and reprocessing safety.
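
The idempotency and DLQ concerns above can be sketched as a consumer that deduplicates by message ID and parks poison messages after a bounded number of attempts. This is a minimal in-memory illustration: `Message`, `IdempotentConsumer`, and the return values are hypothetical, and the in-memory set stands in for a durable dedup store that a real system would update atomically with the handler's side effect.

```python
from dataclasses import dataclass


@dataclass
class Message:
    message_id: str
    payload: dict
    attempts: int = 0


class IdempotentConsumer:
    def __init__(self, handler, max_attempts=3):
        self.handler = handler
        self.max_attempts = max_attempts
        self.processed_ids = set()  # assumption: stand-in for a durable dedup table
        self.dead_letters = []      # assumption: stand-in for a real DLQ topic

    def consume(self, msg):
        # Redelivery of an already-applied message must have no further effect.
        if msg.message_id in self.processed_ids:
            return "duplicate"
        try:
            self.handler(msg.payload)
        except Exception:
            msg.attempts += 1
            if msg.attempts >= self.max_attempts:
                self.dead_letters.append(msg)  # park it for inspection and replay
                return "dead_lettered"
            return "retry"
        self.processed_ids.add(msg.message_id)
        return "processed"
```

Replaying a DLQ through the same consumer stays safe because the dedup check applies to reprocessed messages as well.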

4.D3 Storage Engine Mechanics (Compaction, MVCC, write amplification)

At extreme QPS, storage engines and compaction policies can decide success or failure:

  • Compaction affects write amplification, IO spikes, and tail latencies.
  • MVCC affects read performance and garbage collection behavior.
  • Index and schema design affect hot keys and contention.
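
To show why MVCC ties read behavior to garbage collection, here is a toy multi-version store. It is a sketch under simplifying assumptions (single process, a logical counter as the commit timestamp, no transactions or conflicts), and `MVCCStore` and its methods are hypothetical names.

```python
class MVCCStore:
    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), oldest first
        self.clock = 0      # logical commit timestamp

    def write(self, key, value):
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

    def snapshot(self):
        # A reader pins the current timestamp and sees nothing newer.
        return self.clock

    def read(self, key, snapshot_ts):
        # Walk the version chain from newest to oldest; long chains slow reads.
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None

    def gc(self, oldest_active_snapshot):
        """Drop versions that no active snapshot can still see. Returns count removed."""
        removed = 0
        for key, chain in self.versions.items():
            keep_from = 0
            for i, (commit_ts, _) in enumerate(chain):
                if commit_ts <= oldest_active_snapshot:
                    keep_from = i  # newest version visible to the oldest reader
            removed += keep_from
            self.versions[key] = chain[keep_from:]
        return removed
```

In this model a single long-running snapshot prevents `gc` from reclaiming anything newer than it, so version chains grow and reads on hot keys get slower, which is the interaction the bullet above describes.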

The key operational lesson:

  • Treat storage behavior as observable and testable under peak-like write patterns.
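
One way to make storage behavior observable is to track write amplification, defined as bytes written to the device divided by bytes written by the application, per sampling window. The sketch below assumes you can sample two cumulative counters; the function names and the sample values are illustrative only.

```python
def write_amplification(app_bytes, device_bytes):
    """Bytes physically written per byte the application asked to write."""
    if app_bytes <= 0:
        raise ValueError("app_bytes must be positive")
    return device_bytes / app_bytes


def per_window_amplification(samples):
    """samples: cumulative (app_bytes, device_bytes) counters taken at regular intervals."""
    result = []
    for (a0, d0), (a1, d1) in zip(samples, samples[1:]):
        app_delta = a1 - a0
        if app_delta > 0:  # skip idle windows with no application writes
            result.append(write_amplification(app_delta, d1 - d0))
    return result


# Hypothetical counters: the second window shows a compaction-driven spike.
samples = [(0, 0), (100, 300), (200, 1500), (300, 1800)]
print(per_window_amplification(samples))  # [3.0, 12.0, 3.0]
```

Comparing these per-window values between a load test and normal traffic is a cheap check on whether the test's write pattern is actually peak-like.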

4.D4 Distributed Transactions (Sagas, compensation, and long-running correctness)

Commerce and finance workloads at peak often need multi-step workflows in which any step can fail:

  • When the steps cannot all run inside a single ACID transaction, you need explicit orchestration.
  • In practice, this often becomes saga-style: forward steps plus compensations that undo completed steps.

Success criteria:

  • Idempotent handlers.
  • Unique transaction identifiers.
  • Replay safety and auditability.
  • Clear “terminal states” and recovery procedures.
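
The criteria above can be sketched as a small saga orchestrator. This is an in-memory illustration under stated assumptions: `Step`, `SagaOrchestrator`, and the state names are hypothetical, the `log` dict stands in for a durable saga log, and compensations are assumed not to fail.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]
    compensate: Callable[[dict], None]


class SagaOrchestrator:
    TERMINAL = ("COMPLETED", "COMPENSATED")

    def __init__(self, steps):
        self.steps = steps
        self._by_name = {s.name: s for s in steps}
        self.log = {}  # saga_id -> {"state": ..., "done": [...]}; doubles as an audit trail

    def run(self, saga_id, ctx):
        record = self.log.setdefault(saga_id, {"state": "RUNNING", "done": []})
        if record["state"] in self.TERMINAL:
            return record["state"]  # replaying a finished saga is a no-op
        for step in self.steps:
            if step.name in record["done"]:
                continue  # resuming after a crash: skip steps already applied
            try:
                step.action(ctx)
            except Exception:
                # Undo completed steps in reverse order, then reach a terminal state.
                for done_name in reversed(record["done"]):
                    self._by_name[done_name].compensate(ctx)
                record["state"] = "COMPENSATED"
                return record["state"]
            record["done"].append(step.name)
        record["state"] = "COMPLETED"
        return record["state"]
```

The unique `saga_id` is what makes replay safe here: a retried request with the same ID returns the recorded terminal state instead of charging or reserving twice.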

4.D5 Real-Time Risk Control (Rules + ML under strict latency)

Fraud and abuse are part of peak load. A large-scale risk system typically requires:

  • Streaming and real-time feature pipelines.
  • Fast inference paths (strict latency budgets).
  • Rule systems for immediate controls plus ML for patterns.
  • Feedback loops to adapt to new attack strategies.
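
A rules-plus-ML decision path under a latency budget might look like the sketch below. The rule, the threshold, the 50 ms budget, and the "review" fallback are all assumptions for illustration; `model` is any callable returning a score in [0, 1], and the timeout only stops waiting for the model, it does not interrupt it.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared pool so a slow model call does not block the request thread.
_executor = ThreadPoolExecutor(max_workers=4)


def rule_decision(event):
    """Hypothetical hard rule: cheap, deterministic, evaluated first."""
    if event.get("amount", 0) > 10_000:
        return "deny"
    return None


def decide(event, model, budget_s=0.05, threshold=0.8):
    verdict = rule_decision(event)
    if verdict is not None:
        return verdict, "rule"
    future = _executor.submit(model, event)
    try:
        score = future.result(timeout=budget_s)
    except FutureTimeout:
        future.cancel()  # best effort; an already-running call keeps running
        # Assumption: over-budget requests degrade to manual review, not a hard allow or deny.
        return "review", "fallback"
    return ("deny" if score >= threshold else "allow"), "model"
```

Returning the decision source alongside the verdict keeps the fallback rate observable, which matters because a rising fallback rate at peak means the model path is silently dropping out.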

The key is operational: a risk system must be deployable and observable like any other critical service.

Performance Numbers (How to read them)

Performance numbers are only useful if you understand:

  • The workload shape (reads vs writes, hotspots, dependency depth).
  • The failure semantics (what is allowed to fail, what must not).
  • The operational model (how much of the “peak” is pre-warmed, cached, or buffered).

Next

If you want to map these ideas into a modern stack, continue to: