This document is a deep-dive companion to Phase 4. It focuses on the internal mechanics that typically define the hard limits of peak systems.
It is intentionally technology-heavy. If you want the “why” first, start with:
4.D1 RPC and Service Framework Evolution (Why RPC becomes a platform)
At large scale, RPC is not only “a protocol.” It is an operating model:
- Timeouts, retries, back-pressure, and circuit breakers become consistent policy.
- Service discovery and routing become critical for locality and multi-active.
- Observability must be baked into the framework (trace context, structured logs, metrics).
What matters most:
- Stability under overload (avoid retry storms).
- Clear failure semantics (when to fail fast vs queue).
- Governance (contracts, versioning, rollout policies).
4.D2 Messaging at Peak Scale (Why MQ is a control plane)
Message queues become the pressure valve for peaks:
- Buffer burst traffic.
- Decouple slow consumers.
- Move non-critical work off the critical path.
Design concerns at scale:
- Topic/partition strategy and hot partitions.
- Consumer group lag monitoring and recovery playbooks.
- Idempotency and exactly-once effects (not necessarily exactly-once delivery).
- DLQs and reprocessing safety.
4.D3 Storage Engine Mechanics (Compaction, MVCC, write amplification)
At extreme QPS, storage engines and compaction policies can decide success or failure:
- Compaction affects write amplification, IO spikes, and tail latencies.
- MVCC affects read performance and garbage collection behavior.
- Index and schema design affect hot keys and contention.
The key operational lesson:
- Treat storage behavior as observable and testable under peak-like write patterns.
4.D4 Distributed Transactions (Sagas, compensation, and long-running correctness)
Peak commerce and finance often need multi-step workflows with failure modes:
- When not all steps can be a single ACID transaction, you need explicit orchestration.
- In practice, this often becomes saga-style: forward steps + compensations.
Success criteria:
- Idempotent handlers.
- Unique transaction identifiers.
- Replay safety and auditability.
- Clear “terminal states” and recovery procedures.
4.D5 Real-Time Risk Control (Rules + ML under strict latency)
Fraud and abuse are part of peak load. A large-scale risk system typically requires:
- Streaming and real-time feature pipelines.
- Fast inference paths (strict latency budgets).
- Rule systems for immediate controls plus ML for patterns.
- Feedback loops to adapt to new attack strategies.
The key is operational: a risk system must be deployable and observable like any other critical service.
Performance Numbers (How to read them)
Performance numbers are only useful if you understand:
- The workload shape (reads vs writes, hotspots, dependency depth).
- The failure semantics (what is allowed to fail, what must not).
- The operational model (how much of the “peak” is pre-warmed, cached, or buffered).
Next
If you want to map these ideas into a modern stack, continue to: