Phase 3: Operations Playbook

This phase is about how peak performance becomes repeatable. The core claim: peaks are won in operations, not heroics.

3.1 Capacity Planning

Capacity planning for peak events is fundamentally a prediction problem under uncertainty.

Common patterns in mature peak readiness:

Peak forecasting using historical curves + product plans + marketing inputs.
Safety margins based on known unknowns (dependencies, tail latencies, cache miss storms).
Bottleneck-first reviews: DB, network, MQ, hot keys, and “global” dependencies.
Explicit downgrade plans: define what can be turned off to protect the payment core.

Good capacity planning is not a one-off spreadsheet; it is a living process in which every identified bottleneck has a named owner.
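To make the forecast-plus-margin idea concrete, here is a minimal sketch in Python. The function names, the 30% safety margin, and the 70% utilization ceiling are illustrative assumptions, not values from this playbook; the second helper also assumes every tier sees the full request rate, which real service graphs rarely do.

```python
import math
from typing import Dict, Tuple


def required_instances(forecast_peak_rps: float,
                       per_instance_rps: float,
                       safety_margin: float = 0.3,
                       max_utilization: float = 0.7) -> int:
    """Instances needed to serve the forecast peak plus a safety margin.

    The defaults are placeholders (assumptions), not recommendations.
    """
    if forecast_peak_rps < 0 or per_instance_rps <= 0:
        raise ValueError("rates must be non-negative / positive")
    if safety_margin < 0 or not 0 < max_utilization <= 1:
        raise ValueError("invalid safety margin or utilization ceiling")
    # Pad the forecast to cover known unknowns (tail latency, cache misses).
    target_rps = forecast_peak_rps * (1 + safety_margin)
    # Plan to run each instance below its measured limit.
    usable_rps = per_instance_rps * max_utilization
    return math.ceil(target_rps / usable_rps)


def tightest_bottleneck(tier_capacity_rps: Dict[str, float],
                        forecast_peak_rps: float,
                        safety_margin: float = 0.3) -> Tuple[str, float]:
    """Return the tier with the lowest capacity-to-target ratio.

    A ratio below 1.0 means that tier cannot absorb the padded forecast.
    """
    target_rps = forecast_peak_rps * (1 + safety_margin)
    name = min(tier_capacity_rps, key=tier_capacity_rps.get)
    return name, tier_capacity_rps[name] / target_rps
```

With a forecast of 100,000 requests per second and instances measured at 500 requests per second each, `required_instances(100_000, 500)` returns 372. A bottleneck-first review would start with whichever tier `tightest_bottleneck` reports, since that tier sets the ceiling for the whole path.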

3.2 Full-Link Stress Testing (Making confidence deterministic)

Component tests alone do not predict peak behavior, because failures at peak tend to emerge from interactions between services. Full-link testing exercises the real production path:

The same service graph.
The same dependency surfaces.
Production-like concurrency and hotspots.

What full-link testing usually requires:

Traffic replay or high-fidelity traffic generation.
Strict data isolation (shadow tables / isolated accounts / isolation layers).
End-to-end observability to show where latency accumulates and where failures originate.
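One common way to get strict data isolation is to tag test traffic at the edge and route its reads and writes to shadow tables. The sketch below illustrates that idea only; the header names `x-stress-test` and `x-trace-id`, the `_shadow` suffix, and all function names are hypothetical and assume lowercase header keys.

```python
from dataclasses import dataclass
from typing import Dict, Mapping

SHADOW_SUFFIX = "_shadow"            # assumed naming convention for shadow tables
STRESS_HEADER = "x-stress-test"      # hypothetical marker header
TRACE_HEADER = "x-trace-id"          # hypothetical trace header


@dataclass(frozen=True)
class RequestContext:
    trace_id: str
    is_stress_test: bool = False


def context_from_headers(headers: Mapping[str, str]) -> RequestContext:
    """Build the per-request context from inbound headers."""
    return RequestContext(
        trace_id=headers.get(TRACE_HEADER, ""),
        is_stress_test=headers.get(STRESS_HEADER) == "1",
    )


def table_for(ctx: RequestContext, base_table: str) -> str:
    """Route stress-test traffic to the shadow copy of a table."""
    return base_table + SHADOW_SUFFIX if ctx.is_stress_test else base_table


def outbound_headers(ctx: RequestContext) -> Dict[str, str]:
    """Propagate the marker downstream so every hop stays isolated."""
    headers = {TRACE_HEADER: ctx.trace_id}
    if ctx.is_stress_test:
        headers[STRESS_HEADER] = "1"
    return headers
```

The important property is propagation: if any hop drops the marker, test writes land in production tables, which is why the flag travels with the trace context rather than being re-derived at each service.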

The operational payoff:

Catch systemic issues early (cascades, timeouts, retries, “thundering herds”).
Turn pre-event readiness from “hope” into measurable confidence.

3.3 Incident Command and Monitoring

At peak scale, the goal is not “no incidents.” The goal is:

Detect fast
Classify correctly
Mitigate safely
Recover predictably

A strong incident setup typically includes:

A command center operating model (roles, comms, escalation paths).
Golden-signal dashboards (latency, traffic, errors, saturation) + business KPIs.
Runbooks for known failure classes (DB pressure, MQ lag, cache failure, hot partitions).
Post-incident reviews that produce permanent fixes (not blame).
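As a rough illustration of the "detect fast" step, the four golden signals can be checked against explicit limits so that a breach names the signal involved. This is a sketch under stated assumptions: the `GoldenSignals` type, the `LIMITS` values, and the function name are invented for the example, and real limits would come from the capacity forecast and service objectives.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GoldenSignals:
    p99_latency_ms: float
    traffic_rps: float
    error_rate: float   # fraction of failed requests, 0..1
    saturation: float   # utilization of the scarcest resource, 0..1


# Placeholder limits (assumptions for illustration only).
LIMITS: Dict[str, float] = {
    "p99_latency_ms": 800.0,
    "traffic_rps": 130_000.0,
    "error_rate": 0.01,
    "saturation": 0.85,
}


def breached_signals(signals: GoldenSignals,
                     limits: Dict[str, float] = LIMITS) -> List[str]:
    """Return the names of the signals that exceed their limits."""
    return [name for name, limit in limits.items()
            if getattr(signals, name) > limit]
```

A snapshot with a p99 latency of 1,200 ms and a 5% error rate would return `["p99_latency_ms", "error_rate"]`, which gives the on-call engineer a starting point for choosing a runbook.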

3.4 Downgrade and Degrade Strategies

During extreme load, protecting the payment core often means turning optional paths into best-effort:

Disable non-critical features (recommendations, long-tail queries, expensive joins).
Switch to cached or approximated results.
Shed traffic by tiers (rate-limits, queueing, admission control).

The key is that degrade plans must be:

Pre-defined,
Tested in drills,
Observable in dashboards,
Reversible.
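The four properties above can be sketched as a small tier-based admission controller. The tier names, the level semantics, and the `DegradeController` class are hypothetical; a production setup would typically back this with a configuration service and export the current level as a metric.

```python
from enum import IntEnum
from typing import List, Tuple


class Tier(IntEnum):
    """Traffic tiers, lowest value = most critical (illustrative names)."""
    PAYMENT_CORE = 0
    ORDER_QUERY = 1
    RECOMMENDATIONS = 2


MAX_LEVEL = int(max(Tier))  # at the highest level only PAYMENT_CORE is served


class DegradeController:
    def __init__(self) -> None:
        self.level = 0                              # 0 = nothing degraded
        self.history: List[Tuple[int, str]] = []    # audit trail for dashboards

    def set_level(self, level: int, reason: str) -> None:
        """Pre-defined levels only; each change is recorded with its reason."""
        if not 0 <= level <= MAX_LEVEL:
            raise ValueError(f"level must be between 0 and {MAX_LEVEL}")
        self.level = level
        self.history.append((level, reason))

    def admit(self, tier: Tier) -> bool:
        """Each level sheds one more tier, starting from the least critical."""
        return tier <= MAX_LEVEL - self.level
```

Because `set_level(0, ...)` restores full service, the same call path used to degrade is the one used to recover, and `history` gives drills and dashboards something concrete to verify.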

3.5 Operational Checklist (Template)

Pre-event (T-30 days)

Confirm peak forecast + assumptions.
Run full-link tests and record bottleneck owners.
Freeze risky changes; define release gates and rollback rules.
Validate degrade plans and toggles.

Pre-event (T-7 days)

Re-run full-link tests with the final traffic model.
Drill incident command and escalation.
Validate DB/MQ capacity and failover.
Validate dashboards, alerts, and on-call coverage.

During event (T-0)

Monitor golden signals + business KPIs continuously.
Apply throttles/degrades early (don’t wait for collapse).
Track anomalies and mitigation actions in a shared log.

Post-event (T+1 day)

Run postmortems that produce action items with named owners.
Compare forecast vs actual; update models.
Backlog the hard fixes (hot keys, global dependencies, unsafe retries).
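For the forecast-versus-actual comparison, two simple numbers are often enough to start: how far off the peak was, and the average error across the event. The sketch below assumes both series are sampled on the same time buckets; the function name and the choice of mean absolute percentage error are illustrative, not prescribed by this playbook.

```python
from typing import Dict, Sequence


def forecast_error(forecast: Sequence[float],
                   actual: Sequence[float]) -> Dict[str, float]:
    """Compare forecast and observed traffic over matching time buckets."""
    if len(forecast) != len(actual) or not actual:
        raise ValueError("series must be non-empty and the same length")
    if max(forecast) <= 0 or min(actual) <= 0:
        raise ValueError("values must be positive")
    # Positive peak_error means the real peak exceeded the forecast peak.
    peak_error = (max(actual) - max(forecast)) / max(forecast)
    # Mean absolute percentage error across all buckets.
    mape = sum(abs(a - f) / a for f, a in zip(forecast, actual)) / len(actual)
    return {"peak_error": peak_error, "mape": mape}
```

For example, a forecast of `[100, 200, 400]` against an actual of `[110, 250, 500]` gives a peak error of 0.25, meaning the real peak was 25% above plan, which is the kind of gap that should feed back into next year's safety margin.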

Key Takeaways

Automation is a force multiplier: it reduces peak readiness headcount while improving reliability.
Full-link testing is the confidence engine: it converts uncertainty into engineering work.
Peak operations is a product: dashboards, checklists, drills, and runbooks are first-class artifacts.
