This is a structured research series on how Alipay scaled Double 11 from early constraints to planet-scale reliability and throughput. It is organized as a hub + phases, so you can read it like a short book.

Reading Paths

Executive overview (10–15 minutes)

  1. Executive Summary

Engineering leadership (60–90 minutes)

  1. Executive Summary
  2. Phase 1 — Timeline
  3. Phase 2 — Architecture
  4. Phase 3 — Operations
  5. Phase 5 — Synthesis

Full technical deep dive (6–10 hours)

Read everything above, then:

  1. Phase 4 — Technology (Overview)
  2. Modern Tech Comparison
  3. Phase 4 — Deep Dive

Series Contents

Executive Summary: Alipay Double 11 Architecture

From 50M CNY to 544K TPS: Lessons in Building Planet-Scale Systems

TL;DR: From 2009 to 2019, Alipay increased peak capacity by ~5,440x and reached 544,000 transactions per second (TPS) while maintaining financial-grade reliability (99.99%) and zero data loss targets (RPO = 0). Three practical lessons stand out:

  1. Design to split (unitization): you cannot scale a monolith forever.
  2. Make confidence deterministic (full-link stress testing): stop guessing and test production paths safely.
  3. Automate operations end-to-end: scale peak readiness without scaling headcount.

The Story: From Crisis to Record

2012: The Breaking Point

The system hit multiple hard limits at once: ...

May 2, 2026 · 4 min · Lê Tuấn Anh

Modern Tech Comparison

This page maps the ideas from the Double 11 architecture story to modern cloud-native equivalents. The goal is not “copy Alipay tooling,” but “copy the patterns.”

1) LDC / Unitization vs Kubernetes Multi-Cluster (Cells)

LDC / unitization is conceptually similar to cell-based architecture: each cell/unit is self-contained for a slice of users/traffic; you scale by adding cells, not by stretching a shared core; and failures are contained where possible.

Modern equivalents: ...

May 2, 2026 · 3 min · Lê Tuấn Anh

Phase 1: Timeline and Scale Evolution

Overview

Double 11 (Singles’ Day) began in 2009 and grew into one of the largest online commerce peak events in the world. For Alipay, it became an annual forcing function: every year demanded new throughput, lower latency, and higher certainty under extreme load.

This phase captures the scaling journey as a sequence of constraints and responses: from early manual scaling, to an architecture reset (unitization/LDC), to deterministic operations (full-link stress testing), and finally to cloud-native efficiency. ...

May 2, 2026 · 3 min · Lê Tuấn Anh

Phase 2: Core Architecture (LDC, Unitization, Multi-Active)

This phase focuses on the architecture that enables peak scale while preserving correctness and operational control.

2.1 LDC and Unitization (Cell Architecture)

The core idea: a “unit”

A unit is a self-contained slice of the system that can handle end-to-end business flows for a subset of users/traffic. It is complete in services: the unit has the full set of required services. It is partial in data: data is sharded so each unit owns a subset (e.g., by user-id range).

The key goal is horizontal scaling with isolation: add units to add capacity, and contain failures within a unit when possible. ...

May 2, 2026 · 4 min · Lê Tuấn Anh
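
The “complete in services, partial in data” idea from the Phase 2 excerpt above can be illustrated with a small routing sketch. Everything in it is a hypothetical assumption made for illustration: the unit names, the shard key (user id modulo 100), and the shard ranges are not taken from the series and are not Alipay's actual scheme.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str         # hypothetical unit name
    first_shard: int  # inclusive start of the shard range this unit owns


class UnitRouter:
    """Maps a user id to the unit that owns that user's data shard."""

    def __init__(self, units, shard_count=100):
        self.units = sorted(units, key=lambda u: u.first_shard)
        if not self.units or self.units[0].first_shard != 0:
            raise ValueError("shard ranges must start at shard 0")
        self.starts = [u.first_shard for u in self.units]
        self.shard_count = shard_count

    def shard_of(self, user_id: int) -> int:
        # Assumed shard key: user id modulo the shard count.
        return user_id % self.shard_count

    def route(self, user_id: int) -> Unit:
        # Pick the last unit whose range starts at or before the shard.
        idx = bisect_right(self.starts, self.shard_of(user_id)) - 1
        return self.units[idx]


# Two units: shards 0-49 go to unit-a, shards 50-99 go to unit-b.
router = UnitRouter([Unit("unit-a", 0), Unit("unit-b", 50)])
```

In this sketch, adding capacity means adding a `Unit` with a new `first_shard`, which narrows a neighbor's range instead of enlarging a shared core. A real system would also need to migrate the data for the shards that move, which is out of scope here.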

Phase 3: Operations Playbook

This phase is about how peak performance becomes repeatable. The core claim: peaks are won in operations, not heroics.

3.1 Capacity Planning

Capacity planning for peak events is fundamentally a prediction problem under uncertainty. Common patterns in mature peak readiness: peak forecasting using historical curves + product plans + marketing inputs; safety margins based on known unknowns (dependencies, tail latencies, cache miss storms); bottleneck-first reviews covering DB, network, MQ, hot keys, and “global” dependencies; and explicit downgrade plans that define what can be turned off to protect the payment core.

Good capacity planning is not a spreadsheet; it is a living process with drill-down ownership. ...

May 2, 2026 · 3 min · Lê Tuấn Anh
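
The forecast-plus-safety-margin step described in the Phase 3 excerpt above reduces to simple arithmetic, sketched here. The function name, the parameters, and all numbers are hypothetical; the series does not state specific margins or per-instance throughput.

```python
def instances_needed(forecast_peak_tps: int,
                     per_instance_tps: int,
                     safety_margin_pct: int = 30) -> int:
    """Instances required to serve the forecast peak plus a safety margin.

    Uses integer math so the result does not depend on float rounding.
    """
    if forecast_peak_tps < 0 or per_instance_tps <= 0 or safety_margin_pct < 0:
        raise ValueError("inputs must be non-negative, per_instance_tps > 0")
    # Target load scaled by 100 to keep the percentage as an integer.
    scaled_target = forecast_peak_tps * (100 + safety_margin_pct)
    scaled_capacity = per_instance_tps * 100
    # Ceiling division: a partially used instance still has to exist.
    return -(-scaled_target // scaled_capacity)
```

For example, a hypothetical forecast of 10,000 TPS with a 30% margin and 250 TPS per instance yields 52 instances. The per-instance figure is only trustworthy if it comes from measurement, which is where the series' emphasis on full-link stress testing comes in.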

Phase 4: Deep Dive (Technology Internals)

This document is a deep-dive companion to Phase 4. It focuses on the internal mechanics that typically define the hard limits of peak systems.

It is intentionally technology-heavy. If you want the “why” first, start with the Executive Summary, Phase 2: Architecture, and Phase 3: Operations.

4.D1 RPC and Service Framework Evolution (Why RPC becomes a platform)

At large scale, RPC is not only “a protocol.” It is an operating model: ...

May 2, 2026 · 3 min · Lê Tuấn Anh

Phase 4: Technology Overview

This phase describes the main technology layers often referenced in Double 11 discussions: middle platform, risk control, payment processing, and the distributed application stack.

4.1 “Middle Platform” (Platform as a Reusable Layer)

The “middle platform” idea can be summarized in two points. First, build a reusable, standardized platform layer for common capabilities (data, risk, identity, payments, messaging, observability). Second, let product teams build “front platforms” faster by reusing shared platform primitives.

The value proposition is speed and consistency: ...

May 2, 2026 · 2 min · Lê Tuấn Anh

Phase 5: Synthesis and Lessons Learned

This phase consolidates the series into reusable lessons. Treat it as the “what to copy” section.

5.1 Decision Timeline (What changed and why)

A simplified view of the decade-long evolution:

  1. Distributed architecture foundation: a prerequisite for sustained scaling.
  2. Unitization / LDC: the key unlock that turns vertical ceilings into horizontal growth.
  3. Automated full-link stress testing: converts uncertainty into deterministic confidence.
  4. Elastic architecture: shifts peak preparedness from year-round reservation to controlled elasticity.
  5. Cloud-native era: standardizes delivery and improves operational efficiency.

5.2 Patterns (What worked)

1) Unitization (Cells / Shards with service boundaries)

Split work and state into independent units. Route traffic to maintain locality. Scale by adding units, not stretching shared cores.

2) Deterministic readiness (Full-link testing)

Treat peak readiness as a measurable artifact. Test the real production path under isolation. Make bottleneck ownership explicit.

3) Reliability controls as first-class design

Circuit breakers, throttling, and degrade strategies. Clear “critical path” vs “best-effort paths.” Operational dashboards and runbooks as part of the product.

4) Automation as a product

Reduce manual peak war rooms. Build repeatable pipelines and checklists. Invest in observability and controlled rollouts.

5.3 Anti-patterns (What to avoid)

Shared-state scaling: pushing a central DB/core until it breaks. Retry storms: unbounded retries under overload. Implicit degrade: “we’ll decide during the incident” instead of pre-tested plans. Over-indexing on benchmarks: ignoring workload shape, tail latency, and failure semantics.

5.4 KPI evolution (What mature teams measure)

Peak systems mature when teams shift from vanity metrics to operational truth: ...

May 2, 2026 · 2 min · Lê Tuấn Anh
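
The “retry storms” anti-pattern in the Phase 5 excerpt above has a common counter-measure: cap the number of attempts and spread retries out with jittered exponential backoff. The sketch below is a generic illustration of that technique, not something the series attributes to Alipay; the function name and default values are hypothetical.

```python
import random
import time


def call_with_bounded_retries(op, max_attempts=3, base_delay=0.05,
                              max_delay=1.0, sleep=time.sleep,
                              rng=random.random):
    """Call op(); on failure, retry at most max_attempts - 1 times.

    Waits use exponential backoff with full jitter, so many clients that
    fail at the same moment do not all retry at the same moment.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except Exception:
            if attempt == max_attempts:
                raise  # budget exhausted: surface the last error
            # Backoff ceiling doubles per attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * ceiling)  # full jitter: uniform in [0, ceiling)
```

The `sleep` and `rng` parameters exist only so the behavior can be exercised without real waiting. In practice a sketch like this would usually retry only on errors known to be transient, and would sit behind a circuit breaker or throttle so that an overloaded dependency sees less traffic, not more.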

Research Index

This index explains what each document covers and suggests reading paths depending on your time budget and role.

Core Documents (Series Pages)

Page | What it covers | Best for
Executive Summary | The story, the numbers, and the 3 pillars | Execs, CTO/VP Eng, PMs
Phase 1: Timeline | Key milestones and why the architecture had to change | Everyone
Phase 2: Architecture | LDC/unitization, multi-active, DB and MQ foundations | Architects, senior engineers
Phase 3: Operations | Capacity planning, full-link stress testing, incident command | Engineering leadership, SRE
Phase 4: Technology Overview | Middle platform, payment flow, risk control, SOFAStack | Architects, ICs
Phase 4: Deep Dive | Internals: RPC, MQ, storage engine, transactions, ML risk control | Deep technical readers
Modern Tech Comparison | Mapping to Kubernetes, Kafka/Pulsar, gRPC, modern DBs, service mesh | Teams modernizing today
Phase 5: Synthesis | Patterns, anti-patterns, KPIs, decision framework | Leaders + architects

Reading Paths (By Time Budget)

10–15 minutes (Executive)

  1. Executive Summary

60–90 minutes (Engineering leadership)

  1. Executive Summary
  2. Phase 1: Timeline
  3. Phase 2: Architecture
  4. Phase 3: Operations
  5. Phase 5: Synthesis

6–10 hours (Full technical deep dive)

Read everything above, then: ...

May 2, 2026 · 2 min · Lê Tuấn Anh