← Series hub Next →

From 50M CNY to 544K TPS: Lessons in Building Planet-Scale Systems

TL;DR

From 2009 to 2019, Alipay increased peak capacity by ~5,440x and reached 544,000 transactions per second (TPS) while maintaining financial-grade reliability (99.99%) and zero data loss targets (RPO = 0).

Three practical lessons stand out:

  1. Design to split (unitization): you cannot scale a monolith forever.
  2. Make confidence deterministic (full-link stress testing): stop guessing and test production paths safely.
  3. Automate operations end-to-end: scale peak readiness without scaling headcount.

The Story: From Crisis to Record

2012: The Breaking Point

The system hit multiple hard limits at once:

  • Oracle could not scale further in a cost-effective way.
  • Connection and throughput ceilings became existential.
  • Even power and cooling constraints limited data center expansion.

The takeaway: the next order of magnitude required an architectural reset, not incremental tuning.

2013: The “Impossible” Goal

Alipay set a goal of 20,000 payments per second with less than a year to deliver.

The breakthrough was LDC (Logical Data Center) and unitization:

  • Split the system into independent units with their own services, data, and cache boundaries.
  • Scale horizontally by adding units, rather than pushing a single shared core.

2019: The Record

By 2019, Alipay reached 544,000 TPS (reported peak), far above typical card network peaks, while still treating correctness and durability as non-negotiable.

Key Results and Metrics

Growth Trajectory (illustrative)

2009:  ████                                (~100 TPS, early events)
2012:  ████████████████                    (~2K TPS, crisis)
2013:  ████████████████████████████        (20K TPS, LDC debut)
2019:  ████████████████████████████████████████████████ (544K TPS)
       0      100K     200K     300K     400K     500K

Confidence Evolution

YearPre-event confidenceWhat changed
2013~60%Manual testing + hope
2014+~95%Automated full-link stress testing

Cost Efficiency (directional)

MetricBefore elastic maturityAfter elastic maturityEffect
Cost per transactionBaseline~50%Material reduction
Capacity preparationYear-round reservation1-2 months elasticityLarge savings

The Solution: 3 Pillars

Pillar 1: Unitization (LDC Architecture)

Problem: a shared-state monolith has a hard ceiling.

Solution: LDC + independent units. Each unit is self-contained and scales horizontally.

LDC (Logical Data Center)

RZone 1     RZone 2     RZone 3     ...     RZone N
[Unit A]    [Unit B]    [Unit C]            [Unit N]
  - services   - services   - services        - services
  - data       - data       - data            - data
  - cache      - cache      - cache           - cache

GZone (global): shared config / global coordination
CZone (city):  hot data for cross-city access

Result: add a new RZone/unit -> add capacity, without rewriting the entire system.

Problem: you cannot operate peak events with “we think it’s fine.”

Solution: production-like, end-to-end stress testing (full-link):

  • Replay realistic traffic patterns.
  • Isolate data safely (shadow tables / isolation strategies).
  • Turn unknown failure modes into measurable, reproducible outcomes.

Operational payoff:

  • Early years: large war rooms and manual overnight validation.
  • Later: significantly fewer people can run the same readiness process because it is automated.

Pillar 3: Financial-Grade Database (OceanBase)

Problem: traditional single-node databases become a scaling and reliability bottleneck.

Solution: invest in a distributed, transaction-capable database engineered for finance:

  • Distributed transactions at high throughput.
  • Strong correctness guarantees as a first-class requirement.
  • Engineering focus on storage efficiency and compaction under load.

Business Impact (Why This Matters)

For engineering leadership, the pattern is clear:

  • Scale comes from splitting work and state, not only adding hardware.
  • Reliability comes from operational design and testing discipline, not heroics.
  • Cost improves when elasticity and automation are built into the architecture.

For executives:

  • Peak events become a repeatable operational routine, not a once-a-year gamble.
  • Incident impact is reduced by multi-active design, automated failover, and clear downgrade strategies.

Cultural Notes (How Teams Survive Peak Events)

Double 11 was treated as both an engineering and psychological challenge. One widely cited internal ritual was bringing Guan Yu (Guan Gong) imagery into preparation rooms as a reminder of humility: even with strong engineering, peaks contain uncertainty unless you engineer it away through testing and automation.

Alipay Stack vs Modern Cloud-Native (Quick Take)

If you are building today, you would likely express similar ideas using different tooling:

  • Unitization -> multi-region / cell-based architecture, sharding, multi-cluster patterns.
  • Full-link stress testing -> production-like load testing with strict isolation and observability.
  • OceanBase/RocketMQ/SOFA -> modern distributed DBs, Kafka/Pulsar, gRPC, and service mesh stacks.

The core lesson is stable across eras: architecture plus operations is the product.

Action Items (Practical)

If you’re modernizing a high-throughput system:

  1. Identify the shared-state bottlenecks and define unit boundaries.
  2. Build a safe, automated full-link stress test pipeline.
  3. Treat peak readiness as an operations product with dashboards, drills, and playbooks.
  4. Make capacity and downgrade strategies explicit and tested, not implicit assumptions.