Overview

Double 11 (Singles’ Day) began in 2009 and grew into one of the largest online commerce peak events in the world. For Alipay, it became an annual forcing function: every year demanded higher throughput, lower latency, and greater certainty under extreme load.

This phase captures the scaling journey as a sequence of constraints and responses: from early manual scaling, to an architecture reset (unitization/LDC), to deterministic operations (full-link stress testing), and finally to cloud-native efficiency.

Timeline (Key Milestones)

2009: A modest start, a new kind of peak

  • Event: First major promotion on Taobao Mall (later Tmall).
  • Scale: Dozens of brands; revenue on the order of tens of millions of CNY.
  • Engineering reality: Traffic spiked to several times the baseline; engineers relied heavily on manual interventions and vertical tuning.

2010: Peak preparedness becomes intentional

  • Double 11 entered the routine stability agenda rather than being treated as a one-off surprise.
  • Capacity planning was still heuristic (over-provisioning with large multipliers), but it began to formalize.
  • Early stress testing was largely manual and incomplete.
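
The multiplier-based planning described above can be illustrated with a small arithmetic sketch. The function name, parameters, and numbers are hypothetical and chosen only to show the shape of the heuristic; they are not figures from the event.

```python
import math


def servers_needed(baseline_tps: float, peak_multiplier: float,
                   safety_factor: float, tps_per_server: float) -> int:
    """Heuristic estimate of server count for a peak event.

    All inputs are illustrative assumptions:
    - baseline_tps: normal-day throughput
    - peak_multiplier: how many times the baseline the peak is guessed to reach
    - safety_factor: extra over-provisioning margin on top of that guess
    - tps_per_server: throughput one server is assumed to sustain
    """
    if tps_per_server <= 0:
        raise ValueError("tps_per_server must be positive")
    expected_peak = baseline_tps * peak_multiplier
    provisioned_tps = expected_peak * safety_factor
    # Round up: a fractional server still means one more machine.
    return math.ceil(provisioned_tps / tps_per_server)


if __name__ == "__main__":
    # 1,000 TPS baseline, guessed 5x peak, 3x safety margin, 200 TPS per server.
    print(servers_needed(1000, 5, 3, 200))
```

The weakness of this approach is visible in the inputs: both the peak multiplier and the safety factor are guesses, so the margin compounds cost without adding certainty.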

2012: The scaling crisis (multiple hard limits at once)

Several ceilings collided:

  • Database bottlenecks became existential (cost and scalability limits).
  • Connection limits and throughput ceilings showed that tuning alone could not push the system much further.
  • Physical infrastructure constraints (power/cooling) highlighted that “just add bigger machines” was not a reliable strategy.

Outcome: a clear conclusion that the next order of magnitude required an architectural shift, not incremental optimization.

2013: LDC (Logical Data Center) and unitization debut

Target: 20,000 payments per second with a compressed delivery timeline.

Breakthrough: unitization via LDC.

  • Split the system into independent units with service and data boundaries.
  • Route traffic into the correct unit (cell) to isolate blast radius and scale horizontally.

Result: the first major step from centralized bottlenecks toward horizontally scalable capacity.
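
The routing idea behind unitization can be sketched as follows. This is a minimal illustration, not Alipay's implementation: the shard count, the unit names, and the choice of a user ID modulo as the routing key are all assumptions made for the example.

```python
from dataclasses import dataclass

NUM_SHARDS = 100  # assumption: users are partitioned into 100 logical shards


@dataclass(frozen=True)
class Unit:
    """A self-contained unit (cell) that owns a contiguous range of shards."""
    name: str
    shard_start: int  # inclusive
    shard_end: int    # inclusive


# Hypothetical layout: two units, each owning half of the shards.
UNITS = [
    Unit("unit-a", 0, 49),
    Unit("unit-b", 50, 99),
]


def shard_of(user_id: int) -> int:
    """Map a user to a shard; the same user always lands on the same shard."""
    return user_id % NUM_SHARDS


def route(user_id: int, units=UNITS) -> Unit:
    """Pick the unit that owns the user's shard."""
    shard = shard_of(user_id)
    for unit in units:
        if unit.shard_start <= shard <= unit.shard_end:
            return unit
    raise LookupError(f"no unit owns shard {shard}")


if __name__ == "__main__":
    for uid in (1234507, 88):
        print(uid, "->", route(uid).name)
```

Because each user's requests and data stay inside one unit, a failure in one unit affects only the shards it owns, and capacity grows by adding units and reassigning shard ranges.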

Problem: pre-event confidence was widely described as too low to be acceptable (the system “might survive” rather than “will survive”).

Solution: automated, end-to-end (full-link) stress testing.

  • Test the real production path under controlled isolation.
  • Catch systemic issues (dependencies, bottlenecks, failure cascades) that component tests miss.

Result: confidence rose materially, and peak readiness became repeatable rather than heroic.
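
The source does not describe how "controlled isolation" was achieved. One common approach, assumed here purely for illustration, is to tag synthetic requests and divert their writes to shadow tables so that test data never mixes with real data. All class, function, and table names below are hypothetical.

```python
from dataclasses import dataclass

SHADOW_PREFIX = "shadow_"  # assumption: shadow tables share a naming prefix


@dataclass(frozen=True)
class RequestContext:
    """Per-request metadata carried along the whole call chain."""
    is_stress_test: bool = False


class Storage:
    """In-memory stand-in for a database, keyed by table name."""

    def __init__(self) -> None:
        self.tables: dict[str, list[dict]] = {}

    def table_for(self, ctx: RequestContext, name: str) -> str:
        # Stress-test traffic is redirected to a shadow table.
        return SHADOW_PREFIX + name if ctx.is_stress_test else name

    def insert(self, ctx: RequestContext, table: str, row: dict) -> None:
        self.tables.setdefault(self.table_for(ctx, table), []).append(row)


def pay(ctx: RequestContext, storage: Storage, order_id: str, amount: int) -> None:
    """Same business code path for real and synthetic traffic."""
    storage.insert(ctx, "payments", {"order_id": order_id, "amount": amount})


if __name__ == "__main__":
    storage = Storage()
    pay(RequestContext(), storage, "real-1", 100)
    pay(RequestContext(is_stress_test=True), storage, "test-1", 100)
    print(sorted(storage.tables))
```

The point of the sketch is that the business logic in `pay` is identical for both kinds of traffic; only the storage layer consults the flag, which is what lets a test exercise the production path without touching production data.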

2019: Peak record milestones

  • Reported peaks reached hundreds of thousands of TPS for Alipay.
  • The event was no longer just a performance story; it was an operational and reliability system built to prevent data loss and minimize incident impact.

2020+: Cloud-native era (elasticity and efficiency)

  • Cloud-native adoption was positioned as the fastest path to improved elasticity and operational efficiency.
  • The focus expanded from “survive the peak” to “survive the peak cheaply and repeatedly”: cost per transaction, time-to-prepare, and automation depth became first-class metrics.

What Changed Across the Decade

The most important evolution is not a single technology choice; it is a shift in operating model:

  • From vertical scale to horizontal scale: unitization/cell architecture.
  • From hope-based readiness to test-based confidence: full-link stress testing.
  • From manual peak war rooms to automated peak pipelines: repeatable procedures, checklists, and instrumentation.

Next

Continue to Phase 2: Core Architecture to see how LDC/unitization, multi-active, and the database/messaging layers work together to make peak scale possible.