
This phase focuses on the architecture that enables peak scale while preserving correctness and operational control.

2.1 LDC and Unitization (Cell Architecture)

The core idea: a “unit”

A unit is a self-contained slice of the system that can handle end-to-end business flows for a subset of users/traffic.

  • Complete in services: the unit has the full set of required services.
  • Partial in data: data is sharded so each unit owns a subset (e.g., by user-id range).

The key goal is horizontal scaling with isolation: add units to add capacity, and contain failures within a unit when possible.
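
To make "partial in data" concrete, the following is a minimal sketch of mapping a user to the unit that owns that user's data. It assumes a hypothetical scheme in which the last two digits of the user id form the shard key and each unit owns a contiguous range of keys; the boundaries, unit names, and function names are illustrative, not taken from any real deployment.

```python
from bisect import bisect_right

# Assumption: four units, each owning a contiguous range of a 0-99 shard key.
# Boundaries and names are hypothetical and chosen only for illustration.
UNIT_STARTS = [0, 25, 50, 75]  # inclusive lower bound of each unit's range
UNIT_NAMES = ["unit-1", "unit-2", "unit-3", "unit-4"]


def shard_key(user_id: str) -> int:
    """Derive a shard key (0-99) from the last two digits of the user id."""
    return int(user_id[-2:])


def unit_for(user_id: str) -> str:
    """Return the name of the unit that owns this user's data."""
    # bisect_right finds the first range start greater than the key;
    # the owning range is the one just before it.
    idx = bisect_right(UNIT_STARTS, shard_key(user_id)) - 1
    return UNIT_NAMES[idx]
```

With these assumed ranges, unit_for("1000000024") returns "unit-1" and unit_for("1000000025") returns "unit-2". Adding capacity then means splitting a range and assigning the new range to a new unit, rather than enlarging a shared core.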

LDC zones (conceptual model)

LDC (Logical Data Center) is easiest to understand as three “zones”:

  • RZone (Regional / Unit Zone): the main workhorse units (multiple active).
  • GZone (Global): truly global, low-frequency shared data/control (minimize scope).
  • CZone (City / Common hot data): shared hot data needed frequently across units (optimize latency).

Conceptual sketch:

┌─────────────────────────────────────────────────────────────┐
│                      LDC Architecture                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐       │
│   │   RZone 1   │   │   RZone 2   │   │   RZone N   │       │
│   │  (Region)   │   │  (Region)   │   │  (Region)   │       │
│   ├─────────────┤   ├─────────────┤   ├─────────────┤       │
│   │ • App Layer │   │ • App Layer │   │ • App Layer │       │
│   │ • Data Shard│   │ • Data Shard│   │ • Data Shard│       │
│   │ • Full Svcs │   │ • Full Svcs │   │ • Full Svcs │       │
│   └──────┬──────┘   └──────┬──────┘   └──────┬──────┘       │
│          │                 │                 │              │
│          └─────────────────┼─────────────────┘              │
│                            │                                │
│                  ┌─────────┴─────────┐                      │
│                  │                   │                      │
│           ┌──────┴──────┐     ┌──────┴──────┐               │
│           │   GZone     │     │   CZone     │               │
│           │  (Global)   │     │  (City)     │               │
│           ├─────────────┤     ├─────────────┤               │
│           │ • Config    │     │ • User Info │               │
│           │ • CIF       │     │ • Login     │               │
│           │ • Shared    │     │ • Frequent  │               │
│           └─────────────┘     └─────────────┘               │
└─────────────────────────────────────────────────────────────┘
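
One way to read the three zones is as a classification of data by scope and access frequency. The sketch below encodes that reading as a lookup table. The data class names ("user_sharded", "global_config", "shared_hot") are hypothetical labels invented for this example; only the zone names come from the model above.

```python
from enum import Enum


class Zone(Enum):
    RZONE = "rzone"  # unit zones holding sharded, per-user state
    GZONE = "gzone"  # global, low-frequency shared data and control
    CZONE = "czone"  # shared hot data read frequently across units


# Assumption: every piece of data is tagged with one of these
# hypothetical classes at design time.
DATA_CLASS_ZONE = {
    "user_sharded": Zone.RZONE,
    "global_config": Zone.GZONE,
    "shared_hot": Zone.CZONE,
}


def zone_for(data_class: str) -> Zone:
    """Return the zone that should serve a given class of data."""
    try:
        return DATA_CLASS_ZONE[data_class]
    except KeyError:
        # Refusing unclassified data keeps the GZone from growing by default.
        raise ValueError(f"unclassified data class: {data_class!r}") from None
```

Rejecting unclassified data is a deliberate choice in this sketch: it reflects the goal of keeping the global zone's scope minimal instead of letting it become a catch-all.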

Why this matters

Problem                                   Unitization/LDC response
----------------------------------------  ---------------------------------------------
One shared core becomes a bottleneck      Split state and traffic into multiple units
One failure can cascade globally          Localize blast radius to a unit when possible
Scaling requires risky “bigger DB” moves  Scale by adding units + partitions
Cross-region latency and DR are hard      Multi-active by design, not an afterthought

2.2 Database Layer: OceanBase (Why the DB became a pillar)

At peak scale, the database is not a component; it is a platform. The architectural thesis is:

  • If the database cannot scale horizontally with strong correctness, the rest of the architecture becomes fragile.
  • Financial systems require durable correctness under concurrency, not just throughput benchmarks.

OceanBase is often described as a distributed SQL system that:

  • Partitions data into table partitions / shards.
  • Replicates partitions across zones for high availability.
  • Uses consensus to maintain correctness under failures.
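
The correctness guarantee behind replicated partitions rests on simple majority arithmetic. The sketch below shows that arithmetic only. It is a generic illustration of the majority-quorum rule that consensus protocols rely on, not a description of OceanBase internals, and the function names are hypothetical.

```python
def majority(replica_count: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replica_count // 2 + 1


def is_committed(acks: int, replica_count: int) -> bool:
    """A write counts as committed once a majority has acknowledged it."""
    return acks >= majority(replica_count)


def tolerated_failures(replica_count: int) -> int:
    """Replicas that can fail while a majority can still be formed."""
    return replica_count - majority(replica_count)
```

With three replicas, a write needs two acknowledgements and the partition survives one replica failure; with five replicas it needs three and survives two. Because any two majorities overlap in at least one replica, a committed write cannot be silently lost when a minority fails.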

The architectural implication: “scale out” becomes normal for the database, not a once-a-year emergency.

2.3 Messaging and Asynchronous Boundaries

Large peak systems rely on message-driven decoupling to:

  • Loosen coupling between services during spikes.
  • Buffer bursty workloads.
  • Enable degrade strategies: keep the core payment path clean while deferring non-critical work.

At this scale, messaging is typically treated as a reliability layer:

  • Clear topic/event naming and contracts.
  • Back-pressure and throttling strategies.
  • Retries, DLQs, and idempotency by design.
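
The last item can be sketched as a consumer that deduplicates by message id, retries a bounded number of times, and parks messages that keep failing in a dead-letter list. Everything here is an assumption made for illustration: the Message and Consumer classes are hypothetical, and the in-memory set stands in for a durable deduplication store that a real system would update together with the business write.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Message:
    message_id: str
    payload: dict


@dataclass
class Consumer:
    handler: Callable[[dict], None]
    max_attempts: int = 3
    # Assumption: in-memory state; a real system would persist both.
    processed_ids: set = field(default_factory=set)
    dead_letters: list = field(default_factory=list)

    def consume(self, msg: Message) -> str:
        # Idempotency: a redelivered message is acknowledged without rework.
        if msg.message_id in self.processed_ids:
            return "duplicate"
        # Bounded retries: try the handler up to max_attempts times.
        for _ in range(self.max_attempts):
            try:
                self.handler(msg.payload)
            except Exception:
                continue  # failed attempt; retry if attempts remain
            self.processed_ids.add(msg.message_id)
            return "processed"
        # Retries exhausted: park the message so it does not block the queue.
        self.dead_letters.append(msg)
        return "dead-lettered"
```

The sketch omits backoff between attempts to stay short. The point is the shape: duplicates are cheap, failures are bounded, and a poisoned message ends up somewhere an operator can inspect it.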

2.4 Reliability Patterns (What keeps peaks boring)

Regardless of the exact implementation, the patterns are recognizable in most peak systems:

  • Multi-active: multiple active regions/cells; avoid single-region dependency.
  • Traffic routing: route requests to the correct unit; enforce locality.
  • Circuit breakers / throttling / degradation: preserve the payment core by shedding optional load.
  • Isolation by design: separate critical paths from non-critical paths.
  • Operational determinism: everything above is validated via full-link testing (Phase 3).
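
Degradation by shedding optional load can be sketched as an admission check that reserves headroom for critical requests. The class below is a hypothetical illustration: the two priority labels, the capacity figure, and the 70 percent threshold are assumptions, and a production limiter would also have to be thread-safe.

```python
CRITICAL = "critical"  # e.g. the core payment path
OPTIONAL = "optional"  # e.g. recommendations or non-essential notifications


class LoadShedder:
    """Admit requests by priority, reserving headroom for critical traffic."""

    def __init__(self, capacity: int, optional_percent: int = 70):
        self.capacity = capacity
        # Optional traffic is admitted only below this share of capacity.
        self.optional_limit = capacity * optional_percent // 100
        self.in_flight = 0

    def try_admit(self, priority: str) -> bool:
        limit = self.capacity if priority == CRITICAL else self.optional_limit
        if self.in_flight >= limit:
            return False  # shed: the caller should degrade or fail fast
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Call when an admitted request finishes."""
        self.in_flight = max(0, self.in_flight - 1)
```

With a capacity of 10 and the default threshold, optional requests are rejected once 7 requests are in flight, while critical requests are still admitted until all 10 slots are in use.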

2.5 Cloud-native evolution (architecture → efficiency)

Once the architecture supports horizontal scaling and isolation, cloud-native adoption can improve:

  • Elastic capacity (shorter provisioning windows).
  • Operational automation (repeatable environments and deploys).
  • Cost per transaction (efficiency becomes measurable and improvable).

Key Takeaways

  1. Unitization is the scaling unlock: it turns vertical ceilings into horizontal growth.
  2. The database must be designed for peak correctness: correctness and durability are part of the product.
  3. Messaging is a reliability primitive: it is not only “async”; it is a way to control load at peak.
  4. Architecture only works when operations are deterministic: that’s Phase 3.
