This deep-dive research series explores the backend architecture of PayPay, Japan’s leading mobile payment platform, with over 70 million users and 7.8 billion annual transactions. We analyze how the team handles massive traffic spikes during promotional campaigns, ensures strict ACID data consistency, operates a reliable GitOps platform across 100+ microservices, and, as of 2025, is becoming AI-native.

Series Contents

Executive Summary: PayPay's Engineering Evolution

Context: From Zero to Japan’s Payment Infrastructure
PayPay launched in October 2018 as a joint venture between SoftBank, Yahoo! JAPAN, and India’s Paytm, a company already battle-hardened by the chaos of scaling mobile payments across hundreds of millions of Indian users. The Paytm DNA gave PayPay a head start, but Japan’s market presented its own unique challenges: extremely high consumer expectations for reliability, strict financial regulation, and a population that had been largely cash-dependent. ...

May 5, 2026 · 5 min · Lê Tuấn Anh

Part 1 — The Foundation: Microservices & GitOps

Bounded Contexts & Microservices
When PayPay launched, the architecture needed to be flexible enough to iterate rapidly while remaining stable enough to handle financial transactions at scale. They adopted a microservices architecture hosted entirely on AWS, organized around the principles of Domain-Driven Design (DDD). Instead of a massive monolith where a bug in the coupon service could take down payment processing, the system is divided into logical business domains (Bounded Contexts). Each domain owns its data model, its API contracts, and its deployment lifecycle, completely independently: ...

May 5, 2026 · 5 min · Lê Tuấn Anh

Part 2 — Handling the Surge: Event-Driven & Kafka

The Danger of Synchronous Processing
During a massive campaign launch, such as a sudden 50% cashback flash event or a billion-yen giveaway, the TPS (Transactions Per Second) can jump 100x in a matter of seconds. Millions of users open the app simultaneously, see the promotion banner, and tap “Pay” within the same 30-second window. If the architecture is purely synchronous:

User App → API Gateway → Payment Service → Ledger Service → Database

That spike instantly exhausts database connection pools. The Ledger Service times out waiting for connections. Those timeouts cascade back to the Payment Service and from there to the API Gateway, which returns errors to every user. The app effectively crashes just when it has the most users and the most revenue potential. This is the scenario that broke PayPay in December 2018. ...
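The pool-exhaustion step in the synchronous chain can be illustrated with Little's law (average in-flight requests = arrival rate × time each request holds a connection). The sketch below is a back-of-the-envelope model, not PayPay's capacity planning: the baseline TPS, the per-write latency, and the pool size are hypothetical numbers chosen for illustration, and only the 100x spike factor comes from the text.

```python
def connections_needed(tps: float, avg_latency_s: float) -> float:
    """Little's law: average concurrent connections = arrival rate * hold time."""
    return tps * avg_latency_s


def pool_is_exhausted(tps: float, avg_latency_s: float, pool_size: int) -> bool:
    """True when steady-state demand exceeds the connection pool's capacity."""
    return connections_needed(tps, avg_latency_s) > pool_size


# Hypothetical figures, for illustration only.
BASELINE_TPS = 500      # assumed normal load
SPIKE_FACTOR = 100      # "100x" spike described in the text
DB_LATENCY_S = 0.02     # assumed 20 ms per ledger write
POOL_SIZE = 200         # assumed total database connections

if __name__ == "__main__":
    for label, tps in (("baseline", BASELINE_TPS),
                       ("campaign", BASELINE_TPS * SPIKE_FACTOR)):
        needed = connections_needed(tps, DB_LATENCY_S)
        exhausted = pool_is_exhausted(tps, DB_LATENCY_S, POOL_SIZE)
        print(f"{label}: ~{needed:.0f} connections needed, exhausted={exhausted}")
```

Under these assumed numbers the baseline needs roughly 10 connections while the spike needs roughly 1,000, five times the pool. The model is optimistic: in a real system, latency rises as the database saturates, so demand for connections grows even faster than the arrival rate.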

May 5, 2026 · 6 min · Lê Tuấn Anh

Part 3 — The Data Layer: From Aurora to TiDB

The Relational Database Bottleneck
When PayPay launched, AWS Aurora (MySQL compatible) was the obvious choice for the payment ledger. Aurora is managed, reliable, and well-understood. It scales read capacity easily through Read Replicas. For a startup under pressure to ship, it was the right decision. As PayPay grew to tens of millions of users and transaction volumes climbed through each successive campaign, two problems became unavoidable.

Problem 1: The Write Bottleneck. Aurora’s replication model is fundamentally single-primary. All write operations (every payment, every balance update, every ledger entry) must go through a single primary node. You can add as many Read Replicas as you want; the write throughput ceiling is determined entirely by the largest available Aurora instance class. PayPay hit that ceiling. Specifically, binlog processing became the binding constraint: Aurora’s MySQL binary log, which feeds binlog-based replication (Aurora’s in-cluster Read Replicas share the storage layer instead), could not keep up with the write volume during major campaigns. ...

May 5, 2026 · 6 min · Lê Tuấn Anh

Part 4 — Operations: SRE & Resilience

Designing for Failure
At PayPay’s scale (100+ microservices, thousands of Kubernetes pods, a distributed TiDB cluster spanning three Availability Zones, Kafka clusters under constant write pressure), hardware failure, network partitions, and pod crashes are not edge cases. They are daily operational reality. The architecture must not merely tolerate failures; it must be designed to embrace and absorb them without service disruption. The Platform SRE team owns this problem. Their mandate is to ensure that the payment platform maintains 99.99%+ availability, which means less than 53 minutes of downtime per year, regardless of what the underlying infrastructure does. ...
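The "less than 53 minutes" figure follows directly from the 99.99% target. As a quick check, here is a minimal sketch of the arithmetic; the function name is illustrative, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability_pct: float) -> float:
    """Maximum yearly downtime, in minutes, permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Compare a few common targets; 99.99% is the one cited in the text.
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% -> {budget:.2f} minutes/year")
```

For 99.99% this gives about 52.56 minutes per year, which is where the 53-minute bound comes from. Each additional nine cuts the budget by a factor of ten.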

May 5, 2026 · 8 min · Lê Tuấn Anh

Part 5 — Scaling for Billion-Yen Campaign Traffic

Why Campaigns Are the Ultimate Stress Test
For most software systems, traffic grows gradually, and engineering teams have time to react. For PayPay, traffic growth is instantaneous and scheduled: the moment a billion-yen cashback campaign goes live at noon on a Friday, millions of Japanese users simultaneously open the app, see the promotion banner, and tap “Pay.” This is not a gradual ramp. This is a thundering herd: a near-instantaneous traffic spike that the infrastructure must absorb in seconds, or users see errors. And for PayPay, an error during a campaign is not just a bad user experience: it is a front-page news event. The December 2018 campaign that exhausted ¥10 billion in 10 days also generated significant negative press coverage for security issues and system instability. The engineering organization internalized a single lesson from that event: campaigns must be survived by design, not by luck. ...

May 5, 2026 · 7 min · Lê Tuấn Anh

Part 6 — PayPay Goes AI-Native: LLM Hub & RAG (2025)

Why a Payment Platform at Scale Needs an AI Architecture
In 2024, PayPay crossed 70 million registered users. At that scale, the support surface becomes enormous: millions of users with questions about payments, delinquent accounts, transaction disputes, and product features. The fraud detection surface grows with every new user and transaction pattern. The internal engineering organization, with hundreds of engineers across multiple business units, generates thousands of decisions per week that benefit from knowledge retrieval and synthesis. ...

May 5, 2026 · 8 min · Lê Tuấn Anh