The Problem: Pushing Instant Notifications to Millions of Devices
Answer-first: Uber’s RAMEN system maintains persistent gRPC bidirectional streams to every active driver app. When a match is made, the ride offer travels from the matching engine to the driver’s phone in under 100ms — without polling. This is how millions of connections are held open simultaneously without crashing the backend.
When DISCO decides to match you with Driver John Doe, the system must:
- Send the ride offer to exactly John Doe’s phone (out of millions of connected phones).
- Deliver it in milliseconds (not seconds).
- Ensure the driver receives it even if their 4G connection is weak.
- Simultaneously push the driver’s location back to your app so you can watch the car move on the map.
There are two main approaches: Polling (asking continuously) and Push (proactively sending).
Polling vs. Push
Polling (The Old Way — Inefficient)
The Driver App asks the server every 3 seconds: "Do you have any rides for me?"
GET /api/v1/offers?driver_id=abc123
→ Response: { "offers": [] } ← Nothing here
GET /api/v1/offers?driver_id=abc123 (3 seconds later)
→ Response: { "offers": [] } ← Still nothing
GET /api/v1/offers?driver_id=abc123 (3 seconds later)
→ Response: { "offers": [...] } ← Got a ride! But it's already 0-3 seconds delayed
The Issues:
- 5 million drivers × 1 request/3s = 1.67 million requests/second (just asking)
- 99% of requests return empty → Massive waste of server resources
- Latency of 0-3 seconds → Another driver might accept the ride first
- Battery drain: the radio chip must be constantly active
Push (RAMEN — Efficient)
The server maintains an OPEN connection with every driver app.
When a ride offer is ready, the server PROACTIVELY pushes it down instantly.
Driver App ◄═══ gRPC Stream (live connection) ═══► RAMEN Server
Advantages:
- Latency: < 100ms (near-instantaneous)
- No wasted empty requests
- The radio chip is only active when there's actual data to receive
- Significant battery savings
RAMEN — Real-time Asynchronous Messaging Network
RAMEN is the push messaging infrastructure built in-house by Uber. It maintains millions of concurrent persistent connections to push real-time data to rider and driver apps.
Three-Tier Architecture
┌────────────────────────────────────────────────────────────┐
│ RAMEN Architecture │
│ │
│ ┌──────────────────┐ │
│ │ Fireball Service │ "When to push?" │
│ │ (Decision Engine)│ • Consumes Kafka events │
│ │ │ • Evaluates business rules │
│ │ │ • Handles priority, localization │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ API Gateway │ "What to push?" │
│ │ (Payload Builder)│ • Aggregates data from services │
│ │ │ • Builds message payloads │
│ │ │ • Serializes (Protobuf) │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ RAMEN Server │ "How to push?" │
│ │ (Delivery Layer) │ • Manages millions of connections │
│ │ │ • Routes to the correct device │
│ │ │ • Guarantees at-least-once delivery │
│ └──────────────────┘ │
│ │ │
│ ▼ │
│ Mobile Devices (Millions) │
└────────────────────────────────────────────────────────────┘
The Evolution of the Transport Protocol
Generation 1: Server-Sent Events (SSE) over HTTP/1.1
Client ──── HTTP/1.1 Connection ────► RAMEN Server
◄═══ SSE (Server → Client only) ═══
Characteristics:
✓ Simple, browser-friendly
✗ Unidirectional: only Server → Client
✗ ACK (acknowledgments) must be sent via separate HTTP POST requests (wasting connections)
✗ Head-of-line blocking: large messages block smaller heartbeats
✗ Text-based JSON: heavy payloads
Generation 2: gRPC over QUIC/HTTP3 (Current)
Client ◄══ gRPC Bidirectional Stream ══► RAMEN Server
(HTTP/3 + QUIC)
Characteristics:
✓ Full-duplex: both sides send data simultaneously on the same connection
✓ Binary framing (Protobuf): small payloads, low CPU usage
✓ Multiplexing: multiple streams on 1 connection, no head-of-line blocking
✓ QUIC: UDP-based, much more resilient on weak mobile networks
✓ ACKs sent right on the current stream (no separate connection needed)
✓ Connection migration: switching networks (WiFi → 4G) doesn't break the connection
Scalability: Managing Millions of Connections
Apache Helix + ZooKeeper
RAMEN cannot run on a single server — it requires a cluster of hundreds of servers, each holding tens of thousands of live connections.
RAMEN Cluster Management:
ZooKeeper: Stores topology metadata (which servers are alive?)
Apache Helix: Manages sharding and automatic rebalancing
User UUID "abc123" → hash() → Shard 42 → RAMEN Server #7
User UUID "def456" → hash() → Shard 18 → RAMEN Server #3
If server #7 dies:
1. Helix detects it (heartbeat timeout)
2. Helix moves Shard 42 to server #9
3. Client "abc123" reconnects → Load Balancer → server #9
4. Continues receiving messages seamlessly
Stateful Servers — The Unique Challenge
Unlike standard REST API servers (which are stateless), RAMEN servers are stateful — each server holds specific TCP/gRPC sockets for specific users. Routing must be exact: a message for user “abc123” must reach the exact server holding that user’s socket.
Routing Flow:
Fireball: "Push ride offer to driver abc123"
│
▼
Routing Layer: hash("abc123") → Server #7
│
▼
Server #7: Finds the socket for abc123 in memory → Pushes the message
Ensuring Reliability
At-Least-Once Delivery
Mobile networks are notoriously unreliable — 4G signals can drop for a few seconds and return. RAMEN guarantees that a message is delivered at least once by using:
Persistence Layer:
Cassandra (Durable Storage) Redis (In-Memory Cache)
┌────────────────────┐ ┌────────────────────┐
│ Source of truth │ │ Absorb traffic │
│ Stores messages │◄───────│ bursts │
│ permanently for │ │ Thundering herd │
│ retries │ │ protection │
└────────────────────┘ └────────────────────┘
Flow:
1. Message arrives → Written to Cassandra (backup)
2. Cached in Redis (fast access)
3. Pushed via gRPC stream to the device
4. Device sends an ACK
5. If no ACK is received within 10s → Retry from Cassandra
6. Receives ACK → Marks as delivered
Sequence Numbers — Handling Reconnects
When a client loses its connection and reconnects, how does the server know which messages it has already received?
Every message has an incrementing sequence number:
Server → Client:
[seq=1] Ride offer
[seq=2] ETA update
[seq=3] Driver location ← Connection lost here
Client reconnects:
"Last received: seq=2"
Server:
→ Resends everything from seq=3 onwards (doesn't resend seq=1, seq=2)
[seq=3] Driver location (retry)
[seq=4] Driver location (new)
...
Connection Draining — Preventing Thundering Herds
When deploying new code or scaling down a cluster, RAMEN cannot simply drop millions of connections at once — this would cause a thundering herd (millions of clients reconnecting simultaneously, crashing the entire system).
Graceful Shutdown Flow:
1. RAMEN Server #7 needs to shut down
2. Server #7 stops accepting NEW connections
3. Sends a "Graceful Disconnect" to all currently connected clients
Message: { type: "DISCONNECT", backoff_hint_ms: random(1000, 30000) }
4. Each client receives a different backoff_hint:
- Client A: waits 2.3 seconds, then reconnects
- Client B: waits 15.7 seconds, then reconnects
- Client C: waits 8.1 seconds, then reconnects
5. Reconnections are spread evenly over 30 seconds → no thundering herd
Fallback: Silent Push Notifications
When a driver app is pushed to the background by the operating system (due to Android/iOS power management), the gRPC stream will be closed. In this scenario, RAMEN uses Silent Push Notifications via APNs (Apple) / FCM (Google) to “wake up” the app:
Primary Flow (App in foreground):
RAMEN Server ══► gRPC Stream ══► Driver App ✓ (< 100ms)
Fallback Flow (App in background):
RAMEN Server ──► FCM/APNs ──► OS wakes app ──► App reconnects gRPC
(Takes 1-5 seconds)
Grab’s Approach: WebSocket + Istio
Grab didn’t build a massive custom system like RAMEN; they use a simpler approach:
- WebSockets for real-time bidirectional communication.
- Istio Service Mesh manages routing, load balancing, and mTLS.
- FCM/APNs act as fallbacks when the app is backgrounded.
Pros: Simpler to maintain, uses open standards. Cons: WebSockets over HTTP/1.1 suffer from head-of-line blocking and aren’t as robust as gRPC/QUIC on weak networks.
Summary of the End-to-End Real-Time Flow
THE COMPLETE REAL-TIME PIPELINE:
① Driver moves
→ GPS Sensor → Kalman Filter → Batches 3 points
→ gRPC Stream → Load Balancer → Location Service
② Location Service
→ Converts GPS → H3 Index
→ Produces to Kafka "driver.location.updates"
③ Kafka → Consumers:
├── Redis GEO (updates driver's location)
├── Flink (calculates supply-demand → Surge Pricing)
└── Analytics Pipeline (Data Lake)
④ Rider requests a car
→ Demand Service → Kafka "ride.requests"
⑤ DISCO Matching Engine
→ Queries Redis (finds nearby drivers)
→ Routing Service (calculates actual ETA)
→ Hungarian Algorithm (batched matching)
→ Selects the optimal driver
⑥ RAMEN Push
→ Fireball (decision) → API Gateway (payload)
→ RAMEN Server → gRPC Stream → Driver Phone
→ "You have a new ride request!"
⑦ Driver accepts
→ Trip Service → Kafka "ride.status.changes"
→ RAMEN Push → Rider App: "Your driver is on the way!"
→ Location stream starts pushing driver's position to the Rider App
→ Car moves smoothly on the map 🚗
Total elapsed time: < 2 seconds
Official Reference Sources
| Source | Content |
|---|---|
| Uber Eng: H3 Hexagonal Index | Hexagonal gridding algorithm |
| Uber Eng: RAMEN | Push messaging architecture |
| Uber Eng: DISCO | Matching Engine |
| Uber Eng: DeepETA | ML model for ETA prediction |
| Grab Eng: Fulfilment Platform | Dispatch platform architecture |
| Grab Eng: DispatchGym | RL framework for dispatch |
| Lyft Eng: Real-time Map Matching | Kalman Filters & Map Matching |
| H3 Documentation | API reference for H3 |
| Google S2 Geometry | API reference for S2 |
Congratulations on completing the series! You now deeply understand every architectural layer behind that smoothly moving car on your map. From the GPS sensor → Kalman Filter → Kafka → H3 → DISCO → RAMEN → Your App. Every layer represents a fascinating distributed engineering challenge.