
Designing a massive-scale chat application like WhatsApp, Facebook Messenger, or WeChat is a challenging system design problem. It requires a deep understanding of real-time protocols, connection management, and data consistency. This guide walks you through designing a system that supports 1-on-1 and group chats for billions of users.
Table of Contents
Open Table of Contents
- Interview Framework: How to Approach This Problem
- Step 1: Clarifying Requirements
- Step 2: Core Assumptions and Constraints
- Step 3: High-Level Architecture
- Step 4: The Hardest Problem - Protocol Selection
- Step 5: Key Technical Decision - Connection Management
- Step 6: Database Design and Storage
- Step 7: Scaling the System
- Step 8: Security and Permissions
- Step 9: Handling Edge Cases
- Step 10: Performance Optimizations
- Real-World Implementations
- Common Interview Follow-Up Questions
- Conclusion
- References
- YouTube Videos
Interview Framework: How to Approach This Problem
When designing a Chat System:
- Clarify scope: 1:1 vs Group? Media support? E2EE?
- Focus on Real-time: Latency is king.
- State Management: Online/Offline status is tricky.
- Delivery Guarantees: Sent -> Delivered -> Read.
Key mindset: This is a stateful system (unlike typical stateless REST APIs). Managing open connections is the bottleneck.
Step 1: Clarifying Requirements
Questions to Ask the Interviewer
Q: Is this 1-on-1 chat or group chat?
- Interviewer: Both. Group chats can have up to 256 members.
Q: What about media files (images/video)?
- Interviewer: Yes, support sending small media files.
Q: Do we need to store chat history forever?
- Interviewer: Yes, users should see history across devices.
Q: Is End-to-End Encryption (E2EE) required?
- Interviewer: For this design, let’s assume server-side encryption (easier to design first), but mention E2EE.
Functional Requirements
- Send/Receive Messages: Real-time delivery.
- Delivery Status: Sent (tick), Delivered (double tick), Read (blue tick).
- Presence: Show “Online” or “Last Seen”.
- Cross-device Sync: Chat history syncs between Phone and Web.
Non-Functional Requirements
- Low Latency: <100ms delivery.
- High Availability: 99.999%.
- Consistency: Messages must be ordered correctly (SEQ check).
Step 2: Core Assumptions and Constraints
- DAU: 2 Billion users.
- Messages per day: 100 Billion.
- Concurrency: 10M concurrent connections per region.
- Storage: Messages are text (small), media is large (blob storage).
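The constraints above translate into rough capacity numbers worth stating out loud in an interview. A quick back-of-envelope sketch (the 2x peak factor and 100-byte average message size are assumptions):

```python
# Back-of-envelope estimates from the assumed constraints above.
messages_per_day = 100_000_000_000
seconds_per_day = 24 * 60 * 60  # 86,400

avg_qps = messages_per_day / seconds_per_day  # ~1.16M messages/sec on average
peak_qps = avg_qps * 2                        # assumed 2x peak-to-average factor

# Assume ~100 bytes per text message (ids, timestamps, short text).
daily_storage_bytes = messages_per_day * 100  # ~10 TB/day of message data
```

Roughly 1.2M messages/sec on average and ~10 TB/day of raw text justify the write-optimized storage choices made later.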
Step 3: High-Level Architecture
“We cannot use standard HTTP Request-Response here. We need a persistent connection.”
System Flow Diagram
flowchart TD
UserA["User A"] -- CP --> CS["Chat Service / Gateway"]
UserB["User B"] -- CP --> CS
CS --> Session["Session Svc (Redis)"]
CS --> MsgDB[("Message DB")]
CS --> Group["Group Svc"]
CS --> Presence["Presence Svc"]
CS -- Async --> Push["Push Notification Svc"]
classDef service fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#000000
classDef storage fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000000
classDef users fill:#f3e5f5,stroke:#4a148c,stroke-width:2px,color:#000000
class UserA,UserB users
class CS,Group,Presence,Push service
class Session,MsgDB storage
(CP = Connection Pool / WebSocket)
Data Flow (1-on-1 Message)
- User A sends message to Chat Service over ongoing WebSocket.
- Chat Service saves msg to Message DB (for history).
- Chat Service queries Session Service: “Which gateway server is User B connected to?”
- Chat Service forwards message to User B’s gateway.
- Gateway pushes message to User B.
- If User B is offline, trigger Push Notification Service.
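The six steps above can be sketched in a few lines. This is an in-memory toy, not an implementation: real systems use Redis for the session cache, Cassandra for the message DB, and RPC between gateways; all names here are illustrative.

```python
# Minimal in-memory sketch of the 1-on-1 message flow.
message_db = []                      # stands in for the Message DB
session_cache = {"userB": "gw-7"}    # UserID -> GatewayNodeID (Session Svc)
push_queue = []                      # stands in for the Push Notification Svc

def push_to_gateway(node_id, user_id, msg):
    # In production this is an RPC to the gateway node holding the socket.
    return f"{node_id} delivered to {user_id}"

def send_message(sender, receiver, text):
    msg = {"from": sender, "to": receiver, "text": text}
    message_db.append(msg)                           # 2. persist for history
    node = session_cache.get(receiver)               # 3. which gateway has B?
    if node:
        return push_to_gateway(node, receiver, msg)  # 4-5. real-time push
    push_queue.append(msg)                           # 6. offline -> push svc
    return "queued for push notification"
```

Note the ordering: the message is persisted before delivery is attempted, so an offline or crashed receiver never loses it.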
Step 4: The Hardest Problem - Protocol Selection
“How do clients communicate with the server?”
Option 1: HTTP Polling / Long Polling
- Polling: Client asks “New messages?” every 1s.
- Bad: Wastes bandwidth, high latency.
- Long Polling: Server holds request open until message arrives.
- Better, but: Still overhead of creating new connections.
Option 2: WebSockets (The Standard)
- How: Bi-directional persistent TCP connection.
- Pros: Low overhead, real-time.
- Cons: Stateful server management.
Option 3: TCP/MQTT (The Mobile Choice)
- MQTT: Lightweight pub-sub protocol optimized for unstable mobile networks and battery life.
- Facebook Messenger uses MQTT.
- Verdict: Use WebSockets for web clients, MQTT for mobile apps to save battery.
Step 5: Key Technical Decision - Connection Management
With 2B users, we can’t have one server. We need a “Connection Grid”.
- Chat Gateway (Stateful): Holds 100k TCP connections each.
- Service Discovery (Zookeeper): Keeps track of which machine holds User A’s connection.
- Session Cache (Redis): Maps UserID -> GatewayNodeID.
When User A sends to User B:
- Lookup User B’s GatewayNodeID in Redis.
- RPC call to that node: “Push this payload to connection ID X”.
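The session-cache lifecycle can be sketched with a plain dict standing in for Redis (the tuple layout and function names are assumptions, not a real API):

```python
# Sketch of the session cache: register on handshake, look up on send,
# remove on disconnect. A dict stands in for Redis here.
session = {}  # user_id -> (gateway_node_id, connection_id)

def on_connect(user_id, gateway_node, connection_id):
    # The gateway registers the mapping when the WebSocket handshake completes.
    session[user_id] = (gateway_node, connection_id)

def route(user_id):
    # Returns (node, connection) to RPC to, or None if the user is offline.
    return session.get(user_id)

def on_disconnect(user_id):
    # Must run on socket close, or messages get routed to a dead connection.
    session.pop(user_id, None)
```

In production these entries carry a TTL refreshed by heartbeats, so a crashed gateway's stale entries expire on their own.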
Step 6: Database Design and Storage
Storage Patterns
WhatsApp stores messages locally on device (traditionally), but apps like Telegram/Messenger store centralized. We are designing the centralized version.
Choice: HBase or Cassandra (Wide-Column NoSQL)
- Why? Extremely high write throughput.
- Access pattern: “Get last 50 messages for ChatID”.
Schema (Cassandra)
Table: Messages
PK: (chat_id, message_id) -- Ordered by message_id (time-sortable)
columns: sender_id, text_content, media_url, status
Note: message_id must be sortable (Snowflake ID or KSUID).
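A time-sortable message_id can follow the common Snowflake bit layout: 41-bit millisecond timestamp, 10-bit node id, 12-bit per-millisecond sequence. A minimal sketch (the custom epoch is an assumption; real deployments also handle clock skew):

```python
import threading
import time

class SnowflakeIds:
    """Simplified Snowflake-style generator: ids sort by creation time."""
    EPOCH_MS = 1_600_000_000_000  # custom epoch, an assumed value

    def __init__(self, node_id):
        self.node_id = node_id & 0x3FF  # 10 bits
        self.seq = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF  # 12-bit sequence
            else:
                self.seq = 0
                self.last_ms = now
            ts = now - self.EPOCH_MS
            return (ts << 22) | (self.node_id << 12) | self.seq
```

Because the timestamp occupies the high bits, Cassandra's clustering order on message_id is automatically chronological.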
Step 7: Scaling the System
Group Chats (Fanout)
When User A sends to a Group (200 people):
- Group Service fetches member list.
- Service iterates list and pushes to each member’s connection.
- Optimization: If multiple members are on the same Gateway Node, send one payload to that node and let the node fan out locally to its connections.
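That fan-out optimization amounts to grouping recipients by gateway node before sending. A sketch (the session-cache shape is the same assumed dict as before):

```python
from collections import defaultdict

def plan_fanout(member_ids, session_cache):
    """Group a group's members by gateway node: one RPC per node,
    not one per member. Members with no session go to the offline list."""
    by_node = defaultdict(list)
    offline = []
    for uid in member_ids:
        node = session_cache.get(uid)
        if node:
            by_node[node].append(uid)
        else:
            offline.append(uid)
    return dict(by_node), offline
```

For a 256-member group concentrated on a handful of nodes, this cuts cross-node traffic from hundreds of RPCs to a few.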
Media Handling
Don’t send binary data over WebSocket/MQTT.
- Upload Image to Object Storage (S3) via HTTP API.
- Get URL.
- Send URL text message over Chat System.
Step 8: Security and Permissions
End-to-End Encryption (E2EE)
- Uses Signal Protocol.
- Server handles encrypted blobs only. It cannot read messages.
- Keys are exchanged initially (Public Key Infrastructure).
Authentication
- Standard JWT/Session Token exchanged upon WebSocket Handshake.
Step 9: Handling Edge Cases
Edge Case 1: Message Ordering
Scenario: Message A sent at 10:00:01, Message B at 10:00:02. Network delays might flip them. Solution: Client assigns a local Sequence Number. Receiver re-orders buffer based on SeqNum before displaying.
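The receiver-side fix is a reorder buffer: hold out-of-order messages and release a run only when the next expected sequence number arrives. A minimal sketch:

```python
class ReorderBuffer:
    """Buffers out-of-order messages; releases them in sequence order."""

    def __init__(self):
        self.expected = 1   # next SeqNum we are allowed to display
        self.pending = {}   # seq -> message, held until the gap fills

    def receive(self, seq, msg):
        """Returns the messages now safe to display, in order."""
        self.pending[seq] = msg
        ready = []
        while self.expected in self.pending:
            ready.append(self.pending.pop(self.expected))
            self.expected += 1
        return ready
```

Message B (seq 2) arriving first is simply buffered; when A (seq 1) lands, both are released in the correct order.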
Edge Case 2: The “Offline” Receiver
Scenario: User B is on a flight. Solution:
- Server acts as buffer. DB stores “Unread”.
- When B connects, Gateway queries DB for “Messages > Last_Ack_ID” and replays them.
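The replay query is straightforward given time-sortable ids. A sketch with a list standing in for the Message DB (in Cassandra this is a clustering-key range scan, not a filter):

```python
def replay_unread(message_db, user_id, last_ack_id):
    """Return everything addressed to user_id newer than their last ack."""
    return [m for m in message_db
            if m["to"] == user_id and m["id"] > last_ack_id]
```

The client only needs to persist one cursor per conversation, its last acknowledged id, to resume correctly after any outage.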
Step 10: Performance Optimizations
- Keep-Alive: Send heartbeat every 30s to prevent TCP connection timeout by cellular carriers.
- Pagination: Only load the last 20 messages on open; infinite scroll fetches older ones (lazy loading).
- Protocol Buffers: Use Protobuf instead of JSON to save bandwidth.
Real-World Implementations
WhatsApp (Erlang)
- Built on Erlang (BEAM VM) for massive concurrency.
- Tuned FreeBSD kernel to handle 2 Million connections per server in 2012!
- Uses Mnesia (Erlang DB) and now Facebook infra.
Discord (Elixir/Rust)
- Uses Elixir (built on Erlang VM) for Gateway.
- Migrated specific hot paths to Rust for performance.
Common Interview Follow-Up Questions
Q: How to handle “Last Seen”?
Answer: “Use a presence service backed by Redis:
- Client sends heartbeat every 20-30 seconds while app is active.
- Gateway updates presence:userId with a short TTL.
- Online status is derived from key freshness, not permanent DB writes.
- Respect privacy settings (hide last seen / contacts-only visibility).
Trade-off: Frequent heartbeats improve freshness but increase battery and network usage.”
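The TTL-based presence scheme can be sketched with a dict of expiry timestamps in place of Redis SET with EX (the 60-second TTL is an assumption):

```python
import time

PRESENCE_TTL = 60.0  # assumed TTL; ~2-3x the heartbeat interval
_presence = {}       # user_id -> expiry timestamp

def heartbeat(user_id, now=None):
    """Called on each client heartbeat; refreshes the expiry."""
    now = time.time() if now is None else now
    _presence[user_id] = now + PRESENCE_TTL

def is_online(user_id, now=None):
    """Online means the key has not yet expired; no explicit 'offline' write."""
    now = time.time() if now is None else now
    expiry = _presence.get(user_id)
    return expiry is not None and expiry > now
```

The key property: a crashed client goes offline automatically when its key expires, with no cleanup logic anywhere.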
Q: How to handle Read Receipts in Groups?
Answer: “I use tiered behavior by group size:
- Small groups: store per-user read ACKs for each message.
- Medium groups: store only aggregate counters plus sender-visible sample.
- Very large channels: disable full per-user receipts to avoid write explosion.
- Batch receipt writes asynchronously to smooth spikes.
This preserves UX where it matters and keeps storage costs under control.”
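The tiering decision itself is a simple policy function. The thresholds below are assumptions for illustration, not numbers from any real system:

```python
def receipt_mode(group_size):
    """Pick a read-receipt storage strategy by group size (assumed cutoffs)."""
    if group_size <= 50:
        return "per_user"   # store every member's read ACK per message
    if group_size <= 1000:
        return "aggregate"  # counters plus a small sender-visible sample
    return "disabled"       # huge channels: no per-user receipts at all
```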
Q: How do you guarantee message order when users reconnect after being offline?
Answer: “Order comes from server-assigned sequence numbers:
- Each conversation has monotonic sequence IDs.
- Offline client stores unsent messages locally with temporary IDs.
- On reconnect, server acks accepted messages and returns missing range.
- Client reorders by server sequence and reconciles temporary IDs.
This handles retries safely and prevents duplicate or out-of-order rendering.”
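The temporary-id reconciliation step can be sketched as a re-keying pass: the server acks each temp id with its assigned sequence number, and the client reorders by that server sequence (data shapes here are assumptions):

```python
def reconcile(local_messages, server_acks):
    """local_messages: {temp_id: text} stored while offline.
    server_acks: {temp_id: server_seq} returned on reconnect.
    Returns the acked messages in server sequence order."""
    by_seq = {server_acks[t]: text
              for t, text in local_messages.items() if t in server_acks}
    return [by_seq[seq] for seq in sorted(by_seq)]
```

Messages missing from server_acks were not accepted and stay pending for retry, which is what makes resends idempotent.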
Q: How would you support multi-device end-to-end encryption securely?
Answer: “I would use per-device identity keys with session ratchets:
- Register every device with its own long-term key pair.
- Establish one encrypted session per sender-device to receiver-device pair.
- Rotate session keys regularly and after suspicious events.
- Store only encrypted payloads and minimal metadata server-side.
Trade-off: Strong security increases key-management complexity, especially for backup and restore flows.”
Q: How do you handle spam and abuse without breaking privacy?
Answer: “Combine metadata signals and user controls:
- Rate-limit new-account outbound messages and group invites.
- Use reputation scores from metadata (send velocity, block/report rates).
- Add friction for risky actions (captcha, temporary cooldown).
- Prioritize user reporting and fast account enforcement pipelines.
This limits abuse while keeping message content encrypted.”
Conclusion
Designing Chat is all about connection management. Statefulness makes scaling harder than typical web apps.
Key Takeaways:
- Use WebSockets/MQTT for persistent connection.
- Cassandra for storing chat history (Write-heavy).
- Redis for transient session data (Who is on which server?).
- Sequence ID for ordering.
References
YouTube Videos
- “Distributed Systems in One Lesson” - Hussein Nasser [https://www.youtube.com/watch?v=Y6Ev8GIlbxc]
- “Microservices Communication Patterns” - IBM Technology [https://www.youtube.com/watch?v=xDuwrtwYHu8]