
System Design Interview: Collaborative Document Editor Like Google Docs

By Pratik Bhuite | 36 min read

Hub: System Design / Interview Questions

Series: System Design Interview Series

Last verified: Feb 10, 2026

Part 1 of 8 in the System Design Interview Series

High-Level System Design: Building Google Docs-Like Collaborative Editors

This guide walks through how to design a collaborative document editing system like Google Docs in a system design interview. We’ll follow the same thought process you’d use when explaining your design to an interviewer, starting with requirements, making explicit assumptions, and building the architecture step by step with clear reasoning for each decision.


Interview Framework: How to Approach This Problem

In a system design interview, when asked to design Google Docs, here’s the structured approach you should follow:

  1. Clarify requirements (5 minutes) - Ask questions, don’t assume
  2. State assumptions (2 minutes) - Make constraints explicit
  3. High-level design (10 minutes) - Draw boxes and arrows
  4. Deep dive (20 minutes) - Focus on the hardest problems
  5. Scale and optimize (10 minutes) - Discuss bottlenecks
  6. Edge cases (3 minutes) - Show thoroughness

Key mindset: Think out loud, explain your reasoning, and involve the interviewer in your decisions. Don’t just draw diagrams silently.

Step 1: Clarifying Requirements

Questions to Ask the Interviewer

Before jumping into design, ask these questions to clarify scope:

Q: What’s the core functionality we need to support?

  • Multiple users editing simultaneously? Yes
  • Rich text formatting (bold, italic, lists)? Yes
  • Images and media embeds? Start with text, nice to have
  • Comments and suggestions? Nice to have, not MVP

Q: What scale are we targeting?

  • How many concurrent users per document? Let’s say up to 50 for MVP
  • How many total users? Start with 10 million users
  • Document size limits? Up to 10,000 words typical, 100,000 max

Q: What are the critical performance requirements?

  • Latency for seeing other users’ changes? Under 200ms globally
  • Availability target? 99.9% for MVP (~8.8 hours downtime/year)

Q: Do we need offline editing?

  • Start with online-only, discuss offline as enhancement

Q: Are we building from scratch or integrating?

  • Assume we can use existing auth systems, focus on collaboration

Functional Requirements (After Clarification)

Based on the answers, here’s what we’ll design:

  1. Real-time multi-user editing - The core challenge
  2. Text editing with basic formatting - Bold, italic, lists, headings
  3. Live cursor positions - Show where other users are typing
  4. Save and version history - Auto-save, ability to restore previous versions
  5. Access control - Owner, editor, viewer permissions
  6. Share via link - Easy sharing mechanism

Non-Functional Requirements

  1. Low latency: Changes visible to others in <200ms
  2. Consistency: All users see the same final document (eventual consistency OK)
  3. Availability: 99.9% uptime
  4. Scalability: Handle 10M users, 50 concurrent editors per doc
  5. Data durability: Zero data loss

Step 2: Core Assumptions and Constraints

Always state these explicitly to the interviewer:

Traffic Assumptions

Total Users: 10 million
Daily Active Users (DAU): 1 million (10% of total)
Concurrent Users: 100,000 (10% of DAU)
Documents Created Daily: 200,000
Average Document Size: 50 KB
Peak Concurrent Editors per Document: 50

Read vs Write Pattern

Read:Write Ratio = 3:1
- Reading (viewing documents): 75% of traffic
- Writing (editing): 25% of traffic
- Most edits are small (typing character by character)

Scale Calculations

Storage Estimation:

Documents: 50 million total
Average size: 50 KB
Total storage: 50M × 50KB = 2.5 TB for documents
With versions (3x): ~7.5 TB
With replication (3x): ~22.5 TB

Bandwidth Estimation:

Concurrent editing sessions: 50,000 during peak
Operations per user per second: 2 (while actively typing)
Total operations per second: 100,000 ops/sec
Average operation size: 100 bytes
Bandwidth: 100K × 100 bytes = 10 MB/sec (80 Mbps)

Database Queries:

Document loads per second: 10,000
Writes per second: 100,000 (operations)
These numbers suggest we need caching and optimized storage
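
The back-of-envelope numbers above can be sanity-checked in a few lines (values taken directly from the assumptions in this step):

```javascript
// Back-of-envelope estimates from the traffic assumptions above
const totalDocs = 50e6; // 50 million documents
const avgDocBytes = 50e3; // 50 KB average
const docStorageTB = (totalDocs * avgDocBytes) / 1e12;

const peakSessions = 50e3; // concurrent editing sessions at peak
const opsPerUserPerSec = 2; // while actively typing
const opsPerSec = peakSessions * opsPerUserPerSec;

const avgOpBytes = 100;
const bandwidthMBps = (opsPerSec * avgOpBytes) / 1e6;

console.log({ docStorageTB, opsPerSec, bandwidthMBps });
// → { docStorageTB: 2.5, opsPerSec: 100000, bandwidthMBps: 10 }
```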

Technology Constraints

Assume we’ll use:

  • WebSocket for real-time communication (explain why over HTTP polling)
  • NoSQL database for flexibility (explain choice later)
  • Redis for caching and pub/sub
  • CDN for static assets

Why these choices?

  • WebSocket: Bidirectional, persistent connection = low latency for real-time
  • NoSQL: Flexible schema for documents, better horizontal scaling
  • Redis: In-memory = fast reads, built-in pub/sub for real-time events

Step 3: High-Level Architecture

Component Overview

“Let me start with a high-level architecture and then we’ll dive deep into the most challenging part - conflict resolution.”

flowchart TD
    ClientLayer(Client Layer<br/>Browser/App) --> LB[Load Balancer / API Gateway<br/>Sticky Sessions]
    LB --> CollabSvc[Collaboration Service<br/>WebSocket]
    LB --> DocSvc[Document Service<br/>REST]
    LB --> AuthSvc[Authentication Service<br/>OAuth]

    CollabSvc --> Redis[(Redis Pub/Sub<br/>Real-time)]
    CollabSvc -.-> MongoDB

    DocSvc --> MongoDB[(MongoDB<br/>Documents)]

    AuthSvc --> Postgres[(PostgreSQL<br/>Users/Perms)]

    MongoDB -.-> S3[S3/GCS<br/>Snapshots]
    MongoDB -.-> ES[Elasticsearch<br/>Search]

    classDef service fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#000000;
    classDef storage fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000000;
    classDef client fill:#f3e5f5,stroke:#4a148c,stroke-width:2px,rx:10,ry:10,color:#000000;
    classDef infrastructure fill:#f5f5f5,stroke:#616161,stroke-width:2px,color:#000000;

    class ClientLayer client;
    class CollabSvc,DocSvc,AuthSvc service;
    class Redis,MongoDB,Postgres,S3,ES storage;
    class LB infrastructure;

Walking Through the Data Flow

“Let me explain how a user edit flows through this system:”

  1. User types “Hello” in the browser

    • Client captures keystroke immediately
    • Applies change to local document (optimistic update for responsiveness)
    • Creates an “operation” object: {type: 'insert', position: 0, text: 'Hello'}
  2. Operation sent to Collaboration Service

    • WebSocket connection sends operation to server
    • Why WebSocket? Persistent connection = no handshake overhead for each edit
    • Operation includes: documentId, userId, revision number, timestamp
  3. Server processes operation

    • Validates permission (can this user edit?)
    • Transforms operation if needed (we’ll explain this in Step 4)
    • Stores operation in database
    • Publishes to Redis pub/sub channel for this document
  4. Broadcast to other users

    • All collaboration servers subscribed to this document’s Redis channel
    • They receive the operation and send to their connected clients
    • Other users see “Hello” appear in real-time
  5. Async processes

    • Document service aggregates operations periodically
    • Creates snapshots for version history
    • Updates search index
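
Steps 1 and 2 of this flow can be sketched from the client’s side. The class, message shape, and method names here are illustrative, not from a real client library:

```javascript
// Minimal sketch of the client side of the edit flow (names illustrative)
class EditorClient {
  constructor(documentId, userId, socket) {
    this.documentId = documentId;
    this.userId = userId;
    this.socket = socket; // assumed WebSocket-like object with send()
    this.revision = 0; // last server revision this client has seen
    this.content = "";
  }

  // Apply locally first (optimistic update), then ship the operation
  insertText(position, text) {
    this.content =
      this.content.slice(0, position) + text + this.content.slice(position);
    const operation = { type: "insert", position, text };
    this.socket.send(
      JSON.stringify({
        documentId: this.documentId,
        userId: this.userId,
        revision: this.revision, // tells the server what we've seen
        operation,
      })
    );
    return operation;
  }
}
```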

Why This Architecture?

Interviewer might ask: “Why separate Collaboration Service from Document Service?”

Answer:

  • Collaboration Service needs to be stateful (maintain WebSocket connections) and optimized for real-time throughput
  • Document Service can be stateless REST API, optimized for CRUD operations
  • This separation allows independent scaling: we might need 100 collaboration servers but only 20 document servers

Why Redis Pub/Sub?

  • We’ll have multiple collaboration servers (for scale)
  • When Server A receives an edit, it needs to notify users connected to Server B
  • Redis pub/sub is lightweight, in-memory, perfect for this fan-out pattern
  • Alternative would be direct server-to-server communication (more complex)

Why MongoDB for documents?

  • Flexible schema (documents can have varying structures)
  • Good horizontal scaling with sharding
  • JSON-like storage matches our document format
  • Could also use PostgreSQL with JSONB, but MongoDB’s replication is simpler

Why PostgreSQL for users/permissions?

  • User data and permissions need ACID guarantees
  • Relational data (users, shared links, permission hierarchies)
  • SQL is better for complex permission queries

Step 4: The Hardest Problem - Conflict Resolution

“Now let’s tackle the most challenging part of this design - what happens when two users edit the same part of the document simultaneously?”

The Core Problem

Imagine this scenario:

Initial Document: "Google"

Timeline:
T0: Both User A and User B have document at version 1: "Google"

T1: User A inserts " Docs" at position 6
    User A's view: "Google Docs"

T2: User B (hasn’t seen A's change yet) inserts " Drive" at position 6
    User B's view: "Google Drive"

T3: Operations arrive at server
    What should the final document be?

Three possible approaches:

Approach 1: Last Write Wins (Simple but Bad)

Whoever's operation arrives last overwrites
Final result: "Google Drive" (B's operation arrived second)
Problem: User A's edit disappears! Data loss!
Verdict: ❌ Not acceptable for collaborative editing

Approach 2: Locking (Traditional but Limiting)

Lock the document when someone is editing
User A starts typing → document locked
User B tries to edit → "Document locked by User A"
Problem: Defeats the purpose of real-time collaboration!
Verdict: ❌ Not suitable for Google Docs-like experience

Approach 3: Operational Transformation or CRDTs (Complex but Correct)

Transform operations based on concurrent changes
Both users' intents are preserved
Final result: "Google Docs Drive" or consistent resolution based on algorithm
Verdict: ✅ This is what we need!

“The industry has developed two main approaches for this: Operational Transformation (OT) and Conflict-Free Replicated Data Types (CRDTs). Let me explain both and then we’ll choose one.”

Step 5: Choosing Between OT and CRDT

Operational Transformation (OT)

Core Idea: Transform operations based on what has already happened.

How it works:

// Initial: "Hello"
// User A: Insert "!" at position 5 → "Hello!"
// User B: Insert " World" at position 5 → "Hello World"

// Without transformation:
// If both execute as-is, we get inconsistent states

// With OT:
// Server receives A's operation first
// When B's operation arrives, server transforms it:
// "B wanted to insert at position 5, but A already inserted 1 char
//  So B's operation should now be at position 6"
// Final: "Hello! World" (consistent!)

Implementation approach:

// Simplified OT transformation function
function transform(operationA, operationB) {
  // If A inserts before B's position, shift B's position
  if (operationA.type === "insert" && operationB.type === "insert") {
    if (operationA.position <= operationB.position) {
      operationB.position += operationA.text.length;
    }
  }

  // If A deletes before B's position, shift B's position back
  if (operationA.type === "delete" && operationB.type === "insert") {
    if (operationA.position < operationB.position) {
      operationB.position -= operationA.length;
    }
  }

  // Many more cases to handle...
  return operationB;
}

Pros:

  • ✅ Mature, battle-tested (Google Docs uses this)
  • ✅ Server has authority (easier to debug)
  • ✅ Good for rich text editing
  • ✅ Deterministic outcomes

Cons:

  • ❌ Complex to implement correctly (many edge cases)
  • ❌ Requires server coordination (higher latency)
  • ❌ Hard to support offline editing

Conflict-Free Replicated Data Types (CRDT)

Core Idea: Every character has a unique, immutable ID. Conflicts can’t happen because operations are commutative.

How it works:

// Instead of positions, each character has a unique ID
// Initial: "Hello"
// Represented as: [
//   {id: '1-A', char: 'H'},
//   {id: '2-A', char: 'e'},
//   {id: '3-A', char: 'l'},
//   {id: '4-A', char: 'l'},
//   {id: '5-A', char: 'o'}
// ]

// User A inserts '!' with id '6-A' after '5-A'
// User B inserts ' World' with ids '6-B', '7-B', ... after '5-A'

// Both operations can be applied in any order!
// Final order determined by ID comparison rule
// Result is always consistent
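
The idea sketched above can be made concrete with a toy sequence CRDT. This is deliberately simplified (no tombstones, no subtree ordering), so it is a teaching sketch rather than a production CRDT; the point is that concurrent inserts after the same anchor are ordered by comparing ids, so replicas converge regardless of delivery order:

```javascript
// Toy sequence CRDT: each char has a unique id and an "insert after" anchor.
// Not production-grade (deleted chars would need tombstones, etc.)
class ToyCrdtDoc {
  constructor() {
    this.chars = []; // [{ id, char, afterId }]
  }

  indexOfId(id) {
    return this.chars.findIndex((c) => c.id === id);
  }

  // Apply a local or remote insert; commutative for concurrent ops
  applyInsert({ id, char, afterId }) {
    let i = afterId === null ? 0 : this.indexOfId(afterId) + 1;
    // Skip over concurrent siblings with the same anchor and a larger id,
    // so every replica picks the same final order
    while (
      i < this.chars.length &&
      this.chars[i].afterId === afterId &&
      this.chars[i].id > id
    ) {
      i++;
    }
    this.chars.splice(i, 0, { id, char, afterId });
  }

  text() {
    return this.chars.map((c) => c.char).join("");
  }
}
```

Applying two concurrent inserts in either order yields the same text on both replicas, which is exactly the property OT has to work hard for and CRDTs get by construction.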

Pros:

  • ✅ Works offline perfectly (no server needed for consistency)
  • ✅ Lower latency (no need to wait for server transform)
  • ✅ Built for distributed systems
  • ✅ Eventually consistent by design

Cons:

  • ❌ More memory overhead (store IDs for each character)
  • ❌ Complex garbage collection (deleted chars need tombstones)
  • ❌ Can produce unexpected formatting results
  • ❌ Less mature for rich text

Our Choice: Operational Transformation

“For this interview, I’ll choose OT because:”

  1. Server-centric is simpler for MVP: Single source of truth makes debugging easier
  2. Better for rich text: OT handles formatting (bold, italic) more naturally
  3. Proven at scale: Google Docs has used OT successfully for 15+ years
  4. Lower memory usage: No need to store IDs for every character

Trade-off: We sacrifice offline editing capability and have higher latency, but we gain simpler semantics and proven reliability.

“If the interviewer asks about offline support, I’d say: We could add offline as v2 by caching locally and using CRDTs for offline chunks, then reconciling with server on reconnect.”

Implementing OT - The Server-Side Logic

“Let me show you how the server handles concurrent operations:”

// OT Server - The Source of Truth
class OTCollaborationServer {
  constructor() {
    this.documents = new Map(); // documentId -> { content, revision, history }
  }

  async handleOperation(documentId, operation, clientRevision) {
    const doc = this.documents.get(documentId);

    // Step 1: Get all operations that happened after client's revision
    // Client is at revision 5, but server is at revision 8
    // We need to transform against operations 6, 7, 8
    const concurrentOps = doc.history.slice(clientRevision);

    // Step 2: Transform the client's operation against each concurrent operation
    let transformedOp = operation;
    for (const concurrentOp of concurrentOps) {
      transformedOp = this.transform(transformedOp, concurrentOp);
    }

    // Step 3: Apply transformed operation to document
    doc.content = this.applyOperation(doc.content, transformedOp);

    // Step 4: Add to history and increment revision
    doc.history.push(transformedOp);
    doc.revision++;

    // Step 5: Broadcast transformed operation to all other clients
    await this.broadcast(documentId, {
      operation: transformedOp,
      revision: doc.revision,
    });

    return { success: true, revision: doc.revision };
  }

  // Transform operation A against operation B
  // Returns the version of A adjusted for the fact that B already applied
  transform(opA, opB) {
    // Case 1: Both are inserts
    if (opA.type === "insert" && opB.type === "insert") {
      if (opA.position < opB.position) {
        // A is before B, so A is unaffected
        return opA;
      } else if (opA.position > opB.position) {
        // B inserted before A, so A's position shifts forward
        return { ...opA, position: opA.position + opB.text.length };
      } else {
        // Same position - use tie-breaking (user ID, timestamp)
        // Let's say lower user ID wins and keeps its position
        if (opA.userId < opB.userId) {
          return opA;
        } else {
          return { ...opA, position: opA.position + opB.text.length };
        }
      }
    }

    // Case 2: Insert (A) vs Delete (B)
    if (opA.type === "insert" && opB.type === "delete") {
      if (opA.position <= opB.position) {
        // A inserts before the deleted range, so A is unaffected
        return opA;
      } else if (opA.position >= opB.position + opB.length) {
        // A inserts after the deleted range, shift A back
        return { ...opA, position: opA.position - opB.length };
      } else {
        // A inserts inside the deleted range - edge case!
        // Collapse A's position to the start of the deletion
        return { ...opA, position: opB.position };
      }
    }

    // Case 3: Delete (A) vs Insert (B)
    if (opA.type === "delete" && opB.type === "insert") {
      if (opB.position <= opA.position) {
        // B inserted before A's range, shift A forward
        return { ...opA, position: opA.position + opB.text.length };
      }
      // B inserted after A's range (splitting mid-range deletes omitted here)
      return opA;
    }

    // Case 4: Delete vs Delete
    if (opA.type === "delete" && opB.type === "delete") {
      // Complex! Need to handle overlapping deletes
      return this.transformDeleteDelete(opA, opB);
    }

    return opA;
  }

  // Apply operation to content string
  applyOperation(content, operation) {
    if (operation.type === "insert") {
      return (
        content.slice(0, operation.position) +
        operation.text +
        content.slice(operation.position)
      );
    } else if (operation.type === "delete") {
      return (
        content.slice(0, operation.position) +
        content.slice(operation.position + operation.length)
      );
    }
    return content;
  }
}

Key Points to Explain:

  1. Server maintains revision number: Every operation increments it
  2. Clients send their current revision: Server knows what they’ve seen
  3. Transform against unseen operations: Bridge the gap between client and server state
  4. Broadcast transformed version: Ensures all clients apply the same operation

Why this approach works:

  • Server is the single source of truth (no ambiguity)
  • Clients can be at different revisions (handles network delays)
  • Operations are transformed, not rejected (preserves user intent)

Step 6: Database Design and Storage

“Now let’s talk about how we persist this data. We have different types of data with different access patterns.”

Data Classification

1. Hot Data (Active Editing)

  • Current document state
  • Recent operations (last 100)
  • Active user sessions
  • Access Pattern: Very frequent reads/writes, low latency critical
  • Storage: Redis (in-memory)

2. Warm Data (Recent Documents)

  • Document metadata
  • Complete operation history
  • Version snapshots
  • Access Pattern: Frequent reads, moderate writes, can tolerate 10-50ms latency
  • Storage: MongoDB (disk-based, indexed)

3. Cold Data (Archives)

  • Old version snapshots
  • Deleted documents
  • Audit logs
  • Access Pattern: Rare reads, mostly sequential, latency not critical
  • Storage: S3/GCS (object storage)

MongoDB Schema Design

“Let me design the MongoDB schema with explanation for each choice:”

// Documents Collection
{
  _id: ObjectId("507f1f77bcf86cd799439011"),
  documentId: "doc_abc123",  // Public-facing ID
  title: "Q4 Planning Document",
  ownerId: "user_xyz",
  createdAt: ISODate("2026-01-15T10:00:00Z"),
  updatedAt: ISODate("2026-02-10T14:30:00Z"),
  currentRevision: 1247,  // Current version number

  // Document content stored as Quill Delta format
  // Why? It's designed for rich text and OT
  content: {
    ops: [
      { insert: "Hello " },
      { insert: "World", attributes: { bold: true } },
      { insert: "\n" }
    ]
  },

  // Permissions map for quick lookups
  permissions: {
    "user_xyz": "owner",
    "user_abc": "editor",
    "user_def": "viewer"
  },

  // Public access setting
  publicAccess: "private",  // private | link | public

  // Search index (denormalized for performance)
  searchableContent: "Hello World",  // Plain text for full-text search

  // Metadata
  metadata: {
    wordCount: 2,
    lastEditedBy: "user_abc",
    activeEditors: ["user_abc", "user_def"]  // Current active users
  }
}

// Why this structure?
// - documentId separate from _id: We can change internal IDs without breaking URLs
// - Denormalized permissions: Faster permission checks (no joins)
// - searchableContent: MongoDB text index for fast searches
// - currentRevision: Quick validation of client state
// Operations Log Collection (for OT)
{
  _id: ObjectId(...),
  documentId: "doc_abc123",
  revision: 1247,  // Incrementing revision number
  userId: "user_abc",
  timestamp: ISODate("2026-02-10T14:30:22Z"),

  // The actual operation
  operation: {
    type: "insert",
    position: 42,
    text: "collaboration",
    attributes: { bold: true }
  },

  // For cleanup (operations older than 30 days can be compacted)
  compacted: false
}

// Indexes:
// 1. { documentId: 1, revision: 1 } - Get operations for a document in order
// 2. { documentId: 1, timestamp: -1 } - Get recent operations
// 3. { compacted: 1, timestamp: 1 } - Find operations to compact

// Why separate collection?
// - Operations grow rapidly (100s per minute while editing)
// - Need to query by revision efficiently
// - Can archive old operations without affecting document reads
// Snapshots Collection (for version history)
{
  _id: ObjectId(...),
  documentId: "doc_abc123",
  revision: 1200,  // Snapshot every 100 revisions
  timestamp: ISODate("2026-02-10T14:00:00Z"),
  userId: "user_abc",  // Who was editing at snapshot time

  // Full document state at this revision
  content: {
    ops: [...]  // Complete Quill Delta
  },

  // Stored in S3 for cold storage
  s3Key: "snapshots/doc_abc123/rev_1200.json",

  // Sizes for UI
  size: 52480,  // bytes
  wordCount: 5000
}

// Why snapshots?
// - Can't reconstruct document from millions of operations (too slow)
// - Snapshots every 100 revisions = fast restoration
// - Users can restore to specific versions
// - Older snapshots move to S3 for cost efficiency
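
The index comments above translate directly into driver calls. A sketch assuming the official Node.js MongoDB driver, with collection names matching the schemas above:

```javascript
// Create the indexes described in the schema comments above
// (assumes an official Node.js MongoDB driver `db` handle)
async function createCollabIndexes(db) {
  const ops = db.collection("operations");
  await ops.createIndex({ documentId: 1, revision: 1 }); // ordered replay
  await ops.createIndex({ documentId: 1, timestamp: -1 }); // recent ops
  await ops.createIndex({ compacted: 1, timestamp: 1 }); // compaction scans
  await db
    .collection("documents")
    .createIndex({ searchableContent: "text" }); // full-text search
}
```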

Storage Tier Strategy

“Here’s how data flows through different storage tiers:”

class StorageOrchestrator {
  async getDocument(documentId, userId) {
    // Tier 1: Check Redis cache (sub-millisecond)
    let doc = await this.redis.get(`doc:${documentId}`);
    if (doc) {
      console.log("Cache hit - Redis");
      return JSON.parse(doc);
    }

    // Tier 2: Check MongoDB (10-50ms)
    doc = await this.mongodb.collection("documents").findOne({ documentId });
    if (doc) {
      // Promote to Redis for future requests
      await this.redis.setex(`doc:${documentId}`, 3600, JSON.stringify(doc));
      console.log("Cache miss - loaded from MongoDB");
      return doc;
    }

    // Tier 3: Reconstruct from snapshot + operations
    const snapshot = await this.getLatestSnapshot(documentId);
    const operations = await this.getOperationsSince(
      documentId,
      snapshot.revision
    );

    doc = this.reconstructDocument(snapshot, operations);

    // Promote through tiers
    await this.mongodb.collection("documents").insertOne(doc);
    await this.redis.setex(`doc:${documentId}`, 3600, JSON.stringify(doc));

    console.log("Reconstructed from snapshot");
    return doc;
  }

  async saveOperation(documentId, operation) {
    // Write to MongoDB (persistent)
    await this.mongodb.collection("operations").insertOne({
      documentId,
      operation,
      timestamp: new Date(),
    });

    // Invalidate cache (force reload with new operation applied)
    await this.redis.del(`doc:${documentId}`);

    // Create snapshot every 100 operations
    const revision = await this.getCurrentRevision(documentId);
    if (revision % 100 === 0) {
      await this.createSnapshot(documentId, revision);
    }
  }
}

Why this matters:

  • Active documents stay in Redis (fastest access)
  • Inactive documents in MongoDB (durable, queryable)
  • Old versions in S3 (cheapest storage)
  • System automatically promotes/demotes based on usage

Step 7: Scaling the System

“Let’s discuss how to scale this to millions of users. I’ll identify bottlenecks and solutions.”

Bottleneck 1: Collaboration Servers (Stateful)

Problem: WebSocket connections are stateful - can’t easily add servers

Solution: Sticky sessions + Redis Pub/Sub

User connects → Load balancer routes to Server A (sticky session)
User sends operation → Server A processes it
Server A publishes to Redis → All servers receive it
Servers broadcast to their connected clients

Load Balancer Configuration:

upstream collab_servers {
    # Hash based on client IP or session cookie
    ip_hash;  # Or: hash $cookie_sessionid;

    server collab1.example.com:8080 max_fails=3 fail_timeout=30s;
    server collab2.example.com:8080 max_fails=3 fail_timeout=30s;
    server collab3.example.com:8080 max_fails=3 fail_timeout=30s;
    # Can keep adding servers horizontally
}

server {
    location /collab/ {
        proxy_pass http://collab_servers;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";

        # Long timeout for WebSocket
        proxy_read_timeout 3600s;
    }
}

Why this works:

  • Each user stays connected to same server (sticky session)
  • Redis ensures cross-server communication
  • Can scale horizontally by adding more servers
  • If a server dies, only its connected users need to reconnect
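
The cross-server fan-out can be sketched as follows. `pub` and `sub` are assumed to be ioredis-style clients (two connections, because a Redis connection in subscribe mode cannot publish); the `FanOut` class and `doc:` channel naming are illustrative:

```javascript
// Sketch of the Redis pub/sub fan-out between collaboration servers
class FanOut {
  constructor(pub, sub, localClients) {
    this.pub = pub; // ioredis-style client used only for publish
    this.sub = sub; // ioredis-style client in subscribe mode
    this.localClients = localClients; // Map: documentId -> Set of sockets
    this.sub.on("message", (channel, message) => {
      const documentId = channel.slice("doc:".length);
      // Deliver to clients connected to THIS server only
      for (const ws of this.localClients.get(documentId) || []) {
        ws.send(message);
      }
    });
  }

  joinDocument(documentId) {
    return this.sub.subscribe(`doc:${documentId}`);
  }

  broadcast(documentId, operation) {
    return this.pub.publish(`doc:${documentId}`, JSON.stringify(operation));
  }
}
```

Every server runs one of these; an operation published by Server A reaches Server B’s subscriber, which pushes it down B’s WebSockets.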

Bottleneck 2: Database Write Throughput

Problem: 100,000 operations/second is a lot for single MongoDB instance

Solution: Sharding + Write-ahead log optimization

// Shard by documentId (documents are independently editable)
// Hash sharding: documentId → shard based on hash

// MongoDB shard key
sh.shardCollection("docs.operations", { documentId: "hashed" });

// Why hashed sharding?
// - Documents are isolated (no cross-document queries)
// - Even distribution across shards
// - Queries by documentId go to single shard (fast)

Write optimization - Batching:

class OptimizedOperationLogger {
  constructor() {
    this.operationBuffer = [];
    this.flushInterval = 100; // ms

    setInterval(() => this.flush(), this.flushInterval);
  }

  logOperation(operation) {
    this.operationBuffer.push(operation);

    // Flush if buffer gets too large
    if (this.operationBuffer.length >= 100) {
      this.flush();
    }
  }

  async flush() {
    if (this.operationBuffer.length === 0) return;

    const batch = this.operationBuffer.splice(0);

    // Single bulk write instead of 100 individual writes
    await this.mongodb.collection("operations").insertMany(batch, {
      ordered: false, // Don't stop on error, insert all that succeed
    });
  }
}

// Why batching?
// - Reduce network round trips
// - MongoDB can optimize bulk inserts
// - 100ms delay acceptable for operation logging (async from user perspective)

Bottleneck 3: Redis Single Point of Failure

Problem: If Redis dies, real-time collaboration stops

Solution: Redis Sentinel or Redis Cluster

// Redis Sentinel for high availability
const redis = new Redis({
  sentinels: [
    { host: "sentinel1.example.com", port: 26379 },
    { host: "sentinel2.example.com", port: 26379 },
    { host: "sentinel3.example.com", port: 26379 },
  ],
  name: "mymaster", // Name of master instance
  // Automatic failover if master dies
});

// Why Sentinel?
// - Automatic master election if primary fails
// - Typically failover happens in 30-60 seconds
// - Clients automatically reconnect to new master
// - During failover, users might see 30-60s delay in real-time updates

Bottleneck 4: Geographic Distribution

Problem: Users in Australia connecting to US server = 200+ms latency

Solution: Regional collaboration servers + Global MongoDB

Architecture:
- Collaboration servers in 10+ regions (AWS regions)
- Route user to nearest region (latency-based routing)
- All regions connect to same MongoDB cluster (global)
- Redis pub/sub in each region, cross-region replication

Example:
- User in Sydney connects to ap-southeast-2 collab server
- User in London connects to eu-west-1 collab server
- Both editing same document
- Operations flow: Sydney → ap-southeast-2 Redis → MongoDB →
  eu-west-1 Redis → London user
- Latency: ~100-150ms (better than 300+ms)

Capacity Planning

“Let me show you the math for scaling:”

Assumptions (sized with headroom above the 50,000-session peak estimated in Step 2):
- 100,000 concurrent collaborative editing sessions
- 2 operations/second per user (active typing)
- Total: 200,000 ops/second

Collaboration Servers:
- 1 server handles 2,000 concurrent WebSocket connections
- Need: 100,000 / 2,000 = 50 servers
- With 2x redundancy: 100 servers

Redis:
- 200,000 pub/sub messages/second
- Redis can handle 100,000+ ops/sec per instance
- Need: 2-3 Redis instances (plus replicas)

MongoDB:
- 200,000 write ops/second
- After batching (100ms window): 2,000 bulk writes/second
- 1 MongoDB shard handles ~10,000 writes/second
- Need: 1 shard (but use 3 for redundancy and growth)

Total Infrastructure:
- 100 collaboration servers (e.g., c5.xlarge = ~$10k/month)
- 6 Redis instances (m5.large = ~$1k/month)
- 9 MongoDB instances (3 shards × 3 replicas, r5.xlarge = ~$5k/month)
- Load balancers, networking: ~$2k/month
- Total: ~$18k/month for 100K concurrent users

Step 8: Security and Permissions

“Security is critical. Let me design a robust permission system.”

Permission Model

const PERMISSIONS = {
  OWNER: ["read", "write", "comment", "share", "delete", "manage"],
  EDITOR: ["read", "write", "comment"],
  COMMENTER: ["read", "comment"],
  VIEWER: ["read"],
};

function hasPermission(userRole, action) {
  return PERMISSIONS[userRole]?.includes(action) || false;
}

Permission Checking Flow

“Every operation must be authorized. Here’s the flow:”

async function checkDocumentAccess(req, res, next) {
  const { documentId } = req.params;
  const userId = req.user?.id;

  const document = await Document.findById(documentId);
  if (!document) return res.status(404).json({ error: "Not found" });

  // Check 1: Direct permission
  if (document.permissions[userId]) {
    req.userPermission = document.permissions[userId];
    return next();
  }

  // Check 2: Share link access
  const shareToken = req.query.share_token;
  if (shareToken) {
    const share = await ShareLink.findOne({ token: shareToken, documentId });
    if (share && !share.expired) {
      req.userPermission = share.permission;
      return next();
    }
  }

  // Check 3: Public access
  if (document.publicAccess === "public") {
    req.userPermission = "VIEWER";
    return next();
  }

  return res.status(403).json({ error: "Access denied" });
}

Why this matters: Multi-layered security ensures only authorized users can access documents, with different access paths (direct, shared, public) for flexibility.

Step 9: Handling Edge Cases

“Let me address critical edge cases interviewers often ask about:”

Edge Case 1: Network Disconnection

Scenario: User typing, network drops, reconnects after 10 seconds

Approach:

  1. Client queues operations locally (optimistic UI)
  2. On reconnect, fetch latest server revision
  3. Transform queued operations against missed server operations
  4. Resend transformed operations

class ResilientClient {
  async onReconnect() {
    const serverState = await this.fetchLatestState();

    // Transform pending ops against what we missed
    const missedOps = serverState.operationsSince(this.lastRevision);
    for (let op of missedOps) {
      this.applyRemoteOp(op);
      this.transformPending(op);
    }

    // Resend pending
    for (let op of this.pendingOps) {
      await this.send(op);
    }
  }
}

Edge Case 2: Very Large Documents (100,000+ words)

Solution: Chunking + Lazy Loading

  • Split document into 10KB chunks
  • Load only visible chunks + 1 chunk buffer
  • Operations reference chunk ID + local offset
  • Unload offscreen chunks to save memory
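
The chunk-addressed positions above can be sketched with a small helper. This is a first approximation under the stated assumptions: real chunk boundaries drift as text is edited, so a production system keeps a per-chunk length index like the one passed in here:

```javascript
// Map a global character offset to a { chunkIndex, offset } address, given
// the current lengths of each chunk (boundaries drift as text is edited)
function toChunkAddress(chunkLengths, globalOffset) {
  let remaining = globalOffset;
  for (let i = 0; i < chunkLengths.length; i++) {
    const isLast = i === chunkLengths.length - 1;
    if (remaining < chunkLengths[i] || isLast) {
      return { chunkIndex: i, offset: remaining };
    }
    remaining -= chunkLengths[i];
  }
}
```

Operations then carry `{ chunkIndex, offset }` instead of a global position, so only the touched chunk needs to be loaded.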

Edge Case 3: Rapid Typing (100 keystrokes/second)

Solution: Keystroke batching

  • Buffer operations for 50ms
  • Combine consecutive insertions into single operation
  • Send one operation instead of 100
  • Reduces network traffic by 95%
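
A minimal sketch of the 50ms keystroke buffer, with illustrative names. Contiguous single-character inserts extend one pending operation; the timer (or a non-contiguous edit) flushes it as a single operation:

```javascript
// Sketch of a 50ms keystroke buffer (class and callback names illustrative)
class KeystrokeBatcher {
  constructor(send, windowMs = 50) {
    this.send = send; // callback that ships one operation to the server
    this.pending = null; // in-progress insert being extended
    this.windowMs = windowMs;
    this.timer = null;
  }

  type(position, char) {
    if (
      this.pending &&
      position === this.pending.position + this.pending.text.length
    ) {
      this.pending.text += char; // extend the buffered insert
    } else {
      this.flush(); // non-contiguous edit: ship what we have
      this.pending = { type: "insert", position, text: char };
    }
    if (!this.timer) {
      this.timer = setTimeout(() => this.flush(), this.windowMs);
    }
  }

  flush() {
    if (this.timer) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.pending) {
      this.send(this.pending);
      this.pending = null;
    }
  }
}
```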

Edge Case 4: Malicious User Spamming Operations

Solution: Rate limiting

// Max 100 operations per second per user (minimal fixed-window counter)
class RateLimit {
  constructor({ window, max }) {
    Object.assign(this, { window, max, hits: new Map() });
  }
  check(userId) {
    const now = Date.now();
    let bucket = this.hits.get(userId); // { start, count }
    if (!bucket || now - bucket.start >= this.window) {
      this.hits.set(userId, (bucket = { start: now, count: 0 }));
    }
    return ++bucket.count <= this.max;
  }
}
const rateLimiter = new RateLimit({
  window: 1000, // 1 second
  max: 100, // 100 operations
});

if (!rateLimiter.check(userId)) {
  throw new Error("Rate limit exceeded");
}

Step 10: Performance Optimizations

“Here are key optimizations to make it fast:”

1. Operation Compaction

Combine consecutive operations from same user:

  • “H” + “e” + “l” + “l” + “o” → “Hello” (5 ops → 1 op)
  • Run periodically (every 15 minutes)
  • Reduces storage by 80-90%
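A sketch of the compaction pass (operation shape is illustrative): fold each insert into the previous one when it is by the same user and continues directly where the last insert ended.

```javascript
// Sketch of operation-log compaction: merge runs of consecutive
// inserts by the same user into a single insert operation.
function compact(ops) {
  const result = [];
  for (const op of ops) {
    const last = result[result.length - 1];
    const continues =
      last &&
      last.type === "insert" &&
      op.type === "insert" &&
      last.userId === op.userId &&
      last.position + last.text.length === op.position;
    if (continues) {
      last.text += op.text; // fold into the previous op
    } else {
      result.push({ ...op });
    }
  }
  return result;
}
```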

2. Cursor Position Throttling

Broadcast cursor moves at most 10 times/second (not 100 times/second):

  • Reduces bandwidth by 90%
  • Still feels real-time to users
  • Use requestAnimationFrame for smooth rendering
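A leading-edge throttle for cursor broadcasts might look like this (simplified sketch: a full version would also schedule a trailing timer so the final pending position is eventually flushed):

```javascript
// Sketch of cursor broadcast throttling: emit at most once per interval
// (100ms = 10 updates/second), remembering the latest position.
function throttleCursor(broadcast, intervalMs = 100) {
  let lastSent = -Infinity;
  let pending = null;
  return function onCursorMove(position, now = Date.now()) {
    if (now - lastSent >= intervalMs) {
      lastSent = now;
      broadcast(position); // send immediately
      pending = null;
    } else {
      pending = position; // keep latest; a trailing timer would flush it
    }
  };
}
```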

3. Smart Caching Strategy

L1: Browser cache (instant)
L2: Redis cache (1-5ms)
L3: MongoDB (10-50ms)
L4: S3 snapshots (100-500ms)

Promote frequently accessed documents up the chain.
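The tiered lookup with promotion can be sketched as a read-through helper (the tier objects are illustrative stand-ins for browser cache, Redis, MongoDB, and S3 clients):

```javascript
// Sketch of a read-through lookup across cache tiers, promoting hits
// to every faster tier so subsequent reads get cheaper.
async function readDocument(docId, tiers) {
  for (let i = 0; i < tiers.length; i++) {
    const doc = await tiers[i].get(docId);
    if (doc !== undefined) {
      // Promote to all faster tiers (L1 before L2, etc.)
      for (let j = 0; j < i; j++) await tiers[j].set(docId, doc);
      return doc;
    }
  }
  return undefined; // not found in any tier
}
```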

Real-World Implementations

Google Docs Architecture

What they use:

  • Frontend: Custom JavaScript editor
  • Collaboration: C++ servers with OT
  • Storage: Bigtable (hot data) + Colossus (cold data)
  • Caching: Memcached extensively
  • Scale: 2 billion users globally

Key innovations:

  • Document chunking (64KB chunks)
  • Predictive loading
  • Aggressive compression (Gzip over wire)
  • Edge caching for static assets

Notion’s Approach

What they use:

  • Block-based architecture: Everything is a block
  • CRDT: Fractional indexing for ordering
  • Storage: PostgreSQL with JSONB
  • Real-time: WebSockets + Redis
  • Scale: 20+ million users

Key difference: Notion uses blocks, not characters as the atomic unit. Each block can be reordered independently using fractional indexing.
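The core of fractional indexing can be sketched with numeric keys (a simplification: Notion actually uses string keys, since repeated float midpoints exhaust precision):

```javascript
// Sketch of fractional indexing for block ordering: to place a block
// between two neighbors, pick an order key strictly between theirs.
// No other block needs to be renumbered.
function keyBetween(before, after) {
  if (before === null && after === null) return 1.0; // first block
  if (before === null) return after - 1;             // insert at start
  if (after === null) return before + 1;             // insert at end
  return (before + after) / 2;                       // insert in between
}
```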

Figma’s Multiplayer

What they use:

  • Custom CRDT: Property-based for design objects
  • Backend: Rust for performance
  • Client: C++ compiled to WebAssembly
  • Protocol: Custom binary protocol
  • Performance: Sub-100ms latency, 60 FPS rendering

Key achievement: Can handle 1000+ concurrent editors on a single design file.

Common Interview Follow-Up Questions

Q: How would you add offline support?

Answer: “I’d use a hybrid approach:

  1. Switch to CRDT for offline chunks (better for distributed edits)
  2. Cache document locally in IndexedDB
  3. Track operations while offline
  4. On reconnect, sync with server using CRDT merge
  5. Fall back to OT for server-side conflict resolution if needed

Trade-off: Added complexity, but enables offline-first experience.”

Q: How do you handle conflicts in formatting?

Answer: “Formatting conflicts are tricky. Approach:

  1. Formatting is metadata attached to character ranges
  2. Use last-write-wins with timestamps for same range
  3. For overlapping ranges, merge attributes (e.g., bold + italic = both)
  4. Server timestamp is source of truth for tie-breaking

Example: If User A bolds ‘Hello’ and User B italicizes ‘Hello’ simultaneously, result is ‘Hello’ with both bold and italic.”
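The merge rule above can be sketched for a single character range (attribute-map shape is illustrative): distinct attributes combine, and only collisions on the same attribute fall back to last-write-wins.

```javascript
// Sketch of formatting-conflict resolution for one character range:
// merge attribute maps; the same attribute is won by the later
// server timestamp, different attributes simply combine.
function mergeFormatting(a, b) {
  const merged = { ...a.attrs };
  for (const [attr, value] of Object.entries(b.attrs)) {
    if (!(attr in merged) || b.timestamp >= a.timestamp) {
      merged[attr] = value;
    }
  }
  return merged;
}
```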

Q: How would you implement version history with branching?

Answer: “Interesting extension! I’d design it like Git:

  1. Each save creates a snapshot with parent pointer
  2. Branch when user restores old version and edits
  3. Store as directed acyclic graph (DAG)
  4. Show visual tree in UI
  5. Allow merge between branches (complex OT problem)

This would require significant UX work to make intuitive, so I’d start with linear history for MVP.”
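The snapshot-with-parent-pointer idea can be sketched like this (names are hypothetical; merging branches is left out, as noted above):

```javascript
// Sketch of Git-like version history: each snapshot points at its
// parent; restoring an old version and editing creates a branch,
// and the snapshots form a DAG.
class VersionHistory {
  constructor() {
    this.snapshots = new Map(); // id -> { parent, content }
    this.nextId = 1;
  }

  commit(parentId, content) {
    const id = this.nextId++;
    this.snapshots.set(id, { parent: parentId, content });
    return id;
  }

  // Walk parent pointers back to the root snapshot.
  lineage(id) {
    const path = [];
    for (let cur = id; cur !== null; cur = this.snapshots.get(cur).parent) {
      path.push(cur);
    }
    return path;
  }
}
```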

Q: What if the server goes down while users are editing?

Answer: “Multi-pronged approach:

  1. High availability: Multiple servers, automatic failover (60s max)
  2. Client-side: Queue operations, retry with exponential backoff
  3. Data durability: Operations logged to durable storage immediately
  4. User experience: Show ‘Reconnecting…’ banner, keep local changes
  5. Recovery: On reconnect, client sends queued operations with last known revision

During the ~60s failover window, users can still type (local only), then sync when service recovers.”
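The retry schedule for point 2 can be sketched as a pure function (base delay and cap are illustrative; real clients would also add jitter to avoid thundering herds):

```javascript
// Sketch of exponential backoff: double the delay on each retry
// attempt, capped at maxMs.
function backoffDelays(baseMs, maxMs, attempts) {
  const delays = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(baseMs * 2 ** i, maxMs));
  }
  return delays;
}
```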

Q: How do you prevent data loss?

Answer: “Defense in depth:

  1. Immediate persistence: Operations written to MongoDB before acknowledging
  2. Replication: 3-way replica set for MongoDB
  3. Snapshots: Every 100 revisions, stored in S3 with versioning
  4. Cross-region backup: Async replication to different geographic region
  5. Client-side: Pending operations persisted in IndexedDB
  6. Audit log: Immutable log of all operations for recovery

Recovery time objective (RTO): <1 minute
Recovery point objective (RPO): <1 second (last acknowledged operation)”

Conclusion

Designing a collaborative document editor like Google Docs is a challenging system design problem that tests your understanding of:

  1. Real-time systems - WebSockets, pub/sub, low-latency architectures
  2. Distributed systems - Conflict resolution, consistency, CAP theorem trade-offs
  3. Scalability - Horizontal scaling, caching strategies, database sharding
  4. Data structures - OT vs CRDT, understanding algorithmic trade-offs
  5. Product thinking - Permission models, user experience, edge cases

Key interview tips:

  • Start with clarifying questions - don’t assume requirements
  • State your assumptions explicitly - show you’re being thoughtful
  • Walk through data flow - demonstrate understanding of how systems connect
  • Discuss trade-offs - every decision has pros and cons
  • Consider scale - think through bottlenecks and solutions
  • Address edge cases - shows thoroughness and real-world thinking

The most important thing in a system design interview is communication. Explain your reasoning, involve the interviewer, and show how you think through complex problems systematically.

Google Docs took years to perfect, but understanding the core principles - conflict resolution, real-time synchronization, and scale - will serve you well in many distributed systems problems.

References

  1. Google Wave Operational Transformation Protocol - Original OT specification https://svn.apache.org/repos/asf/incubator/wave/whitepapers/operational-transform/operational-transform.html

  2. Conflict-Free Replicated Data Types (CRDTs) - Shapiro et al., 2011 https://hal.inria.fr/inria-00609399/document

  3. Figma’s Multiplayer Technology - Evan Wallace (CTO) https://www.figma.com/blog/how-figmas-multiplayer-technology-works/

  4. Yjs - CRDT Framework - Documentation and research https://docs.yjs.dev/

  5. Automerge - CRDT Implementation - Martin Kleppmann https://automerge.org/

  6. ShareDB - OT Framework - Real-time collaborative editing https://share.github.io/sharedb/

  7. Google Docs Engineering Blog - Architecture insights https://workspace.google.com/blog

YouTube Videos

  1. “Building a Collaborative Editor” - Hussein Nasser https://www.youtube.com/watch?v=bUHFg8CZFws

  2. “How Google Docs Works” - Fireship https://www.youtube.com/watch?v=NtMvNh0WFVM

  3. “CRDTs: The Hard Parts” - Martin Kleppmann at Hydra Conference https://www.youtube.com/watch?v=x7drE24geUw

  4. “Operational Transformation in Real-Time Collaborative Editing” - Google I/O https://www.youtube.com/watch?v=84zqbXUQIHc

  5. “Figma’s Multiplayer Technology Deep Dive” - Evan Wallace https://www.youtube.com/watch?v=xDuwrtwYHu8

  6. “System Design: Google Docs” - Gaurav Sen https://www.youtube.com/watch?v=NtMvNh0WFVM

  7. “Building Notion: Database Architecture” - Systems Design Interview https://www.youtube.com/watch?v=8mAITcNt710

