
System Design Interview: Basic Notification System (Email & Push)

By Pratik Bhuite | 30 min read


Last verified: Feb 11, 2026

Part 3 of 8 in the System Design Interview Series



This post walks you through designing a basic yet robust notification system in a system design interview. We’ll cover everything from initial requirements to scaling for millions of users, focusing on email and push notifications for events like password resets or comment replies.

Need a quick revision before interviews? Read the companion cheat sheet: System Design Interview: Notification System Design Interview CheatSheet.


Interview Framework: How to Approach This Problem

When an interviewer asks you to design a notification system, they’re testing your ability to build a reliable, scalable, and decoupled architecture. Here’s a structured approach to impress them:

  1. Clarify requirements (5 min) – Don’t assume. Ask about channels, volume, and delivery guarantees.
  2. State assumptions (2 min) – Define the scale (e.g., 1M users, 400k notifications/day) and latency targets.
  3. High-level design (10 min) – Sketch the main components. A message queue is non-negotiable here.
  4. Deep dive (20 min) – Focus on the hardest problem: reliable delivery. Discuss retry logic, backoff strategies, and dead-letter queues.
  5. Scale and optimize (10 min) – Identify bottlenecks (e.g., queue overload, database writes) and propose solutions.
  6. Edge cases (3 min) – Show thoroughness by considering failures, invalid inputs, and user preferences.

Key mindset: Think out loud. Explain the “why” behind your decisions and discuss trade-offs. For example, “I’m choosing a message queue here to decouple the event producers from the notification workers. This makes the system more resilient; if the notification service is down, we don’t lose events.”

Step 1: Clarifying Requirements

“Before I design anything, I need to understand the exact requirements. Let me ask a few clarifying questions.”

Questions to Ask the Interviewer

  • Channels: Which notification channels must we support? (e.g., Email, Push, SMS?) Let’s start with Email and Push.
  • Triggers: What specific events trigger these notifications? (e.g., Password Reset, Comment Reply, New Follower).
  • Timeliness: Should notifications be sent instantly, or is some delay acceptable? For password resets, it must be near-instant. For comment replies, a minute is fine.
  • Delivery Guarantees: What level of reliability is needed? At-least-once delivery is critical. We can’t lose a password reset email.
  • User Preferences: Do users need to control which notifications they receive on which channel? Yes, this is a key feature.
  • Scale: What is the expected scale? How many users and how many notifications per day? Let’s assume 1 million users and about 400,000 notifications/day.
  • Compliance: Are there any legal requirements to consider, like GDPR or CAN-SPAM for emails? Yes, we must support unsubscribes.

Functional Requirements

Based on the conversation, here are our core features:

  1. Send notifications via Email and Push channels.
  2. Support multiple event types (Password Reset, Comment Reply).
  3. Allow users to manage their notification preferences.
  4. Automatically retry sending failed notifications.
  5. Provide a mechanism for users to unsubscribe from marketing emails (CAN-SPAM compliance).
  6. Log the status of every notification for debugging and auditing.

Non-Functional Requirements

  1. Reliability: At-least-once delivery. No notifications should be lost. Aim for a 99.9% delivery success rate.
  2. Scalability: The system must handle traffic spikes, such as during a site-wide announcement. It should scale to millions of users.
  3. Latency: Transactional notifications (password resets) should be delivered in under 10 seconds. Other notifications within 1 minute.
  4. Durability: Notification data and logs should be safely stored and not lost.

Step 2: Core Assumptions and Constraints

“Great, now I’ll make some explicit assumptions about scale to guide my design.”

Traffic Assumptions

  • Total Users: 1 million
  • Daily Active Users (DAU): 200,000 (20% of total)
  • Average Notifications per DAU: 2
  • Total Notifications per Day: 400,000
  • Peak Traffic: Assume peak is 5x the average. 400,000 / (24 * 60) ≈ 278 notifications/minute, so peak ≈ 5 * 278 ≈ 1,390, which we round up to 1,400 notifications/minute.

Scale Calculations

Storage Estimation (for logs):

  • Assume each log entry is 250 bytes.
  • 400,000 notifications/day * 250 bytes/notification = 100 MB/day.
  • Per Year: 100 MB * 365 = ~36.5 GB. This is manageable for a standard database.

Bandwidth Estimation (at peak):

  • Assume average notification payload is 2 KB.
  • 1,400 notifications/min * 2 KB/notification = 2.8 MB/min. This is very low bandwidth.
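The estimates above can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the assumptions already stated in this step; the function and field names are illustrative.

```javascript
// Back-of-the-envelope estimates from the stated assumptions.
const dailyActiveUsers = 200_000;   // 20% of 1 million users
const notificationsPerDau = 2;
const peakMultiplier = 5;
const logEntryBytes = 250;
const payloadKb = 2;

function estimate() {
  const perDay = dailyActiveUsers * notificationsPerDau;      // 400,000
  const avgPerMinute = perDay / (24 * 60);                    // about 278
  const peakPerMinute = avgPerMinute * peakMultiplier;        // about 1,389 (rounded to 1,400 in the text)
  const logMbPerDay = (perDay * logEntryBytes) / 1e6;         // 100 MB
  const logGbPerYear = (logMbPerDay * 365) / 1000;            // 36.5 GB
  const peakMbPerMinute = (peakPerMinute * payloadKb) / 1000; // about 2.8 MB
  return { perDay, avgPerMinute, peakPerMinute, logMbPerDay, logGbPerYear, peakMbPerMinute };
}

console.log(estimate());
```

Keeping the inputs as named constants makes it easy to redo the numbers live if the interviewer changes an assumption.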

Technology Constraints

  • Message Queue: We must use a message queue (like RabbitMQ, AWS SQS, or Kafka) to decouple services. This is the cornerstone of a reliable notification system.
  • External Services: We will rely on third-party services for the actual delivery, like SendGrid for email and Firebase Cloud Messaging (FCM) for push notifications. We are not building an SMTP server.
  • Database: A relational database like PostgreSQL is suitable for storing logs and user preferences due to its reliability and query capabilities.

Step 3: High-Level Architecture

“Let me start with a high-level architecture. The core of this design is a decoupled system using a message queue.”

System Flow Diagram

flowchart TD
    Client(Client Layer) --> Producer[Event Producer<br/>User Service]
    Producer --> Queue[Message Queue<br/>RabbitMQ/Kafka]
    Queue --> Worker[Notification Worker]

    Worker --> LogDB[(Log DB)]
    Worker --> DLQ[(Dead Letter Queue)]
    Worker --> Email[Email Service<br/>SendGrid]
    Worker --> Push[Push Service<br/>FCM/APNS]

    Email --> User((User))
    Push --> User

    classDef service fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#000000;
    classDef storage fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000000;
    classDef client fill:#f3e5f5,stroke:#4a148c,stroke-width:2px,rx:10,ry:10,color:#000000;
    classDef infrastructure fill:#f5f5f5,stroke:#616161,stroke-width:2px,color:#000000;

    class Client,User client;
    class Producer,Worker,Email,Push service;
    class LogDB,DLQ storage;
    class Queue infrastructure;

Data Flow

  1. Event Occurs: A service (e.g., AuthService) triggers an event like PasswordReset.
  2. Event Producer: The service acts as a producer, creating a message with details like userId, eventType, and payload.
  3. Message Queue: The producer sends this message to a message queue (e.g., an SQS queue named notification_jobs).
  4. Notification Worker: A pool of workers consumes messages from the queue. For each message, a worker:
    a. Fetches the user’s notification preferences from the database.
    b. Determines the right channel (Email, Push, or both).
    c. Calls the appropriate external service (SendGrid for email, FCM for push).
  5. Logging: The worker logs the status (SENT, FAILED) to a notification_logs table in the database.
  6. Failure Handling: If delivery fails, the worker can retry or move the message to a Dead Letter Queue (DLQ) for manual inspection.
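To make the flow concrete, here is a rough sketch of the message a producer might enqueue and of how a worker could turn preferences into per-channel sends. Only userId, eventType, and payload come from the description above; the other fields (id, retryCount, createdAt) and both function names are assumptions for illustration.

```javascript
// Step 2: the producer builds a message. The id doubles as an idempotency key.
function buildMessage(userId, eventType, payload, id) {
  return {
    id,                 // assumed: unique ID generated by the producer
    userId,
    eventType,
    payload,
    retryCount: 0,      // assumed: incremented on each re-queue
    createdAt: new Date().toISOString(),
  };
}

// Step 4b: decide which channels to use from the user's preferences.
// The preference field names here are hypothetical.
function planChannelSends(message, preferences) {
  const sends = [];
  if (preferences.emailEnabled) sends.push({ messageId: message.id, channel: "EMAIL" });
  if (preferences.pushEnabled) sends.push({ messageId: message.id, channel: "PUSH" });
  return sends;
}

const msg = buildMessage(42, "COMMENT_REPLY", { commentId: "c1" }, "evt-1");
console.log(planChannelSends(msg, { emailEnabled: true, pushEnabled: false }));
```

Each planned send would then be handed to the matching provider and logged with its own status, so one message can produce one log row per channel.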

Why This Architecture?

  • Decoupling: The services that produce events don’t need to know anything about how notifications are sent. They just fire an event and move on.
  • Reliability: If the NotificationWorker is down, messages pile up in the queue and are processed when it comes back online. As long as the queue persists messages durably, no events are lost.
  • Scalability: If there’s a surge in notifications, we can simply scale up the number of NotificationWorker instances to process the queue faster.

Step 4: The Hardest Problem - Reliable Delivery & Retry

“The most critical part of this system is ensuring reliable delivery. Let’s discuss how to handle failures.”

Problem Statement

External services are not 100% reliable.

  • An email provider’s API might be down.
  • A push notification service might be experiencing high latency.
  • A user might have provided an invalid email address.

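We need a robust retry mechanism for transient failures (the first two cases). Permanent failures, such as an invalid address, should not be retried; they are handled separately in Step 9.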

Approaches

Approach 1: Fire-and-Forget (Not an option)

  • Send the notification once. If it fails, do nothing.
  • Problem: Unacceptable for transactional notifications. We would lose password reset emails.
  • Verdict: ❌ Not a viable solution.

Approach 2: Synchronous Retry

  • If sending fails, immediately retry in a loop within the same worker process.
  • Problem: This blocks the worker. If a third-party service is down for minutes, the worker will be stuck retrying and won’t process other notifications from the queue.
  • Verdict: ❌ Poorly scalable and not resilient.

Approach 3: Asynchronous Retry with Exponential Backoff (The Correct Approach)

  • If sending fails, re-queue the message with a delay. Increase the delay for each subsequent failure.
  • How it works:
    1. Worker tries to send a notification. It fails.
    2. The worker sends the message back to the queue but with a delay property (e.g., 1 minute).
    3. After 1 minute, the message becomes visible again. Another worker picks it up.
    4. If it fails again, re-queue with a delay that doubles on each attempt (e.g., 2 minutes, then 4 minutes, then 8 minutes). This is exponential backoff.
    5. After a certain number of retries (e.g., 5), move the message to a Dead Letter Queue (DLQ).
  • Verdict: ✅ This is the industry-standard, resilient, and scalable approach.

Implementation Example (Pseudocode):

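// Pseudocode for a notification worker.
// sendViaProvider, logStatus, requeueMessage, moveToDLQ, and calculateBackoff
// are helpers assumed to be provided by the surrounding system.
const MAX_RETRIES = 5;

async function processNotification(message) {
  const retryCount = message.retryCount ?? 0;
  try {
    await sendViaProvider(message);
    await logStatus(message.id, "SENT");
  } catch (error) {
    if (retryCount < MAX_RETRIES) {
      // Re-queue with exponential backoff, recording this attempt
      // so the retry limit is eventually reached
      const delay = calculateBackoff(retryCount);
      await requeueMessage({ ...message, retryCount: retryCount + 1 }, delay);
    } else {
      // Retries exhausted: move to the Dead Letter Queue
      await moveToDLQ(message);
      await logStatus(message.id, "FAILED");
    }
  }
}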
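The pseudocode leaves calculateBackoff undefined. One possible version is sketched below, assuming a 1-minute base delay that doubles per retry, a 15-minute cap, and random jitter so that many failed messages do not all retry at the same instant. The constants and the jitter scheme are assumptions, not values prescribed by any particular queue.

```javascript
// Assumed tuning values; adjust to the queue's limits and the provider's behavior.
const BASE_DELAY_MS = 60_000;        // 1 minute
const MAX_DELAY_MS = 15 * 60_000;    // cap at 15 minutes

// Returns a delay in milliseconds for the given retry count (0-based).
// `random` is injectable so the function can be tested deterministically.
function calculateBackoff(retryCount, random = Math.random) {
  const exponential = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** retryCount);
  // "Equal jitter": keep half the delay fixed and randomize the other half.
  const half = exponential / 2;
  return Math.round(half + random() * half);
}

for (let attempt = 0; attempt < 5; attempt++) {
  console.log(`retry ${attempt}: wait ~${calculateBackoff(attempt)} ms`);
}
```

Without jitter, a provider outage that fails thousands of messages at once would produce synchronized retry waves against a service that is just recovering.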

Step 5: Choosing Delivery Channels & External Services

We won’t build our own email or push infrastructure. We’ll use third-party APIs.

  • Email: Use a transactional email service like AWS SES, SendGrid, or Postmark.
    • Why? They handle the complexities of email deliverability, such as IP reputation, DKIM/SPF records, and unsubscribe links. Building this ourselves is a massive undertaking.
  • Push Notifications: Use Firebase Cloud Messaging (FCM) for Android/Web and Apple Push Notification Service (APNs) for iOS.
    • Why? APNs is the required delivery path for iOS devices, and FCM is the standard one for Android (FCM can also relay messages to iOS through APNs). We can create a simple abstraction layer in our NotificationWorker to handle both.
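The abstraction layer mentioned above could look roughly like this: every provider is wrapped in an adapter with the same send method, and the worker picks one by channel and device platform. The adapter keys, the platform values, and the class name are hypothetical; real adapters would call the provider SDKs and return promises.

```javascript
// Map a channel and device platform to a provider key (names are assumptions).
function providerFor(channel, platform) {
  if (channel === "EMAIL") return "EMAIL_PROVIDER";
  if (channel === "PUSH") return platform === "ios" ? "APNS" : "FCM";
  throw new Error(`Unsupported channel: ${channel}`);
}

class NotificationSender {
  // adapters: object keyed by provider name, each exposing send(notification)
  constructor(adapters) {
    this.adapters = adapters;
  }

  send(channel, platform, notification) {
    const key = providerFor(channel, platform);
    const adapter = this.adapters[key];
    if (!adapter) throw new Error(`No adapter registered for ${key}`);
    return adapter.send(notification);
  }
}

// Usage with stub adapters standing in for the real provider clients.
const sender = new NotificationSender({
  EMAIL_PROVIDER: { send: (n) => `email:${n.title}` },
  FCM: { send: (n) => `fcm:${n.title}` },
  APNS: { send: (n) => `apns:${n.title}` },
});
console.log(sender.send("PUSH", "ios", { title: "New reply" }));
```

Because the worker only depends on the send interface, swapping SendGrid for SES or adding SMS later means registering another adapter rather than changing worker logic.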

Step 6: Database Design and Storage

We need to store two main things: user preferences and notification logs.

Data Classification

  • Hot Data: User preferences (read frequently by workers). This is a good candidate for caching.
  • Warm Data: Recent notification logs (last 30 days), queried for debugging.
  • Cold Data: Logs older than 30 days can be archived to cheaper storage like Amazon S3.

Schema Design

A relational database like PostgreSQL is a good choice.

user_notification_preferences table:

CREATE TABLE user_notification_preferences (
  user_id INT PRIMARY KEY,
  email_enabled BOOLEAN DEFAULT TRUE,
  push_enabled BOOLEAN DEFAULT TRUE,
  -- Add columns for specific event types if needed
  -- e.g., comment_reply_email_enabled BOOLEAN
  updated_at TIMESTAMP
);

notification_logs table:

CREATE TABLE notification_logs (
  id UUID PRIMARY KEY,
  user_id INT NOT NULL,
  event_type VARCHAR(50) NOT NULL,
  channel VARCHAR(20) NOT NULL, -- 'EMAIL' or 'PUSH'
  status VARCHAR(20) NOT NULL, -- 'PENDING', 'SENT', 'FAILED'
  created_at TIMESTAMP NOT NULL,
  retry_count INT DEFAULT 0,
  error_message TEXT
);
  • Why UUID for id? The application (producer or worker) can generate the ID itself before the row is inserted, without a round trip to the database, which helps with idempotency.

Storage Tier Strategy

  • Hot: User preferences can be cached in Redis for fast lookups by the workers.
  • Warm: The PostgreSQL database holds recent logs.
  • Cold: A scheduled background job (e.g., nightly) can move logs older than 30 days from PostgreSQL to Amazon S3 for long-term, low-cost storage.

Step 7: Scaling the System

“Our current design is scalable, but let’s identify potential bottlenecks as we grow to millions of users.”

Bottlenecks & Solutions

  1. Message Queue Overload:

    • Problem: A massive event (e.g., a site-wide announcement) could flood the queue.
    • Solution:
      • Multiple Queues: Use separate queues for high-priority (transactional) and low-priority (marketing) notifications.
      • Auto-scaling Workers: Use a container orchestration system like Kubernetes to automatically scale the number of NotificationWorker pods based on queue depth.
  2. Throttling by External Services:

    • Problem: Email and push providers have rate limits. Sending too many requests too quickly will result in errors.
    • Solution: Implement rate limiting in the NotificationWorker. Before calling an external API, check a Redis-based counter to ensure we are within the provider’s limits.
  3. Database Write Load:

    • Problem: Logging every notification status creates a high volume of writes to the database.
    • Solution:
      • Batch Writes: Instead of writing one log entry at a time, the worker can batch them in memory and write 100 at a time.
      • Async Logging: For non-critical logs, write to a separate, high-throughput logging queue first, and have another set of workers handle the database writes.
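For the provider throttling case, the counter check can be sketched as a fixed-window limiter. This version keeps counters in an in-memory Map purely for illustration; in the design above the counter would live in Redis (for example, an incremented key that expires at the end of the window) so that all workers share it. The class name and limits are assumptions.

```javascript
class FixedWindowLimiter {
  // limit: max calls per window; windowMs: window length; now: injectable clock
  constructor(limit, windowMs, now = Date.now) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now;
    this.counters = new Map(); // provider -> { windowStart, count }
  }

  // Returns true if a call to `provider` is allowed right now.
  tryAcquire(provider) {
    const windowStart = Math.floor(this.now() / this.windowMs) * this.windowMs;
    let entry = this.counters.get(provider);
    if (!entry || entry.windowStart !== windowStart) {
      entry = { windowStart, count: 0 }; // new window: reset the counter
      this.counters.set(provider, entry);
    }
    if (entry.count >= this.limit) return false;
    entry.count += 1;
    return true;
  }
}

// Usage: allow at most 2 calls per second to a hypothetical "email" provider.
const limiter = new FixedWindowLimiter(2, 1000);
console.log(limiter.tryAcquire("email"), limiter.tryAcquire("email"), limiter.tryAcquire("email"));
```

When tryAcquire returns false, the worker would re-queue the message with a short delay instead of calling the provider and getting a rate-limit error.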

Capacity Planning

  • Peak Load: 1,400 notifications/minute.
  • Worker Capacity: Assume one worker can process 100 notifications/minute.
  • Required Workers: 1400 / 100 = 14 workers.
  • With Redundancy: To be safe, we should run at least 2x this number, so ~28-30 workers during peak times.
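The same calculation as a small helper, in case the inputs change during the interview. The function name and the default 2x redundancy factor simply mirror the numbers above.

```javascript
// Workers needed to absorb peak load, with a safety multiplier on top.
function requiredWorkers(peakPerMinute, perWorkerPerMinute, redundancyFactor = 2) {
  const base = Math.ceil(peakPerMinute / perWorkerPerMinute);
  return { base, withRedundancy: Math.ceil(base * redundancyFactor) };
}

console.log(requiredWorkers(1400, 100)); // { base: 14, withRedundancy: 28 }
```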

Step 8: Security and Permissions

“Security is crucial. Here’s how we’ll secure the system.”

  • Authentication: Event producers must be authenticated and authorized to publish messages to the queue. This can be done using IAM roles (in AWS) or service-to-service tokens.
  • Input Validation: The NotificationWorker must validate all incoming messages to prevent malformed data from crashing the system.
  • Preventing Abuse: Implement rate limiting on the event producer side. A single user should not be able to trigger thousands of notifications (e.g., by repeatedly triggering a password reset).
  • PII Protection: Be careful not to log Personally Identifiable Information (PII) in plain text. For example, don’t log the user’s email address in the notification_logs table; use user_id instead.

Permission Model:

// Example permissions for services
const PERMISSIONS = {
  AUTH_SERVICE: ["produce:password_reset_event"],
  COMMENT_SERVICE: ["produce:comment_reply_event"],
  NOTIFICATION_WORKER: ["consume:notification_jobs"],
};
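A check against this permission map could be as small as the following sketch. The isAllowed helper is hypothetical; in practice the enforcement would usually sit in the queue's access policy (for example, IAM) rather than in application code.

```javascript
const PERMISSIONS = {
  AUTH_SERVICE: ["produce:password_reset_event"],
  COMMENT_SERVICE: ["produce:comment_reply_event"],
  NOTIFICATION_WORKER: ["consume:notification_jobs"],
};

// Returns true only if the named service has been granted the action.
function isAllowed(service, action) {
  const granted = PERMISSIONS[service];
  return Array.isArray(granted) && granted.includes(action);
}

console.log(isAllowed("AUTH_SERVICE", "produce:password_reset_event")); // true
console.log(isAllowed("COMMENT_SERVICE", "produce:password_reset_event")); // false
```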

Step 9: Handling Edge Cases

“A robust system handles edge cases gracefully. Here are a few to consider.”

  1. Invalid Email Address or Push Token:
    • Scenario: A user signs up with a typo in their email.
    • Approach: Our email service (like SendGrid) will notify us of a “bounce.” We should listen for these webhooks, mark the user’s email as invalid in our database, and stop sending them emails.
  2. Duplicate Notifications:
    • Scenario: A network error causes a producer to send the same event twice.
    • Approach: Idempotency. The producer should generate a unique ID (idempotency key) for each event. The NotificationWorker can use a Redis cache to keep track of recently processed event IDs. If it sees an ID it has already processed, it ignores the duplicate.
  3. User Unsubscribes:
    • Scenario: A user clicks “unsubscribe” from an email.
    • Approach: The link should lead to a page that updates their preferences in the user_notification_preferences table. The NotificationWorker must check these preferences before sending any notification.

Step 10: Performance Optimizations

“Here are a few optimizations to improve latency and reduce cost.”

  1. Batching: Non-urgent notifications (e.g., new likes) can be batched. Instead of sending 5 separate push notifications, a background job can run every 5 minutes, aggregate them, and send a single push: “You have 5 new likes.”
  2. Caching User Preferences: The NotificationWorker will read user preferences for every single notification. Caching this data in Redis with a short TTL (Time to Live) will significantly reduce database load.
  3. Connection Pooling: The NotificationWorker should maintain a persistent pool of connections to the database and external services to avoid the overhead of establishing new connections for each message.
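The preference caching in point 2 follows the cache-aside pattern. Below is a minimal sketch that uses an in-memory Map in place of Redis and a synchronous loader in place of a real database query; both are simplifications, and the class name, loader, and TTL value are assumptions.

```javascript
class PreferenceCache {
  // loadFromDb: function(userId) -> preferences; ttlMs: entry lifetime; now: injectable clock
  constructor(loadFromDb, ttlMs, now = Date.now) {
    this.loadFromDb = loadFromDb;
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map(); // userId -> { value, expiresAt }
  }

  get(userId) {
    const cached = this.entries.get(userId);
    if (cached && cached.expiresAt > this.now()) return cached.value; // cache hit
    const value = this.loadFromDb(userId);                            // miss: read through
    this.entries.set(userId, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }

  // Call this when a user changes preferences or unsubscribes.
  invalidate(userId) {
    this.entries.delete(userId);
  }
}

// Usage with a stub loader that counts database reads.
let dbReads = 0;
const cache = new PreferenceCache(() => { dbReads += 1; return { emailEnabled: true }; }, 60_000);
cache.get(42);
cache.get(42);
console.log(dbReads); // 1
```

The short TTL bounds how long a stale preference can be served; explicit invalidation on unsubscribe tightens that further for compliance-sensitive changes.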

Real-World Implementations

Slack Notification System

  • What they use: A distributed message queue (similar to Kafka) and a real-time delivery system.
  • Key Innovations: They use a complex system to batch notifications intelligently. If you get 10 mentions in a minute, you get one push notification, not ten. They also have sophisticated user-level controls for notification silencing.

Netflix Email Delivery

  • What they use: AWS Simple Email Service (SES) at a massive scale.
  • Key Innovations: They have a dedicated “Messaging Engineering” team that focuses on deliverability, A/B testing subject lines, and personalizing email content at scale. They treat email as a core product feature.

Common Interview Follow-Up Questions

Q: How would you add SMS notifications?

Answer: “I’d keep the same event pipeline and add SMS as another delivery adapter:

  1. Add SMS as a channel type in templates and user preferences.
  2. Normalize payloads so workers can render channel-specific content.
  3. Add quiet hours and country-specific compliance checks (for example opt-in rules).
  4. Track per-provider delivery receipts for observability and retries.

This reuses existing architecture and avoids branching business logic per channel.”

Q: What if a provider like SendGrid is down for an hour?

Answer: “I’d handle this with automated failover:

  1. Detect elevated 5xx/timeouts via rolling error-rate SLOs.
  2. Trip a circuit breaker for the failing provider.
  3. Route new traffic to a secondary provider and keep retrying failed jobs with backoff.
  4. Replay dead-letter queue items after provider recovery.

Trade-off: Multi-provider integration costs more engineering time, but protects delivery SLOs during outages.”
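A simplified circuit breaker for this failover might look like the sketch below. It trips on consecutive failures rather than a rolling error rate, and it simply closes again after a cooldown instead of modeling a separate half-open state; both are simplifications, and the thresholds and names are assumptions.

```javascript
class CircuitBreaker {
  constructor(failureThreshold, cooldownMs, now = Date.now) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
    this.failures = 0;
    this.openedAt = null; // timestamp when the breaker tripped, or null if closed
  }

  isOpen() {
    if (this.openedAt === null) return false;
    if (this.now() - this.openedAt >= this.cooldownMs) {
      // Cooldown elapsed: close the breaker and let traffic try the provider again.
      this.openedAt = null;
      this.failures = 0;
      return false;
    }
    return true;
  }

  recordSuccess() {
    this.failures = 0;
  }

  recordFailure() {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.openedAt = this.now();
  }
}

// Route to the secondary provider while the primary's breaker is open.
function chooseProvider(primaryBreaker, primary, secondary) {
  return primaryBreaker.isOpen() ? secondary : primary;
}

const breaker = new CircuitBreaker(3, 30_000);
for (let i = 0; i < 3; i++) breaker.recordFailure();
console.log(chooseProvider(breaker, "sendgrid", "ses")); // "ses"
```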

Q: How would you ensure the order of notifications?

Answer: “I would enforce ordering only where business logic requires it:

  1. Partition messages by userId or conversationId.
  2. Use FIFO queues for these partitions and standard queues for everything else.
  3. Include monotonic sequence numbers to detect out-of-order delivery.
  4. Drop or defer stale sequence numbers at the worker.

Trade-off: FIFO lowers throughput, so we scope it to order-sensitive notification types.”
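Steps 3 and 4 of this answer can be sketched as a per-partition sequence check at the worker. This version drops stale messages rather than deferring them, and tracks state in memory; a shared store would be needed across multiple workers. The class and method names are hypothetical.

```javascript
class SequenceTracker {
  constructor() {
    this.lastSeen = new Map(); // partitionKey -> highest sequence delivered so far
  }

  // Returns true if the message is newer than anything delivered for this partition.
  shouldDeliver(partitionKey, sequence) {
    const last = this.lastSeen.has(partitionKey) ? this.lastSeen.get(partitionKey) : -Infinity;
    if (sequence <= last) return false; // stale or duplicate: drop
    this.lastSeen.set(partitionKey, sequence);
    return true;
  }
}

const tracker = new SequenceTracker();
console.log(tracker.shouldDeliver("user:42", 1)); // true
console.log(tracker.shouldDeliver("user:42", 3)); // true
console.log(tracker.shouldDeliver("user:42", 2)); // false (arrived late)
```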

Q: How would you implement idempotency to prevent duplicate notifications?

Answer: “I would make deduplication explicit:

  1. Create an idempotency key from (eventType, userId, objectId, timeBucket).
  2. Write this key to Redis atomically, only if it does not already exist (SETNX, or SET with the NX option and a TTL), before sending.
  3. If key exists, skip send and mark as duplicate in logs.
  4. Keep a longer-lived dedup table in storage for audit-critical events.

This prevents both queue retries and producer retries from creating duplicate sends.”
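A rough sketch of the key construction and the set-if-absent check follows. The in-memory Set stands in for Redis, the 1-minute bucket size is an assumption, and both names are hypothetical. Note that two duplicates landing on either side of a bucket boundary would get different keys, which is one reason to prefer a producer-generated event ID when one is available.

```javascript
// Build a deterministic key; events in the same time bucket collapse to one key.
function idempotencyKey(eventType, userId, objectId, timestampMs, bucketMs = 60_000) {
  const timeBucket = Math.floor(timestampMs / bucketMs);
  return `${eventType}:${userId}:${objectId}:${timeBucket}`;
}

class DedupStore {
  constructor() {
    this.keys = new Set();
  }

  // Mirrors set-if-not-exists semantics: true if the key was newly stored.
  setIfAbsent(key) {
    if (this.keys.has(key)) return false;
    this.keys.add(key);
    return true;
  }
}

const store = new DedupStore();
const key = idempotencyKey("COMMENT_REPLY", 42, "c9", 120_500);
console.log(key);                    // "COMMENT_REPLY:42:c9:2"
console.log(store.setIfAbsent(key)); // true  -> send
console.log(store.setIfAbsent(key)); // false -> skip as duplicate
```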

Q: How do you prevent notification fatigue while keeping important alerts reliable?

Answer: “I’d introduce user-centric throttling and prioritization:

  1. Classify notifications as critical, important, or promotional.
  2. Never batch critical alerts, but digest low-priority events into hourly/daily summaries.
  3. Enforce per-user and per-channel rate limits.
  4. Let users configure preferences and honor quiet hours.

Trade-off: Digests reduce noise and unsubscribe risk, but may slightly delay non-urgent engagement events.”
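The split between immediate and digested notifications could be sketched like this. The priority values follow the classification in the answer; the function name and the shape of the notification objects are assumptions.

```javascript
// Critical notifications go out immediately; everything else is counted
// per user and event type so it can be sent later as one summary.
function planDigest(notifications) {
  const immediate = [];
  const grouped = new Map();
  for (const n of notifications) {
    if (n.priority === "critical") {
      immediate.push(n);
      continue;
    }
    const key = `${n.userId}:${n.eventType}`;
    const entry = grouped.get(key) || { userId: n.userId, eventType: n.eventType, count: 0 };
    entry.count += 1;
    grouped.set(key, entry);
  }
  return { immediate, digests: [...grouped.values()] };
}

const plan = planDigest([
  { userId: 1, eventType: "LIKE", priority: "promotional" },
  { userId: 1, eventType: "LIKE", priority: "promotional" },
  { userId: 1, eventType: "PASSWORD_RESET", priority: "critical" },
]);
console.log(plan.immediate.length, plan.digests); // 1 [ { userId: 1, eventType: 'LIKE', count: 2 } ]
```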

Conclusion

Key Takeaways:

  • Decouple Everything: The core principle is to use a message queue to decouple event producers from notification consumers.
  • Embrace Failure: Assume external services will fail and design a robust retry mechanism with exponential backoff and a dead-letter queue.
  • Don’t Reinvent the Wheel: Use third-party services for email (SendGrid, SES) and push (FCM, APNs).
  • Think About Scale: Identify and address bottlenecks related to queue depth, database writes, and external API rate limits.
  • User is King: Always check user preferences before sending a notification.

Interview Tips:

  • Start by clarifying requirements. It shows you’re a thoughtful engineer.
  • Draw the high-level architecture first, then dive deep into the most complex parts.
  • Always explain the “why” behind your technology choices (e.g., “I’m using a message queue because…”).
  • Mention real-world companies (like Slack or Netflix) to show you understand how these systems work in practice.

