Design a Notification System

A system design guide to sending email, push, SMS, and in-app notifications, covering preferences, fanout, retries, deduplication, and provider failover.

Tags: queues, fanout, push, email, retries

Interview Prompt

Design a notification system that can send transactional and engagement notifications across push, email, SMS, and in-app channels.

A strong answer:

  • Separates notification request ingestion from channel delivery workers.
  • Handles user preferences, quiet hours, deduplication, and templates.
  • Uses queues, retries, dead-letter handling, and provider failover.
  • Distinguishes transactional notifications from marketing or engagement notifications.

Step 1

Clarify functional and non-functional requirements first.

Functional Requirements

  • Services can request notifications using a common API.
  • Users can configure channel preferences and quiet hours.
  • The system sends push, email, SMS, and in-app notifications.
  • Templates support localization and variable substitution.
  • Delivery status and failures are visible for debugging.

Non-Functional Requirements

  • Transactional notifications should be delivered reliably and quickly.
  • Marketing notifications can be delayed, batched, or rate limited.
  • The system should avoid duplicate sends.
  • Provider outages should not lose notification requests.
  • The platform should respect compliance and unsubscribe rules.

Scale Assumptions

  • 100 million users.
  • 500 million notification requests per day.
  • Push dominates volume; SMS is expensive and lower volume.
  • Major product events can create fanout spikes.

  • Request QPS: ~5,800/sec average. 500M daily requests, with peak multiples during campaigns or incidents.
  • Fanout: channel dependent. One request can create multiple channel deliveries after preference evaluation.
  • Status writes: high volume. Each notification may generate queued, sent, delivered, opened, failed, and retried events.
  • Provider limits: external bottleneck. Email, SMS, and push providers impose throughput and reputation constraints.
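The average request rate follows directly from the daily volume. The short calculation below reproduces it; the peak multiplier and the deliveries-per-request figure are hypothetical planning inputs chosen for illustration, not numbers from this guide.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_requests = 500_000_000
average_qps = daily_requests / SECONDS_PER_DAY  # about 5,787, rounded to ~5,800 in the guide

# Hypothetical assumptions for sizing; replace with measured values.
assumed_peak_multiplier = 5
assumed_deliveries_per_request = 1.5

peak_qps = average_qps * assumed_peak_multiplier
average_delivery_qps = average_qps * assumed_deliveries_per_request

print(f"average request QPS:  {average_qps:,.0f}")
print(f"assumed peak QPS:     {peak_qps:,.0f}")
print(f"assumed delivery QPS: {average_delivery_qps:,.0f}")
```

Sizing queues and workers for the assumed peak rather than the average is what keeps campaign spikes from delaying transactional traffic.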

Step 2

Identify the key entities before picking storage.

| Entity | Fields and Relationships | Interview Notes |
| --- | --- | --- |
| NotificationRequest | request_id, user_id, template_id, category, priority, status | Durable record for support and retries. |
| DeliveryAttempt | request_id, channel, provider, attempt, status, provider_message_id | Tracks per-channel send lifecycle. |
| Preference | user_id, category, channel, enabled, quiet_hours | Evaluated before fanout. |
| Template | template_id, locale, channel, body, variables, version | Template changes should be versioned for audits. |
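One way to make these entities concrete is as plain Python dataclasses. This is a minimal sketch: the field names come from the entity list above, while the field types, the `Channel` enum, the default status, and the `(start_hour, end_hour)` shape of `quiet_hours` are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Channel(str, Enum):
    PUSH = "push"
    EMAIL = "email"
    SMS = "sms"
    IN_APP = "in_app"


@dataclass(frozen=True)
class NotificationRequest:
    request_id: str
    user_id: str
    template_id: str
    category: str            # e.g. "transactional" or "marketing" (assumed values)
    priority: str
    status: str = "accepted"  # assumed initial state


@dataclass
class DeliveryAttempt:
    request_id: str
    channel: Channel
    provider: str
    attempt: int
    status: str
    provider_message_id: Optional[str] = None  # filled in once the provider responds


@dataclass(frozen=True)
class Preference:
    user_id: str
    category: str
    channel: Channel
    enabled: bool
    quiet_hours: Optional[Tuple[int, int]] = None  # assumed (start_hour, end_hour), local time


@dataclass(frozen=True)
class Template:
    template_id: str
    locale: str
    channel: Channel
    body: str
    variables: Tuple[str, ...]
    version: int
```

`DeliveryAttempt` is left mutable because its status changes over the send lifecycle; the other records are frozen to reflect that a request or a template version should not change after it is written.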

Step 3

Define the APIs around the user flows.

| Interface | Request / Response | Contract Notes |
| --- | --- | --- |
| POST /v1/notifications | { userId, templateId, category, priority, data, idempotencyKey } | Returns accepted after durable enqueue, not after final delivery. |
| PUT /v1/users/{id}/notification-preferences | { category, channels, quietHours, frequency } | Preference checks should be central, not copied into every caller. |
| POST /v1/provider-webhooks/{provider} | Receives delivered, bounced, failed, or opened events | Webhook handling must be idempotent. |
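The "accepted after durable enqueue" contract for POST /v1/notifications can be sketched as follows. This is an in-memory illustration only: `IngestService`, the 202 and 400 status codes, and the `duplicate` flag are assumptions, the `deque` stands in for a durable queue, and the dictionary stands in for a unique constraint on the idempotency key.

```python
import uuid
from collections import deque

REQUIRED_FIELDS = ("userId", "templateId", "category", "priority", "idempotencyKey")


class IngestService:
    def __init__(self):
        self.queue = deque()          # stand-in for a durable queue
        self.request_id_by_key = {}   # stand-in for a unique index on idempotencyKey

    def submit(self, payload):
        missing = [f for f in REQUIRED_FIELDS if f not in payload]
        if missing:
            return {"status": 400, "error": "missing fields: " + ", ".join(missing)}

        key = payload["idempotencyKey"]
        if key in self.request_id_by_key:
            # A retried call returns the original request instead of enqueuing again.
            return {"status": 202, "requestId": self.request_id_by_key[key], "duplicate": True}

        request_id = str(uuid.uuid4())
        # In a real system the enqueue and the key record would need to be
        # committed together (or made safely repeatable) to survive crashes.
        self.queue.append({"requestId": request_id, **payload})
        self.request_id_by_key[key] = request_id
        return {"status": 202, "requestId": request_id, "duplicate": False}
```

The response says nothing about delivery; callers that need delivery state would read it from the status records later.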

Step 4

Trace the critical data flow step by step.

01. Ingest request

The caller sends a notification request with category, priority, and idempotency key. The API validates the request and durably enqueues it.

02. Preference and policy

A router applies preferences, quiet hours, compliance rules, deduplication, and channel selection.

03. Channel queues

Push, email, SMS, and in-app deliveries go to separate queues so each can scale and retry independently.

04. Provider delivery

Workers render templates, call providers, handle rate limits, and write delivery attempts.

05. Status and analytics

Provider webhooks and client events update delivery state, open rates, and suppression lists.
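The routing and channel-queue steps can be sketched as a small fanout function. This is a simplified illustration: the `route` function, the dictionary-based preference lookup, and the default of treating a missing preference as disabled are assumptions, and quiet hours, compliance rules, and deduplication are left out to keep the fanout itself visible.

```python
from collections import defaultdict, deque

CHANNELS = ("push", "email", "sms", "in_app")


def route(request, preferences, channel_queues):
    """Fan one accepted request out to every channel the user has enabled.

    preferences maps (user_id, category, channel) -> bool.
    A missing entry is treated as disabled (an assumed default).
    """
    enabled = [
        channel
        for channel in CHANNELS
        if preferences.get((request["user_id"], request["category"], channel), False)
    ]
    for channel in enabled:
        # Each channel has its own queue so it can scale and retry independently.
        channel_queues[channel].append(
            {"request_id": request["request_id"], "channel": channel}
        )
    return enabled


queues = defaultdict(deque)
prefs = {
    ("u1", "security", "push"): True,
    ("u1", "security", "email"): True,
    ("u1", "security", "sms"): False,
}
selected = route({"request_id": "r1", "user_id": "u1", "category": "security"}, prefs, queues)
print(selected)
```

One request here produces two deliveries, which is why delivery volume can exceed request volume.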

Step 5

Convert the flow into a high-level design.

Final Design

[Diagram: Notification System final architecture]

Serving Layer

Product services call the common notification API. The synchronous path ends once the request is validated and durably enqueued; delivery happens later.

State Layer

Anchor the design around the key entities: NotificationRequest, DeliveryAttempt, Preference, Template.

Async Layer

Move slow, high-volume, or failure-prone work behind queues and workers: preference routing, template rendering, provider calls, retries, and webhook processing.

Step 6

Deep dives interviewers are likely to probe.

Deduplication

  • Require caller-provided idempotency keys for transactional notifications.
  • Use content-based dedup windows for noisy engagement events.
  • Make provider webhook processing idempotent because providers retry webhooks.
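A content-based dedup window for noisy engagement events might look like the sketch below. The `DedupWindow` class and the choice of fingerprint fields (user, template, data) are assumptions; in a shared deployment this state would more likely live in a store with expiring keys than in process memory.

```python
import hashlib
import json


class DedupWindow:
    """Suppress sends whose content matches one already sent within the window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._last_sent = {}  # fingerprint -> timestamp of the last accepted send

    @staticmethod
    def fingerprint(user_id, template_id, data):
        # sort_keys makes the hash independent of dictionary ordering.
        canonical = json.dumps({"u": user_id, "t": template_id, "d": data}, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def should_send(self, user_id, template_id, data, now):
        fp = self.fingerprint(user_id, template_id, data)
        last = self._last_sent.get(fp)
        if last is not None and now - last < self.window_seconds:
            return False  # duplicate inside the window; suppressed sends do not extend it
        self._last_sent[fp] = now
        return True
```

Passing `now` explicitly keeps the logic testable; a caller would typically supply `time.time()`.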

Retries

  • Retry transient provider errors with exponential backoff and jitter.
  • Do not retry permanent failures such as invalid phone number or unsubscribed email.
  • Use dead-letter queues for repeated failures and operator review.
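The retry rules above can be expressed as one small decision function. This sketch uses the "full jitter" variant of exponential backoff; the error codes, attempt limit, and delay bounds are hypothetical values chosen for illustration.

```python
import random

MAX_ATTEMPTS = 5        # assumed limit before dead-lettering
BASE_DELAY = 1.0        # seconds, assumed
MAX_DELAY = 300.0       # seconds, assumed cap

# Hypothetical error codes; real providers each use their own.
PERMANENT_ERRORS = {"invalid_phone_number", "unsubscribed_email"}


def backoff_delay(attempt, rng=random):
    """Full jitter: a uniform delay between 0 and the capped exponential ceiling."""
    ceiling = min(MAX_DELAY, BASE_DELAY * 2 ** (attempt - 1))
    return rng.uniform(0, ceiling)


def next_action(error_code, attempt):
    """Decide what to do after a failed attempt (attempt numbers start at 1)."""
    if error_code in PERMANENT_ERRORS:
        return ("drop", None)           # retrying cannot succeed
    if attempt >= MAX_ATTEMPTS:
        return ("dead_letter", None)    # park for operator review
    return ("retry", backoff_delay(attempt))
```

Jitter spreads retries out so that many deliveries failing at the same moment do not all hit a recovering provider at once.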

Priority

  • Password reset and fraud alerts should bypass marketing throttles.
  • Digest and promotional notifications can be delayed or dropped under load.
  • Separate queues prevent low-priority campaigns from delaying critical alerts.
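A minimal sketch of priority separation, assuming just two lanes: critical work is always served first, and low-priority work is shed once its lane is full. The `PriorityLanes` class and its capacity parameter are illustrative; a production design might instead give each lane its own worker pool.

```python
from collections import deque


class PriorityLanes:
    def __init__(self, low_priority_capacity):
        self.critical = deque()
        self.low = deque()
        self.low_priority_capacity = low_priority_capacity

    def enqueue(self, item, critical):
        """Return True if the item was queued, False if it was shed under load."""
        if critical:
            self.critical.append(item)  # critical items are never shed
            return True
        if len(self.low) >= self.low_priority_capacity:
            return False
        self.low.append(item)
        return True

    def next(self):
        """Serve critical items before any low-priority item."""
        if self.critical:
            return self.critical.popleft()
        if self.low:
            return self.low.popleft()
        return None
```

Because the lanes are separate, a large campaign fills only the low-priority lane and a password reset enqueued afterwards is still served next.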

Step 7

Tradeoffs to explain out loud.

Single queue vs per-channel queues

  • Use when: Per-channel queues isolate provider limits and retry behavior.
  • Watch out: More queues require more operational dashboards and controls.

Send immediately vs digest

  • Use when: Digest reduces noise for low-priority engagement notifications.
  • Watch out: Transactional notifications should not wait for digest windows.

Provider abstraction vs provider-specific workers

  • Use when: An abstraction eases failover and routing.
  • Watch out: Provider features differ; hiding all details can make debugging harder.

Avoid

Common mistakes that weaken the answer.

  • Calling email or SMS providers synchronously from product services.
  • Ignoring user preferences and legal unsubscribe requirements.
  • Using one queue for password resets and marketing blasts.
  • Retrying permanent failures forever.
  • Not tracking provider message IDs for debugging.

Step 8

Follow-up questions with strong answers.

How do you avoid sending the same notification twice?

Use idempotency keys at ingest, unique constraints on request identity, dedup windows for noisy events, and idempotent provider webhook handling.
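Idempotent webhook handling, the last item in that answer, can be sketched as below. The `WebhookProcessor` class and the `event_id` field are assumptions; real providers name and shape their event identifiers differently, and the seen-set would need durable storage.

```python
class WebhookProcessor:
    def __init__(self):
        self.seen_event_ids = set()     # stand-in for a durable set of processed events
        self.status_by_message = {}     # provider_message_id -> latest status

    def handle(self, event):
        """Apply a provider event once; return False for a redelivered event."""
        event_id = event["event_id"]    # hypothetical field name
        if event_id in self.seen_event_ids:
            return False                # provider retried the webhook; nothing to do
        self.seen_event_ids.add(event_id)
        self.status_by_message[event["provider_message_id"]] = event["status"]
        return True
```

Acknowledging a redelivered event without reapplying it keeps counters such as open rates from being inflated by provider retries.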

How do you handle an SMS provider outage?

Pause or reroute SMS deliveries by provider health, retry transient failures, use secondary providers for critical categories, and alert operators.
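Rerouting by provider health can be sketched as an ordered failover loop. Everything named here is hypothetical: `ProviderUnavailable`, the `(name, send_function)` provider list, and the `is_healthy` callback, which stands in for whatever health signal the platform tracks.

```python
class ProviderUnavailable(Exception):
    """Raised by a provider client for a transient, provider-side failure."""


def send_with_failover(message, providers, is_healthy):
    """Try providers in order, skipping unhealthy ones.

    providers: ordered list of (name, send_function) pairs.
    Returns (provider_name, provider_message_id) from the first success.
    """
    errors = []
    for name, send in providers:
        if not is_healthy(name):
            errors.append((name, "skipped: marked unhealthy"))
            continue
        try:
            return name, send(message)
        except ProviderUnavailable as exc:
            errors.append((name, str(exc)))
    # Nothing succeeded; the caller can retry later or alert operators.
    raise ProviderUnavailable(f"all providers failed: {errors}")
```

Recording which provider handled the message matters because the provider message ID is only meaningful alongside the provider name when debugging.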

How do quiet hours work?

Evaluate user timezone and category priority. Low-priority notifications are scheduled for later; critical transactional alerts may bypass quiet hours.
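The quiet-hours check reduces to converting the current time into the user's timezone and testing a window that may wrap past midnight. In this sketch the function names and the boolean `critical` flag are assumptions, and a fixed UTC offset stands in for the user's timezone; a real system would more likely use a named zone (for example via Python's `zoneinfo`) so that daylight saving changes are handled.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_quiet_hours(now_utc: datetime, user_tz: tzinfo, start: time, end: time) -> bool:
    """True if the user's local time falls inside [start, end)."""
    local = now_utc.astimezone(user_tz).time()
    if start <= end:
        return start <= local < end
    # Window wraps past midnight, e.g. 22:00 to 07:00.
    return local >= start or local < end


def should_defer(now_utc, user_tz, start, end, critical):
    """Critical transactional alerts bypass quiet hours; everything else waits."""
    return (not critical) and in_quiet_hours(now_utc, user_tz, start, end)


# Usage: 03:30 UTC is 22:30 the previous evening at UTC-5.
user_tz = timezone(timedelta(hours=-5))
now = datetime(2024, 1, 15, 3, 30, tzinfo=timezone.utc)
print(should_defer(now, user_tz, time(22, 0), time(7, 0), critical=False))
```

A deferred notification would then be rescheduled for the end of the window rather than dropped.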

Step 9

What a strong answer should signal.

Pipeline design

Uses durable ingest, routing, per-channel queues, workers, and status tracking.

Reliability

Covers retries, idempotency, dead letters, and provider outages.

User trust

Honors preferences, quiet hours, unsubscribe rules, and deduplication.

Operations

Includes delivery metrics, provider webhooks, templates, and support visibility.
