Interview Prompt
Design a notification system that can send transactional and engagement notifications across push, email, SMS, and in-app channels.
A strong answer:
- Separates notification request ingestion from channel delivery workers.
- Handles user preferences, quiet hours, deduplication, and templates.
- Uses queues, retries, dead-letter handling, and provider failover.
- Distinguishes transactional notifications from marketing or engagement notifications.
Step 1
Clarify functional and non-functional requirements first.
Functional Requirements
- Services can request notifications using a common API.
- Users can configure channel preferences and quiet hours.
- The system sends push, email, SMS, and in-app notifications.
- Templates support localization and variable substitution.
- Delivery status and failures are visible for debugging.
Non-Functional Requirements
- Transactional notifications should be delivered reliably and quickly.
- Marketing notifications can be delayed, batched, or rate limited.
- The system should avoid duplicate sends.
- Provider outages should not lose notification requests.
- The platform should respect compliance and unsubscribe rules.
Scale Assumptions
- 100 million users.
- 500 million notification requests per day.
- Push dominates volume; SMS is expensive and lower volume.
- Major product events can create fanout spikes.
| Metric | Estimate | Notes |
|---|---|---|
| Request QPS | ~5,800/sec average | 500M daily requests, with peak multiples during campaigns or incidents. |
| Fanout | Channel dependent | One request can create multiple channel deliveries after preference evaluation. |
| Status writes | High volume | Each notification may generate queued, sent, delivered, opened, failed, and retried events. |
| Provider limits | External bottleneck | Email, SMS, and push providers impose throughput and reputation constraints. |
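The average request rate quoted above follows directly from the scale assumptions. A minimal sketch of the arithmetic is shown here; the 10x peak multiplier is a hypothetical value chosen only for illustration, since the guide does not state one.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_qps(requests_per_day: int) -> float:
    """Average requests per second over a full day."""
    return requests_per_day / SECONDS_PER_DAY


def peak_qps(requests_per_day: int, peak_multiplier: float) -> float:
    """Peak rate, given an assumed multiple of the average."""
    return average_qps(requests_per_day) * peak_multiplier


DAILY_REQUESTS = 500_000_000
USERS = 100_000_000

print(round(average_qps(DAILY_REQUESTS)))   # 5787, i.e. roughly 5,800/sec
print(DAILY_REQUESTS / USERS)               # 5.0 requests per user per day
print(round(peak_qps(DAILY_REQUESTS, 10)))  # 57870 with a hypothetical 10x peak
```

In an interview it is usually enough to state the average and then name a peak multiple explicitly, so the interviewer can challenge the assumption.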
Step 2
Identify the key entities before picking storage.
| Entity | Fields and Relationships | Interview Notes |
|---|---|---|
| NotificationRequest | request_id, user_id, template_id, category, priority, status | Durable record for support and retries. |
| DeliveryAttempt | request_id, channel, provider, attempt, status, provider_message_id | Tracks per-channel send lifecycle. |
| Preference | user_id, category, channel, enabled, quiet_hours | Evaluated before fanout. |
| Template | template_id, locale, channel, body, variables, version | Template changes should be versioned for audits. |
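One way to make the entity table concrete is to write the records as plain data classes. This is only a sketch: the field names come from the table, while the types, default values, and the choice to make `Template` immutable are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NotificationRequest:
    request_id: str
    user_id: str
    template_id: str
    category: str             # e.g. "transactional" or "marketing"
    priority: str             # e.g. "high" or "low" (assumed values)
    status: str = "accepted"  # assumed initial status


@dataclass
class DeliveryAttempt:
    request_id: str
    channel: str
    provider: str
    attempt: int
    status: str
    provider_message_id: Optional[str] = None  # unknown until the provider responds


@dataclass
class Preference:
    user_id: str
    category: str
    channel: str
    enabled: bool
    quiet_hours: Optional[tuple] = None  # assumed shape: (start, end) in local time


@dataclass(frozen=True)  # assumption: a published template version never changes
class Template:
    template_id: str
    locale: str
    channel: str
    body: str
    variables: tuple
    version: int
```

Freezing `Template` reflects the audit note in the table: a change produces a new version rather than mutating an old one.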
Step 3
Define the APIs around the user flows.
| Interface | Request / Response | Contract Notes |
|---|---|---|
| POST /v1/notifications | { userId, templateId, category, priority, data, idempotencyKey } | Returns accepted after durable enqueue, not after final delivery. |
| PUT /v1/users/{id}/notification-preferences | { category, channels, quietHours, frequency } | Preference checks should be central, not copied into every caller. |
| POST /v1/provider-webhooks/{provider} | Receives delivered, bounced, failed, or opened events | Webhook handling must be idempotent. |
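The contract for `POST /v1/notifications` (accept after durable enqueue, never send twice for the same key) can be sketched in memory as follows. The `NotificationIngest` class, the response shape, and the use of a `deque` and a `dict` as stand-ins for a durable queue and a unique-key store are all assumptions for illustration.

```python
import uuid
from collections import deque


class NotificationIngest:
    """In-memory stand-in for the ingest API; a real system would use durable stores."""

    def __init__(self):
        self._request_id_by_key = {}  # idempotency key -> request id
        self.queue = deque()          # stand-in for a durable queue

    def submit(self, user_id, template_id, category, priority, data, idempotency_key):
        existing = self._request_id_by_key.get(idempotency_key)
        if existing is not None:
            # Same key seen before: return the original request, enqueue nothing.
            return {"status": "accepted", "requestId": existing, "duplicate": True}

        request_id = str(uuid.uuid4())
        self.queue.append({
            "request_id": request_id,
            "user_id": user_id,
            "template_id": template_id,
            "category": category,
            "priority": priority,
            "data": data,
        })
        # Record the key only after the enqueue has succeeded.
        self._request_id_by_key[idempotency_key] = request_id
        return {"status": "accepted", "requestId": request_id, "duplicate": False}
```

A caller that retries after a timeout gets the same `requestId` back, which is what makes retries safe on the caller side.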
Step 4
Trace the critical data flow step by step.
Ingest request
Caller sends a notification request with category, priority, and idempotency key. API validates and durably enqueues it.
Preference and policy
A router applies preferences, quiet hours, compliance rules, deduplication, and channel selection.
Channel queues
Push, email, SMS, and in-app deliveries go to separate queues so each can scale and retry independently.
Provider delivery
Workers render templates, call providers, handle rate limits, and write delivery attempts.
Status and analytics
Provider webhooks and client events update delivery state, open rates, and suppression lists.
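The routing step in this flow (preference evaluation followed by fanout into per-channel queues) might look roughly like the sketch below. The `route` function, the preference key shape, and the opt-in default for missing preferences are assumptions, not something the guide prescribes.

```python
from collections import deque

CHANNELS = ("push", "email", "sms", "in_app")


def route(request, preferences, channel_queues):
    """Enqueue one delivery per enabled channel and return the channels chosen.

    preferences maps (user_id, category, channel) -> bool.
    A missing entry is treated as disabled (assumed opt-in policy).
    """
    selected = []
    for channel in CHANNELS:
        key = (request["user_id"], request["category"], channel)
        if preferences.get(key, False):
            channel_queues[channel].append(request["request_id"])
            selected.append(channel)
    return selected


queues = {channel: deque() for channel in CHANNELS}
prefs = {
    ("u1", "marketing", "push"): True,
    ("u1", "marketing", "email"): True,
    ("u1", "marketing", "sms"): False,
}
request = {"request_id": "r1", "user_id": "u1", "category": "marketing"}
print(route(request, prefs, queues))  # ['push', 'email']
```

Because each channel has its own queue, a slow SMS provider backs up only the SMS queue and leaves push and email unaffected.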
Step 5
Convert the flow into a high-level design.
Final Design
[Figure: Notification System final architecture]
Serving Layer
Product services call a common notification API, which validates each request and durably enqueues it before returning. This is the only synchronous path callers depend on.
State Layer
Anchor the design around the key entities: NotificationRequest, DeliveryAttempt, Preference, Template.
Async Layer
Move routing, template rendering, provider calls, retries, and status updates behind queues and workers, so that slow or failure-prone work never blocks the caller.
Step 6
Deep dives interviewers are likely to probe.
Deduplication
- Require caller-provided idempotency keys for transactional notifications.
- Use content-based dedup windows for noisy engagement events.
- Make provider webhook processing idempotent because providers retry webhooks.
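A content-based dedup window for noisy engagement events can be sketched as follows. The `DedupWindow` class and the choice of hashing `(user_id, template_id, data)` are assumptions; a production system would typically keep these keys in a shared store with expiry rather than in process memory.

```python
import hashlib
import json


class DedupWindow:
    """Suppress identical notifications sent to the same user within a time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._last_sent = {}  # content hash -> timestamp of the last allowed send

    def should_send(self, user_id, template_id, data, now):
        payload = json.dumps([user_id, template_id, data], sort_keys=True)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # duplicate inside the window: suppress
        self._last_sent[key] = now
        return True


window = DedupWindow(window_seconds=300)
print(window.should_send("u1", "new_like", {"post": 42}, now=1000))  # True
print(window.should_send("u1", "new_like", {"post": 42}, now=1100))  # False
print(window.should_send("u1", "new_like", {"post": 42}, now=1300))  # True
```

Suppressed events do not extend the window here, so a steady stream of duplicates still yields one notification per window rather than none.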
Retries
- Retry transient provider errors with exponential backoff and jitter.
- Do not retry permanent failures such as invalid phone number or unsubscribed email.
- Use dead-letter queues for repeated failures and operator review.
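The three retry rules above can be combined in one small delivery loop. This sketch uses the "full jitter" variant of exponential backoff; the exception classes, the returned status strings, and the default limits are hypothetical names chosen for the example.

```python
import random
import time


class TransientError(Exception):
    """Provider timeout, throttling, or 5xx: worth retrying."""


class PermanentError(Exception):
    """Invalid phone number, unsubscribed address, and similar: never retry."""


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt)."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def deliver_with_retries(send, max_attempts=5, sleep=time.sleep):
    """Call send() until it succeeds, fails permanently, or exhausts its attempts."""
    for attempt in range(max_attempts):
        try:
            send()
            return "sent"
        except PermanentError:
            return "failed_permanent"
        except TransientError:
            if attempt + 1 < max_attempts:
                sleep(backoff_delay(attempt))
    return "dead_lettered"  # caller moves the message to a dead-letter queue
```

Passing `sleep` as a parameter keeps the loop testable, and a worker backed by a real queue would more likely re-enqueue with a delay than block a thread.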
Priority
- Password reset and fraud alerts should bypass marketing throttles.
- Digest and promotional notifications can be delayed or dropped under load.
- Separate queues prevent low-priority campaigns from delaying critical alerts.
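The priority rules above can be illustrated with two queues and a simple load-shedding limit. The `PriorityQueues` class and the `low_priority_limit` parameter are assumptions made for this sketch.

```python
from collections import deque


class PriorityQueues:
    """Critical items are always accepted and served first; low priority is bounded."""

    def __init__(self, low_priority_limit):
        self.critical = deque()
        self.low = deque()
        self.low_priority_limit = low_priority_limit

    def enqueue(self, item, critical):
        if critical:
            self.critical.append(item)
            return True
        if len(self.low) >= self.low_priority_limit:
            return False  # shed low-priority work under load
        self.low.append(item)
        return True

    def next(self):
        """Return the next item to deliver, or None if both queues are empty."""
        if self.critical:
            return self.critical.popleft()
        if self.low:
            return self.low.popleft()
        return None


queues = PriorityQueues(low_priority_limit=2)
queues.enqueue("promo-1", critical=False)
queues.enqueue("password-reset", critical=True)
print(queues.next())  # password-reset
```

Strict priority can starve the low queue during a long incident, so real deployments often give each class its own worker pool instead.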
Step 7
Tradeoffs to explain out loud.
| Tradeoff | Use When | Watch Out |
|---|---|---|
| Single queue vs per-channel queues | Per-channel queues isolate provider limits and retry behavior. | More queues require more operational dashboards and controls. |
| Send immediately vs digest | Digest reduces noise for low-priority engagement notifications. | Transactional notifications should not wait for digest windows. |
| Provider abstraction vs provider-specific workers | An abstraction eases failover and routing. | Provider features differ; hiding all details can make debugging harder. |
Avoid
Common mistakes that weaken the answer.
- Calling email or SMS providers synchronously from product services.
- Ignoring user preferences and legal unsubscribe requirements.
- Using one queue for password resets and marketing blasts.
- Retrying permanent failures forever.
- Not tracking provider message IDs for debugging.
Step 8
Follow-up questions with strong answers.
How do you avoid sending the same notification twice?
Use idempotency keys at ingest, unique constraints on request identity, dedup windows for noisy events, and idempotent provider webhook handling.
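The webhook half of this answer is easy to get wrong, because providers may redeliver events and deliver them out of order. A minimal sketch follows; the `WebhookProcessor` class, the per-event `event_id`, and the status ranking are assumptions, and real providers differ in what identifiers they send.

```python
# Assumed lifecycle ordering: a status never moves back to a lower rank.
STATUS_RANK = {"queued": 0, "sent": 1, "delivered": 2, "bounced": 2, "failed": 2, "opened": 3}


class WebhookProcessor:
    def __init__(self):
        self._seen_events = set()  # (provider, event_id) pairs already applied
        self.status = {}           # provider_message_id -> latest status

    def handle(self, provider, event_id, provider_message_id, status):
        """Apply a webhook event at most once; return False for a duplicate."""
        key = (provider, event_id)
        if key in self._seen_events:
            return False
        self._seen_events.add(key)

        current = self.status.get(provider_message_id, "queued")
        # Ignore unknown statuses and late events that would move the state backwards.
        if STATUS_RANK.get(status, -1) > STATUS_RANK[current]:
            self.status[provider_message_id] = status
        return True
```

Tracking `provider_message_id` on each delivery attempt is what lets a webhook be matched back to the original request.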
How do you handle an SMS provider outage?
Pause or reroute SMS deliveries by provider health, retry transient failures, use secondary providers for critical categories, and alert operators.
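Health-based rerouting can be sketched with a consecutive-failure counter per provider. The `ProviderRouter` class, the threshold of three failures, and the provider names are hypothetical; the rule that only critical traffic fails over to a secondary provider mirrors the answer above.

```python
class ProviderRouter:
    """Pick a provider by health; only critical traffic uses secondary providers."""

    def __init__(self, providers, failure_threshold=3):
        self.providers = list(providers)  # ordered, primary first
        self.failure_threshold = failure_threshold
        self._consecutive_failures = {p: 0 for p in self.providers}

    def record_success(self, provider):
        self._consecutive_failures[provider] = 0

    def record_failure(self, provider):
        self._consecutive_failures[provider] += 1

    def is_healthy(self, provider):
        return self._consecutive_failures[provider] < self.failure_threshold

    def choose(self, critical):
        """Return a provider name, or None to pause delivery and retry later."""
        primary = self.providers[0]
        if self.is_healthy(primary):
            return primary
        if critical:
            for provider in self.providers[1:]:
                if self.is_healthy(provider):
                    return provider
        return None


router = ProviderRouter(["sms_primary", "sms_backup"])
for _ in range(3):
    router.record_failure("sms_primary")
print(router.choose(critical=True))   # sms_backup
print(router.choose(critical=False))  # None
```

A `None` result leaves the message in its queue, so nothing is lost while the primary provider is down.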
How do quiet hours work?
Evaluate user timezone and category priority. Low-priority notifications are scheduled for later; critical transactional alerts may bypass quiet hours.
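A quiet-hours check has two details worth showing: converting to the user's local time, and handling windows that wrap past midnight. In this sketch the function names and the `"send_now"` / `"defer"` results are assumptions, and the example uses a fixed UTC offset to stay self-contained.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_quiet_hours(now_utc: datetime, user_tz: tzinfo, start: time, end: time) -> bool:
    """True if the user's local time falls inside [start, end)."""
    local = now_utc.astimezone(user_tz).time()
    if start <= end:
        return start <= local < end
    # Window wraps midnight, e.g. 22:00 to 07:00.
    return local >= start or local < end


def decide(now_utc, user_tz, start, end, critical: bool) -> str:
    """Critical notifications bypass quiet hours; everything else is deferred."""
    if critical or not in_quiet_hours(now_utc, user_tz, start, end):
        return "send_now"
    return "defer"


user_tz = timezone(timedelta(hours=-5))                        # fixed offset for the example
night = datetime(2024, 1, 15, 3, 30, tzinfo=timezone.utc)      # 22:30 local time
print(decide(night, user_tz, time(22), time(7), critical=False))  # defer
print(decide(night, user_tz, time(22), time(7), critical=True))   # send_now
```

In practice a named zone such as `zoneinfo.ZoneInfo("America/New_York")` is a better fit than a fixed offset, because it follows daylight saving changes.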
Step 9
What a strong answer should signal.
Pipeline design
Uses durable ingest, routing, per-channel queues, workers, and status tracking.
Reliability
Covers retries, idempotency, dead letters, and provider outages.
User trust
Honors preferences, quiet hours, unsubscribe rules, and deduplication.
Operations
Includes delivery metrics, provider webhooks, templates, and support visibility.