Interview Prompt
Design a distributed rate limiter that protects APIs from abusive or excessive traffic while allowing legitimate bursts.
Clarifies limit dimensions such as user, IP, API key, endpoint, and organization.
Chooses an algorithm and explains precision versus cost.
Handles distributed counters, atomic updates, and fail-open or fail-closed behavior.
Mentions configuration, observability, and client response headers.
Step 1
Clarify functional and non-functional requirements first.
Functional Requirements
- Enforce request limits by API key, user, IP, endpoint, and plan tier.
- Allow limited bursts while protecting backend services.
- Return clear rate-limit headers and retry guidance.
- Support dynamic configuration changes.
- Expose logs and metrics for security and customer support.
Non-Functional Requirements
- Decision latency should be under a few milliseconds at the gateway.
- Limiter should be highly available and not become the bottleneck.
- Accuracy can be approximate for low-risk endpoints but stricter for expensive operations.
- The design should handle multiple gateway instances and regions.
- Failure behavior must be explicit for critical and non-critical endpoints.
Scale Assumptions
- 100,000 requests per second across all APIs.
- 10 million active API keys.
- Most keys are quiet, but some generate heavy bursts.
- Limits vary by endpoint and account plan.
Decision QPS
100k/sec
Every request asks the limiter or a local approximation before hitting origin services.
Hot key risk
High skew
A few abusive or high-volume keys can dominate limiter writes.
Storage
TTL counters
Most limiter state expires quickly and belongs in fast in-memory stores.
Latency budget
<5 ms
Gateway rate limiting needs a tiny budget to avoid increasing p95 API latency.
Step 2
Identify the key entities before picking storage.
| Entity | Fields and Relationships | Interview Notes |
|---|---|---|
| Policy | policy_id, dimension, endpoint_pattern, limit, window, burst, priority | Cached by gateways and limiter workers. |
| Counter | key, window_start, count, ttl | Used by fixed or sliding window algorithms. |
| TokenBucket | key, tokens, last_refill_at | Supports bursts with controlled refill rate. |
| RateLimitEvent | subject, endpoint, decision, policy_id, timestamp | Useful for debugging and abuse monitoring. |
Step 3
Define the APIs around the user flows.
| Interface | Request / Response | Contract Notes |
|---|---|---|
| POST /internal/rate-limit/check | { subject, endpoint, cost } -> { allowed, remaining, resetAt } | Often implemented as an internal library or sidecar instead of a remote HTTP call. |
| PUT /v1/rate-limit-policies/{id} | { dimensions, algorithm, limit, window, burst } | Configuration should propagate safely and support rollback. |
| GET /v1/rate-limit-events | Returns blocked events for support and security analysis | Sample high-volume events to control cost. |
Step 4
Trace the critical data flow step by step.
Gateway integration
API gateway extracts dimensions, loads matching policies, and asks a local or shared limiter before proxying.
Fast state store
Redis or an in-memory distributed store maintains counters or token buckets with atomic scripts and TTLs.
Local smoothing
Gateways can use small local token buckets to reduce shared-store calls and absorb tiny bursts.
Decision response
Gateway returns 429 with remaining quota, reset time, and retry-after headers when blocked.
Control plane
Policy service manages limits, rollout, audit history, and pushes updates to gateways.
Step 5
Convert the flow into a high-level design.
Final Design
Rate Limiter final architecture
Serving Layer
Start with clients, routing, APIs, and the main synchronous path users depend on for this problem.
State Layer
Anchor the design around the key entities: Policy, Counter, TokenBucket, RateLimitEvent.
Async Layer
Move slow, high-volume, or failure-prone work behind queues, workers, streams, caches, or background reconciliation.
Step 6
Deep dives interviewers are likely to probe.
Algorithms
- Fixed window is simple but allows boundary bursts.
- Sliding window is more accurate but costs more state or computation.
- Token bucket is often a strong default because it supports controlled bursts.
Distributed accuracy
- A single shared counter is accurate but can become a hot key.
- Sharded counters reduce hot key pressure but are approximate until aggregated.
- Local gateway limits are fastest but can over-allow across many gateways.
Failure modes
- Fail-open protects availability but can expose backends during abuse.
- Fail-closed protects critical resources but can block legitimate traffic.
- Many systems choose behavior by endpoint risk and customer tier.
Step 7
Tradeoffs to explain out loud.
Central limiter vs gateway-local limiter
Use When
Central state improves consistency across gateway instances.
Watch Out
Central state adds network latency and can fail independently.
Token bucket vs sliding window
Use When
Token bucket is simple and handles bursts well.
Watch Out
Sliding window gives more intuitive exact per-window semantics.
Exact enforcement vs approximate enforcement
Use When
Approximate is acceptable for broad abuse reduction.
Watch Out
Billing-sensitive or expensive operations may need stricter guarantees.
Avoid
Common mistakes that weaken the answer.
- Using one global lock or counter for all traffic.
- Ignoring hot keys from abusive clients.
- Forgetting rate-limit response headers.
- Failing closed for every endpoint without considering availability impact.
- Hardcoding limits so support and operations cannot adjust them.
Step 8
Follow-up questions with strong answers.
How do you rate limit across multiple regions?
Use regional limits for low latency and a lower-frequency global budget for coarse enforcement. Strict global limits require cross-region coordination and higher latency.
What happens if Redis is down?
Use local fallback budgets, choose fail-open or fail-closed by endpoint, emit alerts, and shed high-risk traffic with cached policies.
How do you handle one customer with many API keys?
Apply hierarchical limits: per key, per user, per organization, and per endpoint cost class.
Step 9
What a strong answer should signal.
Algorithm choice
Explains token bucket, fixed window, and sliding window tradeoffs.
Distributed systems
Handles multiple gateways, atomic updates, hot keys, and multi-region behavior.
Operations
Includes dynamic config, metrics, logs, headers, and failure policy.
Product fit
Supports different dimensions, plan tiers, and endpoint costs.
Practice this problem under interview conditions.
Read the guide, then run the prompt live with LeetSys so you can practice requirements, key entities, API design, data flow, whiteboarding, tradeoff narration, and follow-up handling.
Related Guides
Mid-Level
URL Shortener
A practical system design interview guide for building a URL shortener with redirects, custom aliases, analytics, rate limits, and high availability.
Mid-Level
Search Autocomplete
A system design interview guide for typeahead search suggestions with tries, prefix indexes, ranking, freshness, personalization, and abuse filtering.
Mid-Level
Notification System
A useful system design guide for sending email, push, SMS, and in-app notifications with preferences, fanout, retries, deduplication, and provider failover.