Design a Distributed Rate Limiter: System Design Interview Guide

Interview Prompt

Design a distributed rate limiter that protects APIs from abusive or excessive traffic while allowing legitimate bursts.

Clarifies limit dimensions such as user, IP, API key, endpoint, and organization.

Chooses an algorithm and explains precision versus cost.

Handles distributed counters, atomic updates, and fail-open or fail-closed behavior.

Mentions configuration, observability, and client response headers.

Step 1

Clarify functional and non-functional requirements first.

Functional Requirements

Enforce request limits by API key, user, IP, endpoint, and plan tier.
Allow limited bursts while protecting backend services.
Return clear rate-limit headers and retry guidance.
Support dynamic configuration changes.
Expose logs and metrics for security and customer support.

Non-Functional Requirements

Decision latency should be under a few milliseconds at the gateway.
Limiter should be highly available and not become the bottleneck.
Accuracy can be approximate for low-risk endpoints but stricter for expensive operations.
The design should handle multiple gateway instances and regions.
Failure behavior must be explicit for critical and non-critical endpoints.

Scale Assumptions

100,000 requests per second across all APIs.
10 million active API keys.
Most keys are quiet, but some generate heavy bursts.
Limits vary by endpoint and account plan.

Decision QPS

100k/sec

Every request asks the limiter or a local approximation before hitting origin services.

Hot key risk

High skew

A few abusive or high-volume keys can dominate limiter writes.

Storage

TTL counters

Most limiter state expires quickly and belongs in fast in-memory stores.

Latency budget

<5 ms

Gateway rate limiting needs a tiny budget to avoid increasing p95 API latency.

Step 2

Identify the key entities before picking storage.

Entity	Fields and Relationships	Interview Notes
Policy	policy_id, dimension, endpoint_pattern, limit, window, burst, priority	Cached by gateways and limiter workers.
Counter	key, window_start, count, ttl	Used by fixed or sliding window algorithms.
TokenBucket	key, tokens, last_refill_at	Supports bursts with controlled refill rate.
RateLimitEvent	subject, endpoint, decision, policy_id, timestamp	Useful for debugging and abuse monitoring.

Step 3

Define the APIs around the user flows.

Interface	Request / Response	Contract Notes
POST /internal/rate-limit/check	{ subject, endpoint, cost } -> { allowed, remaining, resetAt }	Often implemented as an internal library or sidecar instead of a remote HTTP call.
PUT /v1/rate-limit-policies/{id}	{ dimensions, algorithm, limit, window, burst }	Configuration should propagate safely and support rollback.
GET /v1/rate-limit-events	Returns blocked events for support and security analysis	Sample high-volume events to control cost.

Step 4

Trace the critical data flow step by step.

Gateway integration

API gateway extracts dimensions, loads matching policies, and asks a local or shared limiter before proxying.

Fast state store

Redis or an in-memory distributed store maintains counters or token buckets with atomic scripts and TTLs.

Local smoothing

Gateways can use small local token buckets to reduce shared-store calls and absorb tiny bursts.

Decision response

Gateway returns 429 with remaining quota, reset time, and retry-after headers when blocked.

Control plane

Policy service manages limits, rollout, audit history, and pushes updates to gateways.

Step 5

Convert the flow into a high-level design.

Final Design

Rate Limiter final architecture

Loading Diagram

Serving Layer

Start with clients, routing, APIs, and the main synchronous path users depend on for this problem.

State Layer

Anchor the design around the key entities: Policy, Counter, TokenBucket, RateLimitEvent.

Async Layer

Move slow, high-volume, or failure-prone work behind queues, workers, streams, caches, or background reconciliation.

Step 6

Deep dives interviewers are likely to probe.

Algorithms

Fixed window is simple but allows boundary bursts.
Sliding window is more accurate but costs more state or computation.
Token bucket is often a strong default because it supports controlled bursts.

Distributed accuracy

A single shared counter is accurate but can become a hot key.
Sharded counters reduce hot key pressure but are approximate until aggregated.
Local gateway limits are fastest but can over-allow across many gateways.

Failure modes

Fail-open protects availability but can expose backends during abuse.
Fail-closed protects critical resources but can block legitimate traffic.
Many systems choose behavior by endpoint risk and customer tier.

Step 7

Tradeoffs to explain out loud.

Central limiter vs gateway-local limiter

Use When

Central state improves consistency across gateway instances.

Watch Out

Central state adds network latency and can fail independently.

Token bucket vs sliding window

Use When

Token bucket is simple and handles bursts well.

Watch Out

Sliding window gives more intuitive exact per-window semantics.

Exact enforcement vs approximate enforcement

Use When

Approximate is acceptable for broad abuse reduction.

Watch Out

Billing-sensitive or expensive operations may need stricter guarantees.

Avoid

Common mistakes that weaken the answer.

Using one global lock or counter for all traffic.
Ignoring hot keys from abusive clients.
Forgetting rate-limit response headers.
Failing closed for every endpoint without considering availability impact.
Hardcoding limits so support and operations cannot adjust them.

Step 8

Follow-up questions with strong answers.

How do you rate limit across multiple regions?

Use regional limits for low latency and a lower-frequency global budget for coarse enforcement. Strict global limits require cross-region coordination and higher latency.

What happens if Redis is down?

Use local fallback budgets, choose fail-open or fail-closed by endpoint, emit alerts, and shed high-risk traffic with cached policies.

How do you handle one customer with many API keys?

Apply hierarchical limits: per key, per user, per organization, and per endpoint cost class.

Step 9

What a strong answer should signal.

Algorithm choice

Explains token bucket, fixed window, and sliding window tradeoffs.

Distributed systems

Handles multiple gateways, atomic updates, hot keys, and multi-region behavior.

Operations

Includes dynamic config, metrics, logs, headers, and failure policy.

Product fit

Supports different dimensions, plan tiers, and endpoint costs.

Practice this problem under interview conditions.

Read the guide, then run the prompt live with LeetSys so you can practice requirements, key entities, API design, data flow, whiteboarding, tradeoff narration, and follow-up handling.

Practice Now

Related Guides

Mid-Level

Design a Distributed Rate Limiter

Clarify functional and non-functional requirements first.

Functional Requirements

Non-Functional Requirements

Scale Assumptions

Identify the key entities before picking storage.

Define the APIs around the user flows.

Trace the critical data flow step by step.

Gateway integration

Fast state store

Local smoothing

Decision response

Control plane

Convert the flow into a high-level design.

Rate Limiter final architecture

Serving Layer

State Layer

Async Layer

Deep dives interviewers are likely to probe.

Algorithms

Distributed accuracy

Failure modes

Tradeoffs to explain out loud.

Central limiter vs gateway-local limiter

Token bucket vs sliding window

Exact enforcement vs approximate enforcement

Common mistakes that weaken the answer.

Follow-up questions with strong answers.

How do you rate limit across multiple regions?

What happens if Redis is down?

How do you handle one customer with many API keys?

What a strong answer should signal.

Practice this problem under interview conditions.

Related Guides

URL Shortener

Search Autocomplete

Notification System