System Design Guides

Design a Distributed Rate Limiter

A focused system design guide for distributed rate limiting with token buckets, sliding windows, Redis, local caches, multi-region behavior, and abuse controls.

rate limitingredistokensabuseapi gateways

Interview Prompt

Design a distributed rate limiter that protects APIs from abusive or excessive traffic while allowing legitimate bursts.

Clarifies limit dimensions such as user, IP, API key, endpoint, and organization.

Chooses an algorithm and explains precision versus cost.

Handles distributed counters, atomic updates, and fail-open or fail-closed behavior.

Mentions configuration, observability, and client response headers.

Step 1

Clarify functional and non-functional requirements first.

Functional Requirements

  • Enforce request limits by API key, user, IP, endpoint, and plan tier.
  • Allow limited bursts while protecting backend services.
  • Return clear rate-limit headers and retry guidance.
  • Support dynamic configuration changes.
  • Expose logs and metrics for security and customer support.

Non-Functional Requirements

  • Decision latency should be under a few milliseconds at the gateway.
  • Limiter should be highly available and not become the bottleneck.
  • Accuracy can be approximate for low-risk endpoints but stricter for expensive operations.
  • The design should handle multiple gateway instances and regions.
  • Failure behavior must be explicit for critical and non-critical endpoints.

Scale Assumptions

  • 100,000 requests per second across all APIs.
  • 10 million active API keys.
  • Most keys are quiet, but some generate heavy bursts.
  • Limits vary by endpoint and account plan.

Decision QPS

100k/sec

Every request asks the limiter or a local approximation before hitting origin services.

Hot key risk

High skew

A few abusive or high-volume keys can dominate limiter writes.

Storage

TTL counters

Most limiter state expires quickly and belongs in fast in-memory stores.

Latency budget

<5 ms

Gateway rate limiting needs a tiny budget to avoid increasing p95 API latency.

Step 2

Identify the key entities before picking storage.

EntityFields and RelationshipsInterview Notes
Policypolicy_id, dimension, endpoint_pattern, limit, window, burst, priorityCached by gateways and limiter workers.
Counterkey, window_start, count, ttlUsed by fixed or sliding window algorithms.
TokenBucketkey, tokens, last_refill_atSupports bursts with controlled refill rate.
RateLimitEventsubject, endpoint, decision, policy_id, timestampUseful for debugging and abuse monitoring.

Step 3

Define the APIs around the user flows.

InterfaceRequest / ResponseContract Notes
POST /internal/rate-limit/check{ subject, endpoint, cost } -> { allowed, remaining, resetAt }Often implemented as an internal library or sidecar instead of a remote HTTP call.
PUT /v1/rate-limit-policies/{id}{ dimensions, algorithm, limit, window, burst }Configuration should propagate safely and support rollback.
GET /v1/rate-limit-eventsReturns blocked events for support and security analysisSample high-volume events to control cost.

Step 4

Trace the critical data flow step by step.

01

Gateway integration

API gateway extracts dimensions, loads matching policies, and asks a local or shared limiter before proxying.

02

Fast state store

Redis or an in-memory distributed store maintains counters or token buckets with atomic scripts and TTLs.

03

Local smoothing

Gateways can use small local token buckets to reduce shared-store calls and absorb tiny bursts.

04

Decision response

Gateway returns 429 with remaining quota, reset time, and retry-after headers when blocked.

05

Control plane

Policy service manages limits, rollout, audit history, and pushes updates to gateways.

Step 5

Convert the flow into a high-level design.

Final Design

Rate Limiter final architecture

Loading Diagram

Serving Layer

Start with clients, routing, APIs, and the main synchronous path users depend on for this problem.

State Layer

Anchor the design around the key entities: Policy, Counter, TokenBucket, RateLimitEvent.

Async Layer

Move slow, high-volume, or failure-prone work behind queues, workers, streams, caches, or background reconciliation.

Step 6

Deep dives interviewers are likely to probe.

Algorithms

  • Fixed window is simple but allows boundary bursts.
  • Sliding window is more accurate but costs more state or computation.
  • Token bucket is often a strong default because it supports controlled bursts.

Distributed accuracy

  • A single shared counter is accurate but can become a hot key.
  • Sharded counters reduce hot key pressure but are approximate until aggregated.
  • Local gateway limits are fastest but can over-allow across many gateways.

Failure modes

  • Fail-open protects availability but can expose backends during abuse.
  • Fail-closed protects critical resources but can block legitimate traffic.
  • Many systems choose behavior by endpoint risk and customer tier.

Step 7

Tradeoffs to explain out loud.

Central limiter vs gateway-local limiter

Use When

Central state improves consistency across gateway instances.

Watch Out

Central state adds network latency and can fail independently.

Token bucket vs sliding window

Use When

Token bucket is simple and handles bursts well.

Watch Out

Sliding window gives more intuitive exact per-window semantics.

Exact enforcement vs approximate enforcement

Use When

Approximate is acceptable for broad abuse reduction.

Watch Out

Billing-sensitive or expensive operations may need stricter guarantees.

Avoid

Common mistakes that weaken the answer.

  • Using one global lock or counter for all traffic.
  • Ignoring hot keys from abusive clients.
  • Forgetting rate-limit response headers.
  • Failing closed for every endpoint without considering availability impact.
  • Hardcoding limits so support and operations cannot adjust them.

Step 8

Follow-up questions with strong answers.

How do you rate limit across multiple regions?

Use regional limits for low latency and a lower-frequency global budget for coarse enforcement. Strict global limits require cross-region coordination and higher latency.

What happens if Redis is down?

Use local fallback budgets, choose fail-open or fail-closed by endpoint, emit alerts, and shed high-risk traffic with cached policies.

How do you handle one customer with many API keys?

Apply hierarchical limits: per key, per user, per organization, and per endpoint cost class.

Step 9

What a strong answer should signal.

Algorithm choice

Explains token bucket, fixed window, and sliding window tradeoffs.

Distributed systems

Handles multiple gateways, atomic updates, hot keys, and multi-region behavior.

Operations

Includes dynamic config, metrics, logs, headers, and failure policy.

Product fit

Supports different dimensions, plan tiers, and endpoint costs.

Practice this problem under interview conditions.

Read the guide, then run the prompt live with LeetSys so you can practice requirements, key entities, API design, data flow, whiteboarding, tradeoff narration, and follow-up handling.

Practice Now

Related Guides