Fix Anthropic API 429 "Too Many Requests"
A 429 from the Claude / Anthropic API means you crossed a rate limit — not
that anything is broken. This page explains the three limits that trigger it, the right
way to recover, and how to stop hitting it in the first place.
Fast answer: read the retry-after header and wait that
long, then retry with exponential backoff + jitter. To stop recurring 429s: lower
concurrency, enable prompt caching, batch non-urgent calls, and spread load across more
upstream capacity.
The three limits behind a 429
Anthropic enforces per-minute limits that scale with your usage tier. Any one of them can return a 429:
| Limit | What it caps | Typical trigger |
|---|---|---|
| RPM | Requests per minute | Many small calls — agent loops, autocomplete |
| ITPM | Input tokens per minute | Large context re-sent every turn |
| OTPM | Output tokens per minute | Long generations in parallel |
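To see which of the three limits in the table you are closest to, a rough client-side tally can help. The sketch below keeps a rolling 60-second count of requests, input tokens, and output tokens and reports each as a fraction of a limit. The class name `UsageWindow` and the limit numbers are illustrative assumptions, not values from Anthropic, and the server's own accounting may differ from a simple rolling window.

```python
import time
from collections import deque


class UsageWindow:
    """Rolling tally of requests, input tokens, and output tokens (hypothetical helper)."""

    def __init__(self, limits, window_s=60.0):
        # limits: {"rpm": ..., "itpm": ..., "otpm": ...}; use your own tier's numbers.
        self.limits = limits
        self.window_s = window_s
        self.events = deque()  # (timestamp, input_tokens, output_tokens)

    def record(self, input_tokens, output_tokens, now=None):
        """Log one completed request."""
        ts = time.monotonic() if now is None else now
        self.events.append((ts, input_tokens, output_tokens))

    def utilization(self, now=None):
        """Return each counter as a fraction of its limit over the last window."""
        now = time.monotonic() if now is None else now
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window_s:
            self.events.popleft()
        used = {
            "rpm": len(self.events),
            "itpm": sum(e[1] for e in self.events),
            "otpm": sum(e[2] for e in self.events),
        }
        return {name: used[name] / self.limits[name] for name in used}

    def tightest(self, now=None):
        """Name of the limit with the highest utilization."""
        util = self.utilization(now)
        return max(util, key=util.get)


if __name__ == "__main__":
    # Placeholder limits for illustration only.
    window = UsageWindow({"rpm": 50, "itpm": 30_000, "otpm": 8_000})
    window.record(input_tokens=20_000, output_tokens=500)
    window.record(input_tokens=4_000, output_tokens=300)
    print(window.tightest())
```

If `tightest()` keeps returning `itpm`, the fix is usually smaller or cached context. If it returns `rpm`, the fix is usually fewer or less bursty calls.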
The 429 response carries a retry-after header giving the number of seconds to wait. Honor it: it is
the server telling you how long to wait before a retry can succeed.
Recover correctly: retry-after + backoff
The single most common mistake is retrying instantly in a loop, which keeps you pinned against the limit. Do this instead:
import random
import time

import httpx

def call_with_backoff(client: httpx.Client, payload, max_retries=6):
    delay = 1.0
    for attempt in range(max_retries):
        r = client.post("/v1/messages", json=payload)
        if r.status_code != 429:
            return r
        # Prefer the server's retry-after; fall back to exponential backoff + jitter
        try:
            wait = float(r.headers["retry-after"])
        except (KeyError, ValueError):
            wait = delay  # header missing or not a plain number of seconds
        time.sleep(wait + random.uniform(0, 0.5))
        delay = min(delay * 2, 30)
    raise RuntimeError("Still rate-limited after retries")
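To get a feel for how long the function above can block when no retry-after header is present, the fallback waits can be worked out directly. This is a small sketch that mirrors the doubling and the 30-second cap from the code above. The helper name `fallback_waits` is made up for illustration, and jitter is left out so the numbers are deterministic.

```python
def fallback_waits(max_retries=6, start=1.0, cap=30.0):
    """Base waits used when the server sends no retry-after (jitter excluded)."""
    waits, delay = [], start
    for _ in range(max_retries):
        waits.append(delay)
        delay = min(delay * 2, cap)  # double each time, never above the cap
    return waits


waits = fallback_waits()
print(waits)       # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
print(sum(waits))  # 61.0
```

With the defaults, a caller waits about 61 seconds in total before the function gives up, plus up to 0.5 seconds of jitter per attempt. Size `max_retries` with that worst case in mind for interactive requests.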
Stop hitting it: five durable fixes
- Lower concurrency: Cap parallel in-flight requests (e.g. a semaphore of 4–8). Bursts are what trip RPM; a steady stream rarely does.
- Enable prompt caching: Cache stable prefixes (system prompt, project context). On most current models, input tokens read from the cache do not count toward ITPM, so repetitive context stops eating your limit.
- Batch non-urgent work: Move bulk or background jobs to an asynchronous/batch path so they don't compete with interactive traffic for the same per-minute window.
- Trim the context you resend: Agentic tools (Cline, Cursor, Claude Code) resend context each turn. Smaller context = fewer input tokens per minute. See reducing Cursor token usage.
- Spread load across more capacity: A single key shares one tier's limits. Routing requests across a larger pool of upstream capacity with automatic failover raises your effective ceiling. This is what a smart gateway does, and why the same workload sees fewer 429s through one.
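The first fix in the list, capping in-flight requests, can be sketched with `asyncio.Semaphore`. This is a minimal illustration, not a full client: `run_capped` and `fake_call` are hypothetical names, and `fake_call` only sleeps where a real Messages API request would go.

```python
import asyncio


async def run_capped(jobs, limit=4):
    """Run zero-argument coroutine functions with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(job):
        async with sem:  # waits here while `limit` jobs are already running
            return await job()

    # gather preserves input order in its result list
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def fake_call(i):
    await asyncio.sleep(0.01)  # stand-in for an API request (assumption)
    return i


if __name__ == "__main__":
    jobs = [lambda i=i: fake_call(i) for i in range(20)]
    results = asyncio.run(run_capped(jobs, limit=4))
    print(len(results))
```

A semaphore bounds concurrency, not requests per minute, so very fast calls can still exceed RPM. It does remove the large bursts that most often trigger a 429.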
429 in Claude Code specifically
Claude Code is bursty: it fires tool calls and follow-ups rapidly. If you see 429s mid-task, the cause is usually the RPM limit, not a low credit balance. Backoff is built in, but trimming auto-read files and routing through capacity with failover both help. Setup: Claude Code with a custom endpoint.
FAQ
What does a 429 error mean on the Claude API?
You exceeded a per-minute rate limit (RPM, ITPM, or OTPM) for your usage tier. The
retry-after header says how long until the window resets.
Should I retry immediately?
No. Wait for retry-after, then use exponential backoff with jitter.
Immediate retries keep you pinned against the limit.
Does a gateway remove rate limits?
No service can remove Anthropic's limits. A gateway that load-balances across more upstream capacity with failover raises the effective ceiling, so the same workload hits 429 less often.