Should I retry immediately after a 429?

No. Read the retry-after header and wait that long, then use exponential backoff with jitter for repeated failures. Immediate retries make the limit worse and can extend the cooldown.

Does prompt caching help with rate limits?

Yes. Cached prompt prefixes are billed and counted differently, lowering the input tokens that count toward ITPM, which reduces how quickly you hit the limit on repetitive context.

Fix Anthropic API 429 "Too Many Requests"

Q: What does a 429 error mean on the Claude API?

A 429 means you exceeded an Anthropic rate limit — requests per minute (RPM), input tokens per minute (ITPM), or output tokens per minute (OTPM) for your usage tier. The response includes a retry-after header telling you how long to wait.

A 429 from the Claude / Anthropic API means you crossed a rate limit — not that anything is broken. This page explains the three limits that trigger it, the right way to recover, and how to stop hitting it in the first place.

Fast answer: read the retry-after header and wait that long, then retry with exponential backoff + jitter. To stop recurring 429s: lower concurrency, enable prompt caching, batch non-urgent calls, and spread load across more upstream capacity.

The three limits behind a 429

Anthropic enforces per-minute limits that scale with your usage tier. Any one of them can return a 429:

Limit	What it caps	Typical trigger
RPM	Requests per minute	Many small calls — agent loops, autocomplete
ITPM	Input tokens per minute	Large context re-sent every turn
OTPM	Output tokens per minute	Long generations in parallel

The 429 response carries a retry-after header (seconds). Honor it — it is the server telling you exactly when the window resets.

Recover correctly: retry-after + backoff

The single most common mistake is retrying instantly in a loop, which keeps you pinned against the limit. Do this instead:

import time, random, httpx

def call_with_backoff(client, payload, max_retries=6):
    delay = 1.0
    for attempt in range(max_retries):
        r = client.post("/v1/messages", json=payload)
        if r.status_code != 429:
            return r
        # Prefer the server's retry-after; fall back to exponential backoff + jitter
        wait = float(r.headers.get("retry-after", delay))
        time.sleep(wait + random.uniform(0, 0.5))
        delay = min(delay * 2, 30)
    raise RuntimeError("Still rate-limited after retries")

Stop hitting it: five durable fixes

Lower concurrency

Cap parallel in-flight requests (e.g. a semaphore of 4–8). Bursts are what trip RPM; a steady stream rarely does.
Enable prompt caching

Cache stable prefixes (system prompt, project context). Cached input counts differently toward ITPM, so repetitive context stops eating your limit.
Batch non-urgent work

Move bulk or background jobs to an asynchronous/batch path so they don't compete with interactive traffic for the same per-minute window.
Trim the context you resend

Agentic tools (Cline, Cursor, Claude Code) resend context each turn. Smaller context = fewer input tokens per minute. See reducing Cursor token usage.
Spread load across more capacity

A single key shares one tier's limits. Routing requests across a larger pool of upstream capacity with automatic failover raises your effective ceiling — this is what a smart gateway does, and why the same workload sees fewer 429s through one.

429 in Claude Code specifically

Claude Code is bursty: it fires tool calls and follow-ups rapidly. If you see 429s mid-task, they're usually RPM, not balance. Backoff is built in, but trimming auto-read files and routing through capacity with failover both help. Setup: Claude Code with a custom endpoint.

FAQ

What does a 429 error mean on the Claude API?

You exceeded a per-minute rate limit (RPM, ITPM, or OTPM) for your usage tier. The retry-after header says how long until the window resets.

Should I retry immediately?

No — wait for retry-after, then use exponential backoff with jitter. Immediate retries extend the cooldown.

Does a gateway remove rate limits?

No service can remove Anthropic's limits. A gateway that load-balances across more upstream capacity with failover raises the effective ceiling, so the same workload hits 429 less often.

Fewer 429s — one endpoint, routed across capacity

$1 minimum top-up, pay per token, cancel anytime.

Fix Anthropic API 429 "Too Many Requests"

The three limits behind a 429

Recover correctly: retry-after + backoff

Stop hitting it: five durable fixes

Lower concurrency

Enable prompt caching

Batch non-urgent work

Trim the context you resend

Spread load across more capacity

429 in Claude Code specifically

FAQ

What does a 429 error mean on the Claude API?

Should I retry immediately?

Does a gateway remove rate limits?

Fewer 429s — one endpoint, routed across capacity

Related guides