Fix Anthropic API 429 "Too Many Requests"

A 429 from the Claude / Anthropic API means you crossed a rate limit — not that anything is broken. This page explains the three limits that trigger it, the right way to recover, and how to stop hitting it in the first place.

Fast answer: read the retry-after header and wait that long, then retry with exponential backoff + jitter. To stop recurring 429s: lower concurrency, enable prompt caching, batch non-urgent calls, and spread load across more upstream capacity.

The three limits behind a 429

Anthropic enforces per-minute limits that scale with your usage tier. Any one of them can return a 429:

Limit | What it caps             | Typical trigger
RPM   | Requests per minute      | Many small calls (agent loops, autocomplete)
ITPM  | Input tokens per minute  | Large context re-sent every turn
OTPM  | Output tokens per minute | Long generations in parallel

The 429 response carries a retry-after header (seconds). Honor it: it is the server telling you how long to wait before retrying.
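To see which of the three limits you actually hit, you can inspect the rate-limit headers on the response. The sketch below assumes the `anthropic-ratelimit-*-remaining` header names from Anthropic's rate-limit documentation; the function names are illustrative, and the header names are worth confirming against the current docs before relying on them.

```python
from typing import Dict, List, Mapping

# Assumed header names (per Anthropic's rate-limit docs at the time of writing).
REMAINING_HEADERS = {
    "RPM": "anthropic-ratelimit-requests-remaining",
    "ITPM": "anthropic-ratelimit-input-tokens-remaining",
    "OTPM": "anthropic-ratelimit-output-tokens-remaining",
}


def remaining_budget(headers: Mapping[str, str]) -> Dict[str, int]:
    """Map each limit to its reported remaining budget, skipping absent headers."""
    lowered = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    budget = {}
    for limit, header in REMAINING_HEADERS.items():
        raw = lowered.get(header, "").strip()
        if raw.isdigit():
            budget[limit] = int(raw)
    return budget


def exhausted_limits(headers: Mapping[str, str]) -> List[str]:
    """Return the limits whose remaining budget is reported as zero."""
    return [limit for limit, left in remaining_budget(headers).items() if left == 0]
```

Logging `exhausted_limits(r.headers)` next to each 429 shows whether to attack request count, input size, or output length first.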

Recover correctly: retry-after + backoff

The single most common mistake is retrying instantly in a loop, which keeps you pinned against the limit. Do this instead:

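import random
import time

import httpx  # `client` is an httpx.Client with base_url and auth headers already set


def call_with_backoff(client: httpx.Client, payload: dict, max_retries: int = 6) -> httpx.Response:
    delay = 1.0  # fallback wait when the server sends no usable retry-after
    for attempt in range(max_retries):
        r = client.post("/v1/messages", json=payload)
        if r.status_code != 429:
            return r
        if attempt == max_retries - 1:
            break  # out of retries: don't sleep before giving up
        # Prefer the server's retry-after; fall back to exponential backoff + jitter
        try:
            wait = float(r.headers["retry-after"])
        except (KeyError, ValueError):
            wait = delay
        time.sleep(wait + random.uniform(0, 0.5))
        delay = min(delay * 2, 30)
    raise RuntimeError("Still rate-limited after retries")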
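If you want the wait calculation separate from the HTTP call (for unit tests, or to reuse it in an async client), it can be pulled into two pure functions. This is a sketch, not part of the original snippet: `parse_retry_after` and `backoff_delay` are illustrative names, and the fallback uses "full jitter" (a uniform draw up to the exponential ceiling), which is a swapped-in variant of the fixed-delay-plus-jitter approach above.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, now=None):
    """Return the retry-after wait in seconds, or None if absent or unparseable."""
    if value is None:
        return None
    try:
        return max(0.0, float(value))  # the common case: a number of seconds
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)  # HTTP also allows a date here
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def backoff_delay(attempt, retry_after=None, base=1.0, cap=30.0, rng=random.random):
    """Seconds to sleep before the next try; `attempt` is 0-based."""
    if retry_after is not None:
        # The server's hint wins; a little jitter keeps clients out of lockstep.
        return retry_after + rng() * 0.5
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling  # full jitter: uniform in [0, ceiling)
```

In a retry loop this would read `time.sleep(backoff_delay(attempt, parse_retry_after(r.headers.get("retry-after"))))`.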

Stop hitting it: five durable fixes

  1. Lower concurrency

    Cap parallel in-flight requests (e.g. a semaphore of 4–8). Bursts are what trip RPM; a steady stream rarely does.

  2. Enable prompt caching

    Cache stable prefixes (system prompt, project context). On most current models, tokens read from the cache do not count toward ITPM, so repetitive context stops eating your limit.

  3. Batch non-urgent work

    Move bulk or background jobs to an asynchronous/batch path so they don't compete with interactive traffic for the same per-minute window.

  4. Trim the context you resend

    Agentic tools (Cline, Cursor, Claude Code) resend context each turn. Smaller context = fewer input tokens per minute. See reducing Cursor token usage.

  5. Spread load across more capacity

    A single key shares one tier's limits. Routing requests across a larger pool of upstream capacity with automatic failover raises your effective ceiling — this is what a smart gateway does, and why the same workload sees fewer 429s through one.
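Fixes 1 and 2 are the easiest to show in code. The sketch below marks a stable system prompt as cacheable using the Messages API `cache_control` field, and caps in-flight requests with an `asyncio.Semaphore`. The helper names, the `send` callable (any async function that posts one payload), and the model string are placeholders, not part of any SDK; very short prefixes may fall below the minimum cacheable length, so check the prompt caching docs for your model.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


def with_cached_system(system_text: str, messages: List[Dict[str, Any]],
                       model: str, max_tokens: int = 1024) -> Dict[str, Any]:
    """Build a Messages API payload whose system prompt is marked cacheable."""
    return {
        "model": model,  # placeholder: pass whichever model ID you use
        "max_tokens": max_tokens,
        "system": [
            {
                "type": "text",
                "text": system_text,
                "cache_control": {"type": "ephemeral"},  # cache this stable prefix
            }
        ],
        "messages": messages,
    }


async def run_bounded(payloads: List[Dict[str, Any]],
                      send: Callable[[Dict[str, Any]], Awaitable[Any]],
                      limit: int = 4) -> List[Any]:
    """Send every payload, keeping at most `limit` requests in flight."""
    sem = asyncio.Semaphore(limit)

    async def one(payload: Dict[str, Any]) -> Any:
        async with sem:  # waits here while `limit` requests are already running
            return await send(payload)

    # gather preserves input order in its result list
    return await asyncio.gather(*(one(p) for p in payloads))
```

Start with a low `limit` and raise it only while 429s stay rare; the semaphore turns a burst into a steady stream without changing the calling code.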

429 in Claude Code specifically

Claude Code is bursty: it fires tool calls and follow-ups rapidly. If you see 429s mid-task, they're usually RPM, not balance. Backoff is built in, but trimming auto-read files and routing through capacity with failover both help. Setup: Claude Code with a custom endpoint.

FAQ

What does a 429 error mean on the Claude API?

You exceeded a per-minute rate limit (RPM, ITPM, or OTPM) for your usage tier. The retry-after header says how many seconds to wait before retrying.

Should I retry immediately?

No. Wait for retry-after, then use exponential backoff with jitter. Immediate retries just hit the same limit again.

Does a gateway remove rate limits?

No service can remove Anthropic's limits. A gateway that load-balances across more upstream capacity with failover raises the effective ceiling, so the same workload hits 429 less often.
