rate_limit_error

HTTP 429 · Retryable · Account Limit

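You've exceeded your account's requests-per-minute (RPM) or tokens-per-minute (TPM) limit. Unlike overloaded_error, which signals load on the API as a whole, this error is specific to your API key/account. Reduce concurrency and retry with exponential backoff, waiting at least as long as the retry-after header says.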
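
As one way to approach backoff, here is a minimal sketch of exponential backoff with full jitter. `call_with_backoff`, `backoff_delays`, `RetryableError`, and the delay constants are hypothetical names and values chosen for illustration; in real code you would catch your SDK's rate-limit exception (the official Python SDK raises `anthropic.RateLimitError` on a 429) instead of the stand-in class.

```python
import random
import time


class RetryableError(Exception):
    """Stand-in for a 429 error (hypothetical; swap in your SDK's exception)."""


def backoff_delays(max_retries=5, base=1.0, cap=30.0):
    """Yield an upper bound for each retry's sleep: base * 2**attempt, capped."""
    for attempt in range(max_retries):
        yield min(cap, base * (2 ** attempt))


def call_with_backoff(fn, max_retries=5, base=1.0, cap=30.0, sleep=time.sleep):
    """Call fn(); on RetryableError, sleep a random time up to the bound and retry."""
    for bound in backoff_delays(max_retries, base, cap):
        try:
            return fn()
        except RetryableError:
            # Full jitter: spread retries out so concurrent clients don't sync up.
            sleep(random.uniform(0, bound))
    return fn()  # final attempt; let any error propagate
```

The `sleep` parameter is injectable so the retry logic can be exercised in tests without actually waiting.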

What the error looks like

{
  "type": "error",
  "error": {
    "type": "rate_limit_error",
    "message": "Number of request tokens has exceeded your per-minute rate limit (https://docs.anthropic.com/en/api/rate-limits); see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase."
  }
}
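
To tell this error apart from `overloaded_error` in code, one option is to inspect the nested `error.type` field of the response body. This is a minimal sketch, assuming the body has the shape shown above; `error_type` is a hypothetical helper name, not part of any SDK.

```python
import json


def error_type(body: str):
    """Return the nested error.type string, or None if the body isn't an API error."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict) or payload.get("type") != "error":
        return None
    error = payload.get("error")
    return error.get("type") if isinstance(error, dict) else None


body = '{"type": "error", "error": {"type": "rate_limit_error", "message": "..."}}'
if error_type(body) == "rate_limit_error":
    print("account limit hit: slow down and back off")
```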

Rate limit headers (read these!)

Every API response includes headers showing your current limits and usage:

anthropic-ratelimit-requests-limit: 1000
anthropic-ratelimit-requests-remaining: 247
anthropic-ratelimit-tokens-limit: 80000
anthropic-ratelimit-tokens-remaining: 12450
anthropic-ratelimit-requests-reset: 2026-05-18T22:01:00Z
anthropic-ratelimit-tokens-reset: 2026-05-18T22:00:45Z
retry-after: 23

Parse these headers proactively to throttle before hitting the limit.
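
A minimal sketch of that parsing step, assuming the `anthropic-ratelimit-*` header names from the API's rate-limits reference and a plain mapping of header names to string values. `RateLimitState`, `parse_rate_limit_headers`, `should_pause`, and the thresholds are hypothetical and purely illustrative.

```python
from dataclasses import dataclass


@dataclass
class RateLimitState:
    requests_remaining: int
    tokens_remaining: int


def parse_rate_limit_headers(headers):
    """Extract remaining request/token counts from a mapping of response headers."""
    # HTTP header names are case-insensitive, so normalize before looking up.
    lowered = {k.lower(): v for k, v in headers.items()}
    return RateLimitState(
        requests_remaining=int(lowered["anthropic-ratelimit-requests-remaining"]),
        tokens_remaining=int(lowered["anthropic-ratelimit-tokens-remaining"]),
    )


def should_pause(state, min_requests=10, min_tokens=5000):
    """True when either budget is below its (illustrative) safety threshold."""
    return state.requests_remaining < min_requests or state.tokens_remaining < min_tokens


state = parse_rate_limit_headers({
    "anthropic-ratelimit-requests-remaining": "247",
    "anthropic-ratelimit-tokens-remaining": "12450",
})
print(should_pause(state))  # False: 247 >= 10 and 12450 >= 5000
```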

Rate limit tiers (Claude API)

Limits are per-model. Verify exact limits in your Anthropic Console.
Tier                  RPM (Sonnet)   TPM (Sonnet)
Free                  5              25,000
Build ($5 spent)      1,000          80,000
Build ($100 spent)    2,000          160,000
Scale / Enterprise    Custom         Custom
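
These limits translate into a pacing budget with a little arithmetic. The sketch below assumes every request consumes roughly the same number of tokens (prompt plus requested output); `max_requests_per_minute` and `min_interval_seconds` are hypothetical helpers, and the figures are the illustrative ones from the table, not verified limits.

```python
def max_requests_per_minute(rpm_limit, tpm_limit, tokens_per_request):
    """Sustainable request rate: whichever of the RPM or TPM limit binds first."""
    return min(rpm_limit, tpm_limit // tokens_per_request)


def min_interval_seconds(rpm_limit, tpm_limit, tokens_per_request):
    """Evenly spaced gap between requests that stays inside both limits."""
    return 60.0 / max_requests_per_minute(rpm_limit, tpm_limit, tokens_per_request)


# Build ($5 spent) row: 1,000 RPM and 80,000 TPM, with ~2,000 tokens per request.
rate = max_requests_per_minute(1000, 80_000, 2000)
print(rate)                                      # 40 (TPM-bound, not RPM-bound)
print(min_interval_seconds(1000, 80_000, 2000))  # 1.5
```

With these numbers the token limit binds long before the request limit, which is why a workload can hit TPM errors while sending far fewer requests than the RPM limit allows.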

Fix: Proactive throttling (Python)

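import time

import anthropic

# The client reads ANTHROPIC_API_KEY from the environment; the SDK retries 429s itself.
client = anthropic.Anthropic(max_retries=4)

MIN_REMAINING_REQUESTS = 5  # illustrative threshold; tune for your workload

def smart_call(messages, model="claude-sonnet-4-6"):
    """Reads rate-limit headers and sleeps proactively."""
    # with_raw_response exposes the HTTP headers alongside the parsed message
    raw = client.messages.with_raw_response.create(
        model=model,
        max_tokens=1024,
        messages=messages,
    )
    remaining = int(raw.headers.get(
        "anthropic-ratelimit-requests-remaining", MIN_REMAINING_REQUESTS))
    if remaining < MIN_REMAINING_REQUESTS:
        # Nearly out of requests for this window: pause before the next call.
        time.sleep(float(raw.headers.get("retry-after", 1)))
    return raw.parse()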

# Use a semaphore to cap concurrent requests
import asyncio
import anthropic

sem = asyncio.Semaphore(5)   # max 5 concurrent requests

async def rate_limited_call(async_client, prompt):
    async with sem:
        return await async_client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            messages=[{"role": "user", "content": prompt}],
        )

async def batch_process(prompts):
    async_client = anthropic.AsyncAnthropic()
    tasks = [rate_limited_call(async_client, p) for p in prompts]
    return await asyncio.gather(*tasks, return_exceptions=True)

Fix: Token-aware throttling (TypeScript)

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({ maxRetries: 4 });

// Track remaining tokens from response headers
let remainingTokens = Infinity;

async function throttledCall(prompt: string) {
  // If we're low on tokens, pause before sending the next request
  if (remainingTokens < 5000) {
    console.log("Low on tokens, waiting 10s...");
    await new Promise(r => setTimeout(r, 10_000));
  }

  // .withResponse() returns the parsed message plus the raw HTTP response
  const { data: message, response } = await client.messages
    .create({
      model: "claude-sonnet-4-6",
      max_tokens: 1024,
      messages: [{ role: "user", content: prompt }],
    })
    .withResponse();

  // Update the budget from the headers so the next call can throttle
  const header = response.headers.get("anthropic-ratelimit-tokens-remaining");
  if (header !== null) remainingTokens = Number(header);

  return message;
}

FAQ

How do I check my current rate limits?
Go to console.anthropic.com → Settings → Limits, or read the anthropic-ratelimit-* headers on any API response.
How do I request higher rate limits?
Spend more (tier advancement is automatic above spend thresholds) or contact Anthropic sales for custom enterprise limits.
Why am I hitting TPM limits with few requests?
Long prompts or large max_tokens values consume tokens fast. Reduce prompt length, use max_tokens conservatively, or switch to a model with higher TPM limits.
Can I use the Batch API to avoid rate limits?
Yes. The Message Batches API has its own, separate limits and is 50% cheaper. Use it for workloads that can tolerate asynchronous processing, which can take up to 24 hours.
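
A minimal sketch of moving a workload to batches with the official Python SDK's `client.messages.batches.create`. `build_batch_requests`, `submit_batch`, and the `prompt-{i}` ID scheme are hypothetical helpers for illustration; only the payload builder runs here, since submitting needs an API key.

```python
def build_batch_requests(prompts, model="claude-sonnet-4-6", max_tokens=512):
    """Turn prompts into Message Batches request entries, one per prompt."""
    return [
        {
            "custom_id": f"prompt-{i}",  # used to match results back to inputs
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for i, prompt in enumerate(prompts)
    ]


def submit_batch(client, prompts):
    """Submit the prompts as one batch; client is an anthropic.Anthropic instance."""
    return client.messages.batches.create(requests=build_batch_requests(prompts))


requests = build_batch_requests(["Summarize A", "Summarize B"])
print([r["custom_id"] for r in requests])  # ['prompt-0', 'prompt-1']
```

Results arrive asynchronously, so the `custom_id` is what lets you pair each result with the prompt that produced it.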