rate_limit_error
You've exceeded your account's requests-per-minute (RPM) or tokens-per-minute (TPM) limit. Unlike overloaded_error, this is specific to your API key/account. Reduce concurrency and implement proper backoff.
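As a rough illustration of "proper backoff", the sketch below retries with exponential delays and full jitter. The names `backoff_delays`, `call_with_backoff`, `send`, and `is_rate_limited` are hypothetical helpers made up for this example, not part of any SDK. The official SDKs already retry rate-limited requests when `max_retries` is set, so hand-rolled logic like this is mainly relevant for custom HTTP clients.

```python
import random
import time


def backoff_delays(max_retries=5, base=1.0, cap=60.0, rng=random.random):
    """Yield exponentially growing delays with full jitter, capped at `cap` seconds."""
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


def call_with_backoff(send, is_rate_limited, max_retries=5, sleep=time.sleep):
    """Call `send()`; while `is_rate_limited(exc)` is true, sleep and retry."""
    for delay in backoff_delays(max_retries):
        try:
            return send()
        except Exception as exc:
            if not is_rate_limited(exc):
                raise  # unrelated errors are not retried
            sleep(delay)
    return send()  # final attempt; any error propagates to the caller
```

Jitter spreads retries from concurrent workers so they do not all hit the API again at the same instant.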
What the error looks like
{
  "type": "error",
  "error": {
    "type": "rate_limit_error",
    "message": "Number of request tokens has exceeded your per-minute rate limit (https://docs.anthropic.com/en/api/rate-limits); see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase."
  }
}
Rate limit headers (read these!)
Every API response includes headers showing your current limits and usage:
anthropic-ratelimit-requests-limit: 1000
anthropic-ratelimit-requests-remaining: 247
anthropic-ratelimit-tokens-limit: 80000
anthropic-ratelimit-tokens-remaining: 12450
anthropic-ratelimit-requests-reset: 2026-05-18T22:01:00Z
anthropic-ratelimit-tokens-reset: 2026-05-18T22:00:45Z
retry-after: 23
Parse these headers proactively to throttle before hitting the limit.
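One possible way to turn these headers into a wait time is sketched below. It assumes the `anthropic-ratelimit-*` header naming and RFC 3339 reset timestamps, and it takes a plain dict with lowercase keys. The function name `seconds_to_wait` and the thresholds `min_tokens` and `min_requests` are illustrative choices, not values from the API.

```python
from datetime import datetime, timezone


def seconds_to_wait(headers, min_tokens=5000, min_requests=5, now=None):
    """Return how long to pause before the next request, based on rate-limit headers."""
    now = now or datetime.now(timezone.utc)
    # If the server gave an explicit retry-after, honor it directly.
    if "retry-after" in headers:
        return float(headers["retry-after"])
    wait = 0.0
    for kind, floor in (("tokens", min_tokens), ("requests", min_requests)):
        remaining = headers.get(f"anthropic-ratelimit-{kind}-remaining")
        reset = headers.get(f"anthropic-ratelimit-{kind}-reset")
        if remaining is None or reset is None:
            continue
        if int(remaining) < floor:
            # Below the threshold: wait until this budget resets.
            reset_at = datetime.fromisoformat(reset.replace("Z", "+00:00"))
            wait = max(wait, (reset_at - now).total_seconds())
    return max(wait, 0.0)
```

Calling `time.sleep(seconds_to_wait(headers))` before the next request keeps a worker just under the limit without waiting for a 429.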
Rate limit tiers (Claude API)
Limits are per-model. Verify exact limits in your Anthropic Console.
| Tier | RPM (Sonnet) | TPM (Sonnet) |
|---|---|---|
| Free | 5 | 25,000 |
| Build ($5 spent) | 1,000 | 80,000 |
| Build ($100 spent) | 2,000 | 160,000 |
| Scale / Enterprise | Custom | Custom |
Fix: Proactive throttling (Python)
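import time

import anthropic

client = anthropic.Anthropic(api_key="your-key", max_retries=4)

def smart_call(messages, model="claude-sonnet-4-6", min_tokens=5000):
    """Reads rate-limit headers and sleeps proactively when tokens run low."""
    # with_raw_response exposes the HTTP headers alongside the parsed message
    raw = client.messages.with_raw_response.create(
        model=model,
        max_tokens=1024,
        messages=messages,
    )
    # min_tokens is an illustrative threshold; tune it to your typical request size
    remaining = int(raw.headers.get("anthropic-ratelimit-tokens-remaining", min_tokens))
    if remaining < min_tokens:
        # Pause before the caller sends more; use a short default if no retry-after is present
        time.sleep(float(raw.headers.get("retry-after", 5)))
    return raw.parse()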
# Use a semaphore to cap concurrent requests
import asyncio
import anthropic

sem = asyncio.Semaphore(5)  # max 5 concurrent requests

async def rate_limited_call(async_client, prompt):
    async with sem:
        return await async_client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            messages=[{"role": "user", "content": prompt}],
        )

async def batch_process(prompts):
    async_client = anthropic.AsyncAnthropic()
    tasks = [rate_limited_call(async_client, p) for p in prompts]
    # return_exceptions=True keeps one failed request from cancelling the rest
    return await asyncio.gather(*tasks, return_exceptions=True)
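A semaphore bounds how many requests are in flight, but not how many start per minute, so fast responses can still exceed an RPM limit. A minimal sketch of pacing request starts is shown below. `MinIntervalLimiter` is a hypothetical helper written for this example, and even spacing is only one of several reasonable policies.

```python
import asyncio
import time


class MinIntervalLimiter:
    """Spaces out calls so that at most `rpm` start per minute, evenly paced."""

    def __init__(self, rpm):
        self.interval = 60.0 / rpm
        self._next_slot = 0.0

    async def wait(self):
        now = time.monotonic()
        slot = max(now, self._next_slot)
        # Reserve the slot before yielding so concurrent tasks get later slots.
        self._next_slot = slot + self.interval
        await asyncio.sleep(slot - now)
```

Calling `await limiter.wait()` just before each request, inside the semaphore block, would combine both controls.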
Fix: Token-aware throttling (TypeScript)
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({ maxRetries: 4 });

// Track remaining tokens from response headers
let remainingTokens = Infinity;

async function throttledCall(prompt: string) {
  // If we're low on tokens, pause before sending (10s is an arbitrary wait, not the actual reset time)
  if (remainingTokens < 5000) {
    console.log("Low on tokens, waiting 10s...");
    await new Promise((r) => setTimeout(r, 10_000));
  }

  // withResponse() returns the parsed message plus the raw HTTP response
  const { data, response } = await client.messages
    .create({
      model: "claude-sonnet-4-6",
      max_tokens: 1024,
      messages: [{ role: "user", content: prompt }],
    })
    .withResponse();

  // Update the counter from the rate-limit header, if present
  const header = response.headers.get("anthropic-ratelimit-tokens-remaining");
  if (header !== null) remainingTokens = Number(header);

  return data;
}
FAQ
How do I check my current rate limits?
Go to console.anthropic.com → Settings → Limits. Or read the anthropic-ratelimit-* headers on any API response.
How do I request higher rate limits?
Spend more (tier advancement is automatic above spend thresholds) or contact Anthropic sales for custom enterprise limits.
Why am I hitting TPM limits with few requests?
Long prompts or large max_tokens values consume tokens fast. Reduce prompt length, use max_tokens conservatively, or switch to a model with higher TPM limits.
Can I use the Batch API to avoid rate limits?
Yes. The Message Batches API has separate (higher) limits and is 50% cheaper. Use it for workloads that can tolerate up to 24h latency.
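To make the TPM question above concrete, here is a back-of-the-envelope estimate of how many requests fit in one minute. It assumes each request counts roughly prompt tokens plus `max_tokens` against the limit, which is a simplification since the API's real accounting may differ. The function name and the example numbers are illustrative.

```python
def max_requests_per_minute(tpm_limit, prompt_tokens, max_tokens, rpm_limit=None):
    """Estimate how many requests fit in one minute under a TPM (and optional RPM) limit."""
    per_request = prompt_tokens + max_tokens  # simplified per-request token cost
    by_tokens = tpm_limit // per_request
    return by_tokens if rpm_limit is None else min(by_tokens, rpm_limit)


# A 15,000-token prompt with max_tokens=1024 under an 80,000 TPM / 1,000 RPM limit:
print(max_requests_per_minute(80_000, 15_000, 1024, rpm_limit=1000))  # prints 4
```

Under these assumptions the token budget, not the request count, is the binding limit: only 4 such requests fit per minute even though 1,000 are allowed.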