Home / Context and cost

Level: Advanced · 11 lessons

Context and cost

Doubling effective capacity: KV-cache, masking, compaction, partitioning, quota.

Open the interactive course237 lessons, quizzes, exercises, a final exam with a diploma, 3 languages, free.

The four context levers

Pierre's context-optimization skill formalizes four techniques to extend the effective capacity of context without enlarging the window or changing model. Apply in this order:

KV-cache optimization: keep a stable prompt prefix at the top (system prompt, rules, fixed context) so the model reuses the cache instead of recomputing everything. Don't reorder what doesn't change. Effect: faster and cheaper.
Observation masking: compact bulky tool outputs. After reading a 2000-line file, you no longer need the raw 2000 lines in context: keep the useful result, mask the rest.
Compaction: when usage exceeds ~70%, summarize the old context (what /compact does). Keep the substance, drop the verbatim.
Context partitioning: push bulky work into isolated-context sub-agents (module 5). The main agent only sees the conclusions.

Combined result: you can sustain much longer and more complex sessions without degrading quality or blowing up cost. "More capacity" does not come from a bigger window, but from better hygiene of context.

Key points

1. KV-cache: stable prefix at the top, don't reorder the fixed part
2. Observation masking: compact large tool outputs
3. Compaction: summarize beyond ~70% usage
4. Partitioning: push the bulky into isolated sub-agents

Quota, throughput and model alternation

Cost in money is not the only ceiling. There is also throughput: Pierre's organization is capped at 4000 output tokens per minute on Opus. Claude calls are free in his setup, but this throughput throttles massively parallel or thinking-heavy workflows, which hit 429 errors (rate limit).

Concrete countermeasures he codified:

Reduce read parallelism (a setting like PLAN_READS=1 in his tools) so as not to saturate throughput.
Lower max_tokens and the thinking effort when launching many agents.
Plan retries with backoff on the 429.
Alternate models: Opus for what demands power (architecture, debug), Sonnet/Haiku for delegated repetitive work. It spreads the load and respects throughput.

Pierre's economic rule, counter-intuitive but structuring: Claude calls are the cheap resource; only paid external services really count. So you don't hesitate to multiply Claude reads and agents for quality, you mostly watch throughput and external costs.

Update, July 2026: this cap is history. On July 4, 2026 the 4000 tokens per minute limit of this case study was lifted, and measured pool limits jumped to millions of output tokens per minute, with a separate pool per model family. The method above (measure, alternate models, keep bulk work off your main pool) is the part that survives: quotas are facts with expiry dates, so re-measure yours (one 1-token call reading the anthropic-ratelimit response headers is enough) instead of trusting old numbers.

Key points

Two ceilings: cost in money AND throughput (tokens/minute)
Quotas expire: the 4000 tok/min cap of this case study was lifted in July 2026
Countermeasures: less parallelism, low max_tokens/effort, retries, model alternation
Claude calls cheap; mostly watch throughput and external costs

Audit your context budget

Every conversation with Claude runs inside a context window (the total number of tokens, roughly word-pieces, that fit in one session). Claude Code shows you live usage in its status line. When the window fills up, older content gets dropped or the model starts degrading. Knowing what eats your budget lets you trim the right things.

The biggest consumers are usually: large file reads, long conversation histories, verbose system prompts, and tool outputs that dump entire JSON responses. Use /status inside a Claude Code session to see current token usage. The flag --verbose on any claude command prints per-turn token counts.

The main levers for trimming are:

Compact the conversation: run /compact in Claude Code to summarize history in place and reclaim tokens.
Limit file reads: pass only the relevant lines rather than whole files. Use the offset and limit parameters when reading.
Trim tool output: if a search returns hundreds of matches, filter before sending results to the model.
Clear and restart: run /clear to wipe the conversation entirely when you are starting a new task from scratch.

On the API side (when you call Claude programmatically), prompt caching lets you mark a stable block of context with a cache breakpoint. Anthropic stores that block server-side so you pay only 10 percent of the normal input cost on cache hits. This matters most for large system prompts or reference documents you send on every call.

Key points

Context window: the total tokens Claude can hold in one session.
/compact summarizes history to free up space without losing the thread.
Limit reads to relevant lines only, not whole files.
Prompt caching (API) cuts repeated input cost to 10 percent on cache hits.

/compact and /clear

Every message you send and every reply Claude gives consumes part of the context window (the maximum amount of text Claude can hold in memory at once). In a long coding session, that window fills up fast. Claude Code gives you two commands to manage it: /compact and /clear.

/compact summarises the current conversation into a short digest and replaces the full history with that digest. Claude keeps a working memory of what was decided, which files were changed, and what the goal is, but the raw back-and-forth is gone. Use /compact when you want to continue the same task without losing the thread.

/clear wipes the entire conversation with no summary. Claude starts completely fresh, as if you just opened a new session. Use /clear when you are switching to an unrelated task, when the current context has gone wrong and is misleading Claude, or when you simply want a clean slate.

/compact: keeps the goal, discards the verbosity. Good for long refactoring sessions.
/clear: full reset. Good between separate features or projects.
Neither command deletes your files. They only affect what Claude remembers in this session.
After /compact, Claude may ask you to confirm the summary is accurate before continuing.

Key points

Context window fills as the session grows
/compact summarises and continues; /clear resets completely
Use /compact to stay on task, /clear to switch tasks
Files on disk are never touched by either command

Prompt caching and the KV cache

Every time you send a message to Claude, the model processes your entire input from scratch, token by token. That is fast for short prompts, but expensive and slow when you repeat the same large context (a system prompt, a long document, a big codebase) across many calls. Prompt caching solves this by storing the processed representation of repeated content so it does not have to be recomputed.

The underlying mechanism is the KV cache (key-value cache). During inference (the act of generating a response), the model builds a table of intermediate values for every input token. Normally that table is thrown away after each call. With prompt caching enabled, Anthropic keeps the table alive on its servers for a short window, so the next call that sends the same prefix can skip the recomputation entirely.

Key facts about how the cache behaves:

The cache window is 5 minutes. If your next API call arrives within 5 minutes and starts with the same prefix, you pay the lower cache-read price (roughly 10 percent of the normal input price for Opus claude-opus-4-8 and Sonnet claude-sonnet-4-6).
The first call that fills the cache pays the normal input price plus a small cache-write surcharge (about 25 percent extra), because the server has to store the result.
The cached prefix must be identical down to the byte. Even one changed character invalidates the cache for that position and everything after it.
You mark cacheable blocks explicitly in the API using a cache_control field with "type": "ephemeral". Claude Code and the Claude SDKs handle this automatically for system prompts when you use the --cache flag or the SDK default.

The practical effect: latency drops because the model skips processing thousands of tokens, and cost drops because cached tokens are billed at the read rate. For a workflow that sends the same 20,000-token document to Claude 50 times in a session, caching can cut the input cost by over 80 percent.

Key points

KV cache stores intermediate token computations for 5 minutes
Cache-read tokens cost roughly 10 percent of normal input price
Cache-write surcharge applies on the first call that fills the cache
The cached prefix must be byte-identical to get a cache hit

The Batch API for bulk work

The Batch API is Anthropic's system for sending hundreds (or thousands) of requests at once instead of one at a time. Each group of requests is called a batch. Results are returned asynchronously, meaning you submit the work, walk away, and retrieve the output when it is ready (usually within a few minutes for a hundred requests).

The two main reasons to use the Batch API are cost and throughput. You get a 50 percent discount on all token costs compared to the standard (synchronous) API. You also get an independent rate limit, so your batch work does not compete with your real-time calls for quota headroom.

Typical use cases where the Batch API pays off:

Generating a large synthetic dataset (for example, thousands of question-answer pairs for fine-tuning a model)
Running the same prompt against every row in a spreadsheet or database
Bulk translation, classification, or summarization of a document archive
Nightly evaluation runs that grade model outputs against a test set

Because requests are processed in the background, the Batch API is not suitable for anything that needs an instant reply. For interactive chat or live code assistance, use the standard API or Claude Code directly. But for work you would schedule overnight anyway, the savings are automatic.

Key points

50 percent token cost discount versus synchronous API
Independent rate limit: batch quota does not drain real-time quota
Asynchronous: submit now, retrieve results later
Best for hundreds of identical-shaped requests run offline

Routing work to the cheapest model

Every Claude API call costs money and takes time. The three models on offer sit at very different price points: Haiku (claude-haiku-4-5) is the fastest and cheapest, Sonnet (claude-sonnet-4-6) sits in the middle, and Opus (claude-opus-4-8) is the most capable and the most expensive. Choosing the right model for each task, called model routing, is one of the highest-leverage cost controls you have.

The rule of thumb is simple: match the model to the cognitive load of the task. Reserve Opus for work that genuinely needs deep reasoning, like architecture decisions, complex debugging, or evaluating subtle tradeoffs. Everything else should go to Sonnet or Haiku first.

Tasks that are good candidates for Sonnet or Haiku:

Translating text or reformatting data in bulk
Summarising long documents where precision is not critical
Classifying or labelling items in a dataset
Generating boilerplate code from a clear template
Answering FAQ-style questions with a fixed answer set
Running as a sub-agent inside a larger pipeline (routing, extraction, filtering)

In Claude Code, you can switch the active model at any time with /model. In API calls, set the model parameter per request, so different steps of your workflow can call different models without any extra infrastructure.

Key points

Opus for hard reasoning, Haiku or Sonnet for repetitive tasks
Model routing cuts cost without sacrificing quality where it matters
Claude Code: /model to switch; API: set model per request
Sub-agents in a pipeline are ideal Haiku or Sonnet targets

Counting tokens before you spend

Every API call has a cost determined by the number of tokens (roughly four characters per token in English) processed. Before you run an expensive batch or a long agentic loop, the Anthropic API exposes a dedicated token-counting endpoint that tells you exactly how many input tokens your request would consume, without actually generating a response and without charging you.

The endpoint is POST /v1/messages/count_tokens. You send it the same payload you would send to /v1/messages (model id, system prompt, messages array, tools), but the API returns a single JSON object containing input_tokens. Output tokens cannot be counted in advance because they depend on what the model generates, but you can cap them with the max_tokens parameter to set a hard ceiling on cost.

To estimate total cost you combine the two figures:

Input cost: counted tokens multiplied by the model's input price per million tokens.
Output cost: your expected (or maximum) output tokens multiplied by the output price per million tokens.
Cache savings: if you enable prompt caching (the cache_control field), repeated system-prompt tokens are stored and re-read at roughly 10 percent of the normal input price, cutting costs on long-running workflows.
Batch discount: the Batch API (/v1/messages/batches) gives a 50 percent discount on both input and output for asynchronous workloads.

In Claude Code (the CLI coding agent) you can see live token and cost figures in the status line after each turn. The --max-turns flag limits agentic loops and acts as a cost governor. For one-off checks outside a loop, pipe your prompt through the SDK's client.messages.countTokens() method before committing to the full call.

Key points

Token-counting endpoint returns input_tokens without charging you
Output tokens can only be capped, not counted in advance
Prompt caching and the Batch API are the two main cost levers
Claude Code shows live cost per turn in the status line

Rate limits and surviving a 429

A rate limit is a ceiling the API enforces on how much you can send or receive in a given time window. When you hit it, the server returns HTTP status 429 Too Many Requests. In Claude Code this surfaces as an error message that pauses the current task until the window resets.

Anthropic imposes several independent limits at once: tokens of output per minute, requests per minute, and sometimes a longer rolling window (5 hours or 7 days). Hitting any one of them triggers a 429. The two most common causes are: sending many rapid requests in an automated loop, and using a high max_tokens setting or extended thinking effort that forces the model to generate very long responses.

The standard recovery strategy is exponential backoff with retries: wait a short interval (for example, 2 seconds), retry once, wait twice as long if it fails again, and so on. Most official Anthropic SDKs do this automatically with sensible defaults. In Claude Code, the CLI handles retries internally; you do not have to code them yourself.

When backoff alone is not enough, reduce the pressure on the limit directly:

Lower max_tokens: the smaller the ceiling you set, the fewer tokens the model is allowed to emit per call, which shrinks your per-minute consumption.
Lower thinking effort (the budget_tokens parameter for extended thinking): less budget means fewer internal reasoning tokens counted against your limit.
Spread work across the Batch API: batch requests run on a separate, higher quota and cost 50 percent less.
Switch to a lighter model: claude-haiku-4-5 is faster and cheaper per token than claude-opus-4-8, so the same throughput consumes far fewer rate-limit units.

Key points

429 = rate limit hit, not a billing error
Exponential backoff: retry after progressively longer waits
Lower max_tokens or effort to reduce per-call token spend
Batch API runs on a separate quota at 50 percent discount

Observation masking

Every tool call Claude makes (reading a file, running a shell command, fetching a URL) drops its full output into the context window (the rolling buffer of text the model can see at once). If that output is large or outdated, it wastes tokens and can confuse the model by presenting stale facts alongside fresh ones. Observation masking is the practice of hiding or trimming that tool output so it no longer occupies space in the window.

Claude Code exposes this through the --hide-tool-output flag and through project-level settings. When a tool result is masked, the model still knows the tool was called and whether it succeeded, but the raw text is removed from the active window. This keeps the window lean for long sessions.

Common situations where masking helps:

A grep or find that returned hundreds of lines you already acted on.
A test run whose full stack trace is no longer relevant after the fix.
Repeated file reads of the same large file across many iterations.
Dependency-install logs that are noise once the install succeeded.

The tradeoff is reduced grounding (the model having concrete evidence to reason from). Mask only output you are confident is no longer needed. If you mask too aggressively, the model may repeat work or make assumptions it should not.

Key points

Observation masking removes stale tool output from the active context window.
The model still knows a tool ran and its exit status; only the raw text is hidden.
Mask large, one-time outputs (installs, past grep results) you have already acted on.
Over-masking reduces grounding and can cause the model to repeat or guess.

Rate limits 2026: new tiers, new tokenizer

Two mid-2026 changes force you to redo the cost math you may have built earlier in this course. The first is a rate limit change: on June 26, 2026, Anthropic raised Claude API rate limits (rate limits are the caps on how many requests or tokens per minute your API key can send) so that Sonnet and Haiku now match Opus at every usage tier. At the same time, the old four-tier system consolidated into three usage tiers (usage tiers are spend-based account levels that unlock higher limits as your billing history grows): Start, Build, and Scale. If your earlier capacity planning assumed a four-tier ladder with Sonnet or Haiku capped below Opus, that assumption is now wrong.

The second change is more fundamental: every model since Opus 4.7, including Fable 5 and Sonnet 5, uses a new tokenizer (a tokenizer is the algorithm that splits text into the numeric units, or tokens, that the model actually bills and counts against its context limit). This tokenizer produces roughly 30% more tokens for the same English text than the tokenizer used by older models. Concretely, a 1-million-token context window on these newer models holds about 555,000 words, not the larger word count you might expect from a naive "1 token equals roughly 0.75 words" rule of thumb carried over from earlier models.

That 30% shift is not cosmetic. Any number you calibrated on an older model, cost estimates per request, max_tokens settings (the parameter that caps how many tokens a single response can generate), context-window budgets for how much history you can pack into a prompt, and the token thresholds where prompt caching starts paying off, all need to be re-benchmarked against the new tokenizer. A prompt that used to cost $0.02 in input tokens on an older model can cost more on a newer one even though you sent the exact same English sentence, purely because it now tokenizes into more pieces. The fix is not a blanket "add 30%" rule either, since the ratio varies by content: code, non-English text, and dense technical writing tokenize differently than plain English prose. The only reliable method is to call the token-counting endpoint on your actual prompts with the actual model you plan to use, and read the real number back.

Earlier in this course, the running case study was a real organization whose Claude API access was capped at 4,000 output tokens per minute, a limit tight enough to force careful batching and queuing just to get useful throughput. That cap was lifted on July 4, 2026. After the lift, measured pool limits for that organization jumped to millions of tokens per minute per model family, with separate pools tracked per model rather than one shared pool. The lesson is not "quotas eventually get fixed, stop worrying." The lesson is that quotas are facts with expiry dates. A number you memorized in March can be wrong in July, in either direction, tighter or looser, and code or mental models built on the old number will silently misbehave once the real limit changes underneath it.

The fix is cheap: a single API call with max_tokens set to 1 costs almost nothing and lets you read the live limits directly. The response headers, specifically the family of headers prefixed anthropic-ratelimit-*, report your current request-per-minute and token-per-minute ceilings and how much of each you have used so far. Run that probe before you hard-code a throughput assumption into a pipeline, and re-run it periodically, because the organization in this case study demonstrates that these numbers move. Beyond just measuring, design your workflow so that hitting a real ceiling degrades gracefully, for example by queuing and retrying with backoff, rather than breaking outright, for example by crashing or silently dropping requests, because even a generous limit can still be hit under a traffic spike.

For bulk, non-latency-sensitive work, the Batch API remains the standard way to route volume to a cheaper pool with its own separate rate limits, and as of mid-2026 it supports up to 300,000 output tokens per request on Opus 4.8, Opus 4.7, Opus 4.6, and Sonnet 5 or Sonnet 4.6, when you send the beta header output-300k-2026-03-24. That combination, a much larger per-request output ceiling plus a rate-limit pool that does not compete with your interactive traffic, is the durable pattern for large offline jobs: classification sweeps, bulk summarization, or dataset generation that does not need a response in seconds.

Underneath all of these specific numbers is one durable discipline: measure, do not assume. Every fact in this lesson, three usage tiers, a 30% tokenizer shift, a lifted 4,000-token cap, a 300,000-token batch ceiling, will itself have an expiry date. The specific numbers will keep changing as Anthropic ships new models and adjusts capacity. What does not change is the method: read the rate-limit headers instead of trusting old documentation, re-run token counts on the model you are actually calling, route bulk work to the cheaper batch pool, and build workflows that bend instead of break when a ceiling moves.

Key points

Sonnet and Haiku rate limits matched Opus at every tier on June 26, 2026, and tiers consolidated from four to three: Start, Build, Scale.
Models since Opus 4.7 use a new tokenizer producing about 30% more tokens per text; re-benchmark max_tokens, cost estimates, and cache thresholds instead of assuming old numbers still hold.
Quotas expire: a real organization's 4,000-output-tokens-per-minute cap was lifted July 4, 2026, and measured limits jumped to millions per minute per model family.
Verify limits with a cheap 1-token API call reading the anthropic-ratelimit-* headers, and design workflows to degrade gracefully rather than break when a ceiling is hit.

Work with me

Need this level of execution on your project?

I am Pierre Bottazzi. I built this entire course solo, end to end: 237 lessons in 3 languages, the app, the design, the SEO, the accounts system. That is what I do for clients too: web apps, mobile apps, AI automation, SEO/GEO. First call is free, no strings attached.

Contact me on LinkedIn See sept-tools.com (industry)See totemsauvage.com (art gallery)

Inspiration

Inspired by 0xloucash

One of my inspirations. Loucash (0xloucash) has a gift for always digging up the sharpest AI tips and tricks, then turning them into setups that actually work. With InstallClaw he configures your own OpenClaw AI agent, at your place, in 48 hours.

His Instagram InstallClaw