Pierre's context-optimization skill formalizes four techniques to extend the effective capacity of context without enlarging the window or changing model. Apply in this order:
KV-cache optimization: keep a stable prompt prefix at the top (system prompt, rules, fixed context) so the model reuses the cache instead of recomputing everything. Don't reorder what doesn't change. Effect: faster and cheaper.
Observation masking: compact bulky tool outputs. After reading a 2000-line file, you no longer need the raw 2000 lines in context: keep the useful result, mask the rest.
Compaction: when usage exceeds ~70%, summarize the old context (what /compact does). Keep the substance, drop the verbatim.
Context partitioning: push bulky work into isolated-context sub-agents (module 5). The main agent only sees the conclusions.
Combined result: you can sustain much longer and more complex sessions without degrading quality or blowing up cost. "More capacity" does not come from a bigger window, but from better hygiene of context.
Key points
1. KV-cache: stable prefix at the top, don't reorder the fixed part
2. Observation masking: compact large tool outputs
3. Compaction: summarize beyond ~70% usage
4. Partitioning: push the bulky into isolated sub-agents
Quota, throughput and model alternation
Cost in money is not the only ceiling. There is also throughput: Pierre's organization is capped at 4000 output tokens per minute on Opus. Claude calls are free in his setup, but this throughput throttles massively parallel or thinking-heavy workflows, which hit 429 errors (rate limit).
Concrete countermeasures he codified:
Reduce read parallelism (a setting like PLAN_READS=1 in his tools) so as not to saturate throughput.
Lower max_tokens and the thinking effort when launching many agents.
Plan retries with backoff on the 429.
Alternate models: Opus for what demands power (architecture, debug), Sonnet/Haiku for delegated repetitive work. It spreads the load and respects throughput.
Pierre's economic rule, counter-intuitive but structuring: Claude calls are the cheap resource; only paid external services really count. So you don't hesitate to multiply Claude reads and agents for quality, you mostly watch throughput and external costs.
Key points
Two ceilings: cost in money AND throughput (tokens/minute)
4000 tok/min on Opus at Pierre's => 429 on heavy workflows
Countermeasures: less parallelism, low max_tokens/effort, retries, model alternation
Claude calls cheap; mostly watch throughput and external costs
Audit your context budget
Every conversation with Claude runs inside a context window (the total number of tokens, roughly word-pieces, that fit in one session). Claude Code shows you live usage in its status line. When the window fills up, older content gets dropped or the model starts degrading. Knowing what eats your budget lets you trim the right things.
The biggest consumers are usually: large file reads, long conversation histories, verbose system prompts, and tool outputs that dump entire JSON responses. Use /status inside a Claude Code session to see current token usage. The flag --verbose on any claude command prints per-turn token counts.
The main levers for trimming are:
Compact the conversation: run /compact in Claude Code to summarize history in place and reclaim tokens.
Limit file reads: pass only the relevant lines rather than whole files. Use the offset and limit parameters when reading.
Trim tool output: if a search returns hundreds of matches, filter before sending results to the model.
Clear and restart: run /clear to wipe the conversation entirely when you are starting a new task from scratch.
On the API side (when you call Claude programmatically), prompt caching lets you mark a stable block of context with a cache breakpoint. Anthropic stores that block server-side so you pay only 10 percent of the normal input cost on cache hits. This matters most for large system prompts or reference documents you send on every call.
Key points
Context window: the total tokens Claude can hold in one session.
/compact summarizes history to free up space without losing the thread.
Limit reads to relevant lines only, not whole files.
Prompt caching (API) cuts repeated input cost to 10 percent on cache hits.
/compact and /clear
Every message you send and every reply Claude gives consumes part of the context window (the maximum amount of text Claude can hold in memory at once). In a long coding session, that window fills up fast. Claude Code gives you two commands to manage it: /compact and /clear.
/compact summarises the current conversation into a short digest and replaces the full history with that digest. Claude keeps a working memory of what was decided, which files were changed, and what the goal is, but the raw back-and-forth is gone. Use /compact when you want to continue the same task without losing the thread.
/clear wipes the entire conversation with no summary. Claude starts completely fresh, as if you just opened a new session. Use /clear when you are switching to an unrelated task, when the current context has gone wrong and is misleading Claude, or when you simply want a clean slate.
/compact: keeps the goal, discards the verbosity. Good for long refactoring sessions.
/clear: full reset. Good between separate features or projects.
Neither command deletes your files. They only affect what Claude remembers in this session.
After /compact, Claude may ask you to confirm the summary is accurate before continuing.
Key points
Context window fills as the session grows
/compact summarises and continues; /clear resets completely
Use /compact to stay on task, /clear to switch tasks
Files on disk are never touched by either command
Prompt caching and the KV cache
Every time you send a message to Claude, the model processes your entire input from scratch, token by token. That is fast for short prompts, but expensive and slow when you repeat the same large context (a system prompt, a long document, a big codebase) across many calls. Prompt caching solves this by storing the processed representation of repeated content so it does not have to be recomputed.
The underlying mechanism is the KV cache (key-value cache). During inference (the act of generating a response), the model builds a table of intermediate values for every input token. Normally that table is thrown away after each call. With prompt caching enabled, Anthropic keeps the table alive on its servers for a short window, so the next call that sends the same prefix can skip the recomputation entirely.
Key facts about how the cache behaves:
The cache window is 5 minutes. If your next API call arrives within 5 minutes and starts with the same prefix, you pay the lower cache-read price (roughly 10 percent of the normal input price for Opus claude-opus-4-8 and Sonnet claude-sonnet-4-6).
The first call that fills the cache pays the normal input price plus a small cache-write surcharge (about 25 percent extra), because the server has to store the result.
The cached prefix must be identical down to the byte. Even one changed character invalidates the cache for that position and everything after it.
You mark cacheable blocks explicitly in the API using a cache_control field with "type": "ephemeral". Claude Code and the Claude SDKs handle this automatically for system prompts when you use the --cache flag or the SDK default.
The practical effect: latency drops because the model skips processing thousands of tokens, and cost drops because cached tokens are billed at the read rate. For a workflow that sends the same 20,000-token document to Claude 50 times in a session, caching can cut the input cost by over 80 percent.
Key points
KV cache stores intermediate token computations for 5 minutes
Cache-read tokens cost roughly 10 percent of normal input price
Cache-write surcharge applies on the first call that fills the cache
The cached prefix must be byte-identical to get a cache hit
The Batch API for bulk work
The Batch API is Anthropic's system for sending hundreds (or thousands) of requests at once instead of one at a time. Each group of requests is called a batch. Results are returned asynchronously, meaning you submit the work, walk away, and retrieve the output when it is ready (usually within a few minutes for a hundred requests).
The two main reasons to use the Batch API are cost and throughput. You get a 50 percent discount on all token costs compared to the standard (synchronous) API. You also get an independent rate limit, so your batch work does not compete with your real-time calls for quota headroom.
Typical use cases where the Batch API pays off:
Generating a large synthetic dataset (for example, thousands of question-answer pairs for fine-tuning a model)
Running the same prompt against every row in a spreadsheet or database
Bulk translation, classification, or summarization of a document archive
Nightly evaluation runs that grade model outputs against a test set
Because requests are processed in the background, the Batch API is not suitable for anything that needs an instant reply. For interactive chat or live code assistance, use the standard API or Claude Code directly. But for work you would schedule overnight anyway, the savings are automatic.
Key points
50 percent token cost discount versus synchronous API
Independent rate limit: batch quota does not drain real-time quota
Asynchronous: submit now, retrieve results later
Best for hundreds of identical-shaped requests run offline
Routing work to the cheapest model
Every Claude API call costs money and takes time. The three models on offer sit at very different price points: Haiku (claude-haiku-4-5) is the fastest and cheapest, Sonnet (claude-sonnet-4-6) sits in the middle, and Opus (claude-opus-4-8) is the most capable and the most expensive. Choosing the right model for each task, called model routing, is one of the highest-leverage cost controls you have.
The rule of thumb is simple: match the model to the cognitive load of the task. Reserve Opus for work that genuinely needs deep reasoning, like architecture decisions, complex debugging, or evaluating subtle tradeoffs. Everything else should go to Sonnet or Haiku first.
Tasks that are good candidates for Sonnet or Haiku:
Translating text or reformatting data in bulk
Summarising long documents where precision is not critical
Classifying or labelling items in a dataset
Generating boilerplate code from a clear template
Answering FAQ-style questions with a fixed answer set
Running as a sub-agent inside a larger pipeline (routing, extraction, filtering)
In Claude Code, you can switch the active model at any time with /model. In API calls, set the model parameter per request, so different steps of your workflow can call different models without any extra infrastructure.
Key points
Opus for hard reasoning, Haiku or Sonnet for repetitive tasks
Model routing cuts cost without sacrificing quality where it matters
Claude Code: /model to switch; API: set model per request
Sub-agents in a pipeline are ideal Haiku or Sonnet targets
Counting tokens before you spend
Every API call has a cost determined by the number of tokens (roughly four characters per token in English) processed. Before you run an expensive batch or a long agentic loop, the Anthropic API exposes a dedicated token-counting endpoint that tells you exactly how many input tokens your request would consume, without actually generating a response and without charging you.
The endpoint is POST /v1/messages/count_tokens. You send it the same payload you would send to /v1/messages (model id, system prompt, messages array, tools), but the API returns a single JSON object containing input_tokens. Output tokens cannot be counted in advance because they depend on what the model generates, but you can cap them with the max_tokens parameter to set a hard ceiling on cost.
To estimate total cost you combine the two figures:
Input cost: counted tokens multiplied by the model's input price per million tokens.
Output cost: your expected (or maximum) output tokens multiplied by the output price per million tokens.
Cache savings: if you enable prompt caching (the cache_control field), repeated system-prompt tokens are stored and re-read at roughly 10 percent of the normal input price, cutting costs on long-running workflows.
Batch discount: the Batch API (/v1/messages/batches) gives a 50 percent discount on both input and output for asynchronous workloads.
In Claude Code (the CLI coding agent) you can see live token and cost figures in the status line after each turn. The --max-turns flag limits agentic loops and acts as a cost governor. For one-off checks outside a loop, pipe your prompt through the SDK's client.messages.countTokens() method before committing to the full call.
Key points
Token-counting endpoint returns input_tokens without charging you
Output tokens can only be capped, not counted in advance
Prompt caching and the Batch API are the two main cost levers
Claude Code shows live cost per turn in the status line
Rate limits and surviving a 429
A rate limit is a ceiling the API enforces on how much you can send or receive in a given time window. When you hit it, the server returns HTTP status 429 Too Many Requests. In Claude Code this surfaces as an error message that pauses the current task until the window resets.
Anthropic imposes several independent limits at once: tokens of output per minute, requests per minute, and sometimes a longer rolling window (5 hours or 7 days). Hitting any one of them triggers a 429. The two most common causes are: sending many rapid requests in an automated loop, and using a high max_tokens setting or extended thinking effort that forces the model to generate very long responses.
The standard recovery strategy is exponential backoff with retries: wait a short interval (for example, 2 seconds), retry once, wait twice as long if it fails again, and so on. Most official Anthropic SDKs do this automatically with sensible defaults. In Claude Code, the CLI handles retries internally; you do not have to code them yourself.
When backoff alone is not enough, reduce the pressure on the limit directly:
Lower max_tokens: the smaller the ceiling you set, the fewer tokens the model is allowed to emit per call, which shrinks your per-minute consumption.
Lower thinking effort (the budget_tokens parameter for extended thinking): less budget means fewer internal reasoning tokens counted against your limit.
Spread work across the Batch API: batch requests run on a separate, higher quota and cost 50 percent less.
Switch to a lighter model: claude-haiku-4-5 is faster and cheaper per token than claude-opus-4-8, so the same throughput consumes far fewer rate-limit units.
Key points
429 = rate limit hit, not a billing error
Exponential backoff: retry after progressively longer waits
Lower max_tokens or effort to reduce per-call token spend
Batch API runs on a separate quota at 50 percent discount
Observation masking
Every tool call Claude makes (reading a file, running a shell command, fetching a URL) drops its full output into the context window (the rolling buffer of text the model can see at once). If that output is large or outdated, it wastes tokens and can confuse the model by presenting stale facts alongside fresh ones. Observation masking is the practice of hiding or trimming that tool output so it no longer occupies space in the window.
Claude Code exposes this through the --hide-tool-output flag and through project-level settings. When a tool result is masked, the model still knows the tool was called and whether it succeeded, but the raw text is removed from the active window. This keeps the window lean for long sessions.
Common situations where masking helps:
A grep or find that returned hundreds of lines you already acted on.
A test run whose full stack trace is no longer relevant after the fix.
Repeated file reads of the same large file across many iterations.
Dependency-install logs that are noise once the install succeeded.
The tradeoff is reduced grounding (the model having concrete evidence to reason from). Mask only output you are confident is no longer needed. If you mask too aggressively, the model may repeat work or make assumptions it should not.
Key points
Observation masking removes stale tool output from the active context window.
The model still knows a tool ran and its exit status; only the raw text is hidden.
Mask large, one-time outputs (installs, past grep results) you have already acted on.
Over-masking reduces grounding and can cause the model to repeat or guess.
Work with me
Master Claude, Claude Code and LLMs, from your first prompt to multi-agent orchestration.
Like this course? I built it end to end. Need a web app, mobile app, AI automation or SEO/GEO? Let us talk.