The Claude Bible
Home / Context and cost
Level: Advanced · 10 lessons

Context and cost

Doubling effective capacity: KV-cache, masking, compaction, partitioning, quota.

Open the interactive course212 lessons, quizzes, exercises, 3 languages, free.

The four context levers

Pierre's context-optimization skill formalizes four techniques to extend the effective capacity of context without enlarging the window or changing model. Apply in this order:

  1. KV-cache optimization: keep a stable prompt prefix at the top (system prompt, rules, fixed context) so the model reuses the cache instead of recomputing everything. Don't reorder what doesn't change. Effect: faster and cheaper.
  2. Observation masking: compact bulky tool outputs. After reading a 2000-line file, you no longer need the raw 2000 lines in context: keep the useful result, mask the rest.
  3. Compaction: when usage exceeds ~70%, summarize the old context (what /compact does). Keep the substance, drop the verbatim.
  4. Context partitioning: push bulky work into isolated-context sub-agents (module 5). The main agent only sees the conclusions.

Combined result: you can sustain much longer and more complex sessions without degrading quality or blowing up cost. "More capacity" does not come from a bigger window, but from better hygiene of context.

Key points
  • 1. KV-cache: stable prefix at the top, don't reorder the fixed part
  • 2. Observation masking: compact large tool outputs
  • 3. Compaction: summarize beyond ~70% usage
  • 4. Partitioning: push the bulky into isolated sub-agents

Quota, throughput and model alternation

Cost in money is not the only ceiling. There is also throughput: Pierre's organization is capped at 4000 output tokens per minute on Opus. Claude calls are free in his setup, but this throughput throttles massively parallel or thinking-heavy workflows, which hit 429 errors (rate limit).

Concrete countermeasures he codified:

Pierre's economic rule, counter-intuitive but structuring: Claude calls are the cheap resource; only paid external services really count. So you don't hesitate to multiply Claude reads and agents for quality, you mostly watch throughput and external costs.

Key points
  • Two ceilings: cost in money AND throughput (tokens/minute)
  • 4000 tok/min on Opus at Pierre's => 429 on heavy workflows
  • Countermeasures: less parallelism, low max_tokens/effort, retries, model alternation
  • Claude calls cheap; mostly watch throughput and external costs

Audit your context budget

Every conversation with Claude runs inside a context window (the total number of tokens, roughly word-pieces, that fit in one session). Claude Code shows you live usage in its status line. When the window fills up, older content gets dropped or the model starts degrading. Knowing what eats your budget lets you trim the right things.

The biggest consumers are usually: large file reads, long conversation histories, verbose system prompts, and tool outputs that dump entire JSON responses. Use /status inside a Claude Code session to see current token usage. The flag --verbose on any claude command prints per-turn token counts.

The main levers for trimming are:

On the API side (when you call Claude programmatically), prompt caching lets you mark a stable block of context with a cache breakpoint. Anthropic stores that block server-side so you pay only 10 percent of the normal input cost on cache hits. This matters most for large system prompts or reference documents you send on every call.

Key points
  • Context window: the total tokens Claude can hold in one session.
  • /compact summarizes history to free up space without losing the thread.
  • Limit reads to relevant lines only, not whole files.
  • Prompt caching (API) cuts repeated input cost to 10 percent on cache hits.

/compact and /clear

Every message you send and every reply Claude gives consumes part of the context window (the maximum amount of text Claude can hold in memory at once). In a long coding session, that window fills up fast. Claude Code gives you two commands to manage it: /compact and /clear.

/compact summarises the current conversation into a short digest and replaces the full history with that digest. Claude keeps a working memory of what was decided, which files were changed, and what the goal is, but the raw back-and-forth is gone. Use /compact when you want to continue the same task without losing the thread.

/clear wipes the entire conversation with no summary. Claude starts completely fresh, as if you just opened a new session. Use /clear when you are switching to an unrelated task, when the current context has gone wrong and is misleading Claude, or when you simply want a clean slate.

Key points
  • Context window fills as the session grows
  • /compact summarises and continues; /clear resets completely
  • Use /compact to stay on task, /clear to switch tasks
  • Files on disk are never touched by either command

Prompt caching and the KV cache

Every time you send a message to Claude, the model processes your entire input from scratch, token by token. That is fast for short prompts, but expensive and slow when you repeat the same large context (a system prompt, a long document, a big codebase) across many calls. Prompt caching solves this by storing the processed representation of repeated content so it does not have to be recomputed.

The underlying mechanism is the KV cache (key-value cache). During inference (the act of generating a response), the model builds a table of intermediate values for every input token. Normally that table is thrown away after each call. With prompt caching enabled, Anthropic keeps the table alive on its servers for a short window, so the next call that sends the same prefix can skip the recomputation entirely.

Key facts about how the cache behaves:

The practical effect: latency drops because the model skips processing thousands of tokens, and cost drops because cached tokens are billed at the read rate. For a workflow that sends the same 20,000-token document to Claude 50 times in a session, caching can cut the input cost by over 80 percent.

Key points
  • KV cache stores intermediate token computations for 5 minutes
  • Cache-read tokens cost roughly 10 percent of normal input price
  • Cache-write surcharge applies on the first call that fills the cache
  • The cached prefix must be byte-identical to get a cache hit

The Batch API for bulk work

The Batch API is Anthropic's system for sending hundreds (or thousands) of requests at once instead of one at a time. Each group of requests is called a batch. Results are returned asynchronously, meaning you submit the work, walk away, and retrieve the output when it is ready (usually within a few minutes for a hundred requests).

The two main reasons to use the Batch API are cost and throughput. You get a 50 percent discount on all token costs compared to the standard (synchronous) API. You also get an independent rate limit, so your batch work does not compete with your real-time calls for quota headroom.

Typical use cases where the Batch API pays off:

Because requests are processed in the background, the Batch API is not suitable for anything that needs an instant reply. For interactive chat or live code assistance, use the standard API or Claude Code directly. But for work you would schedule overnight anyway, the savings are automatic.

Key points
  • 50 percent token cost discount versus synchronous API
  • Independent rate limit: batch quota does not drain real-time quota
  • Asynchronous: submit now, retrieve results later
  • Best for hundreds of identical-shaped requests run offline

Routing work to the cheapest model

Every Claude API call costs money and takes time. The three models on offer sit at very different price points: Haiku (claude-haiku-4-5) is the fastest and cheapest, Sonnet (claude-sonnet-4-6) sits in the middle, and Opus (claude-opus-4-8) is the most capable and the most expensive. Choosing the right model for each task, called model routing, is one of the highest-leverage cost controls you have.

The rule of thumb is simple: match the model to the cognitive load of the task. Reserve Opus for work that genuinely needs deep reasoning, like architecture decisions, complex debugging, or evaluating subtle tradeoffs. Everything else should go to Sonnet or Haiku first.

Tasks that are good candidates for Sonnet or Haiku:

In Claude Code, you can switch the active model at any time with /model. In API calls, set the model parameter per request, so different steps of your workflow can call different models without any extra infrastructure.

Key points
  • Opus for hard reasoning, Haiku or Sonnet for repetitive tasks
  • Model routing cuts cost without sacrificing quality where it matters
  • Claude Code: /model to switch; API: set model per request
  • Sub-agents in a pipeline are ideal Haiku or Sonnet targets

Counting tokens before you spend

Every API call has a cost determined by the number of tokens (roughly four characters per token in English) processed. Before you run an expensive batch or a long agentic loop, the Anthropic API exposes a dedicated token-counting endpoint that tells you exactly how many input tokens your request would consume, without actually generating a response and without charging you.

The endpoint is POST /v1/messages/count_tokens. You send it the same payload you would send to /v1/messages (model id, system prompt, messages array, tools), but the API returns a single JSON object containing input_tokens. Output tokens cannot be counted in advance because they depend on what the model generates, but you can cap them with the max_tokens parameter to set a hard ceiling on cost.

To estimate total cost you combine the two figures:

In Claude Code (the CLI coding agent) you can see live token and cost figures in the status line after each turn. The --max-turns flag limits agentic loops and acts as a cost governor. For one-off checks outside a loop, pipe your prompt through the SDK's client.messages.countTokens() method before committing to the full call.

Key points
  • Token-counting endpoint returns input_tokens without charging you
  • Output tokens can only be capped, not counted in advance
  • Prompt caching and the Batch API are the two main cost levers
  • Claude Code shows live cost per turn in the status line

Rate limits and surviving a 429

A rate limit is a ceiling the API enforces on how much you can send or receive in a given time window. When you hit it, the server returns HTTP status 429 Too Many Requests. In Claude Code this surfaces as an error message that pauses the current task until the window resets.

Anthropic imposes several independent limits at once: tokens of output per minute, requests per minute, and sometimes a longer rolling window (5 hours or 7 days). Hitting any one of them triggers a 429. The two most common causes are: sending many rapid requests in an automated loop, and using a high max_tokens setting or extended thinking effort that forces the model to generate very long responses.

The standard recovery strategy is exponential backoff with retries: wait a short interval (for example, 2 seconds), retry once, wait twice as long if it fails again, and so on. Most official Anthropic SDKs do this automatically with sensible defaults. In Claude Code, the CLI handles retries internally; you do not have to code them yourself.

When backoff alone is not enough, reduce the pressure on the limit directly:

Key points
  • 429 = rate limit hit, not a billing error
  • Exponential backoff: retry after progressively longer waits
  • Lower max_tokens or effort to reduce per-call token spend
  • Batch API runs on a separate quota at 50 percent discount

Observation masking

Every tool call Claude makes (reading a file, running a shell command, fetching a URL) drops its full output into the context window (the rolling buffer of text the model can see at once). If that output is large or outdated, it wastes tokens and can confuse the model by presenting stale facts alongside fresh ones. Observation masking is the practice of hiding or trimming that tool output so it no longer occupies space in the window.

Claude Code exposes this through the --hide-tool-output flag and through project-level settings. When a tool result is masked, the model still knows the tool was called and whether it succeeded, but the raw text is removed from the active window. This keeps the window lean for long sessions.

Common situations where masking helps:

The tradeoff is reduced grounding (the model having concrete evidence to reason from). Mask only output you are confident is no longer needed. If you mask too aggressively, the model may repeat work or make assumptions it should not.

Key points
  • Observation masking removes stale tool output from the active context window.
  • The model still knows a tool ran and its exit status; only the raw text is hidden.
  • Mask large, one-time outputs (installs, past grep results) you have already acted on.
  • Over-masking reduces grounding and can cause the model to repeat or guess.
Work with me

Master Claude, Claude Code and LLMs, from your first prompt to multi-agent orchestration.

Like this course? I built it end to end. Need a web app, mobile app, AI automation or SEO/GEO? Let us talk.

Contact me on LinkedInSee a site I built