Home / The Claude API for builders

Level: Expert · 11 lessons

The Claude API for builders

Calling Claude directly: messages, tools, streaming, batch.

Open the interactive course237 lessons, quizzes, exercises, a final exam with a diploma, 3 languages, free.

The Messages API

The Messages API is the core HTTP endpoint that lets any program talk to Claude. Instead of opening a chat window, your code sends a structured JSON request and receives a structured JSON response. JSON (JavaScript Object Notation) is a standard text format for exchanging data.

Every request must include three things: the model id (which Claude version to use), max_tokens (the maximum number of tokens, or word-pieces, Claude may generate in the reply), and a messages array (the conversation history as a list of role/content pairs).

The official Anthropic SDK (Software Development Kit) for Node.js wraps this HTTP call in a simple JavaScript function. Install it with npm, then write a few lines:

Set your API key as an environment variable called ANTHROPIC_API_KEY.
Create a client: const Anthropic = require("@anthropic-ai/sdk"); const client = new Anthropic();
Call client.messages.create({ ... }) with your model, max_tokens, and messages.
Read the reply from response.content[0].text.

The model ids for June 2026 are claude-opus-4-8 (most capable), claude-sonnet-4-6 (balanced), and claude-haiku-4-5 (fastest). Start with Haiku while learning: it is cheap and instant.

Key points

Messages API: the HTTP endpoint your code calls to reach Claude
max_tokens controls the maximum length of Claude's reply
messages array holds the conversation as role/content pairs
ANTHROPIC_API_KEY must be set before any call will succeed

System, user, assistant over the API

Every call to the Claude API is built from a sequence of messages, each tagged with a role. The three roles are system, user, and assistant. The system role is a special top-level parameter (not part of the messages array) that sets persistent instructions for the entire conversation. Think of it as the briefing you give Claude before the conversation begins.

The messages array then alternates between user (the human turn) and assistant (Claude's reply). You can pre-fill this array with past turns to simulate a multi-turn conversation, or inject a partial assistant turn to steer the very first word of the response.

Why does role order matter? Claude is trained to respect the hierarchy: system instructions carry the highest weight, then the conversation history. If a user message conflicts with the system prompt, Claude follows the system prompt. This makes the system parameter the right place for rules, personas, output formats, and safety guardrails.

system: top-level string, set once per request, never shown as a message bubble.
user: a human turn, required at least once as the final message.
assistant: Claude's previous replies, or a prefill string to constrain the next reply.
Messages must alternate user/assistant; two consecutive user turns are rejected by the API.

Key points

system parameter sets persistent rules for the whole request
messages array must alternate user and assistant roles
prefilling the assistant turn constrains Claude's first token
system outranks user when instructions conflict

Tool use over the API

The Claude API lets you give the model a list of tools (also called function definitions) it can invoke. Each tool is a JSON object describing a name, a description, and an input schema (a JSON Schema object that tells Claude what parameters the tool accepts). Claude never runs the tool itself; it returns a structured tool_use block that your code must handle.

A typical round-trip works like this:

You send a messages request that includes a tools array.
If Claude decides to call a tool, the response stop_reason is "tool_use" and the content contains a tool_use block with an id, the tool name, and the input object.
Your code executes the real action (database query, API call, calculation), then appends a tool_result block to the conversation using the same tool_use_id.
You send the updated conversation back to Claude, which reads the result and produces its final answer.

Two key design choices affect reliability. First, write the tool description as if you are explaining the function to a junior colleague: Claude uses it to decide when and whether to call the tool. Second, keep your input schema strict: mark required fields, use enum where values are fixed, and avoid vague string fields when a number or boolean is correct. Vague schemas produce vague inputs.

When you need Claude to call exactly one specific tool, set tool_choice to {"type": "tool", "name": "your_tool_name"}. The default "auto" lets Claude decide. Use "any" to force at least one tool call without specifying which one.

Key points

Declare tools as JSON Schema objects in the <code>tools</code> array
Claude returns a <code>tool_use</code> block; your code runs the action
Send the result back as a <code>tool_result</code> block to continue
Use <code>tool_choice</code> to control whether Claude must call a tool

Streaming responses

By default, the Anthropic API waits until the model finishes generating before sending anything back. Streaming changes that: the API sends each token (a word fragment, roughly 3 to 4 characters) to your client the moment it is produced, so the user sees text appearing word by word instead of waiting for the full reply.

Streaming uses the Server-Sent Events (SSE) protocol. The server keeps the HTTP connection open and pushes small event chunks down the wire. Each chunk carries a delta, which is the incremental new text to append. Your client accumulates deltas to reconstruct the full message.

To enable streaming with the Anthropic Python or Node SDK, pass stream=True (Python) or use the .stream() method (Node). The SDK exposes an async iterator so you process one chunk at a time without buffering the whole response in memory. That matters for long outputs: a 4000-token reply can start rendering in under a second instead of waiting several seconds for completion.

stream=True (Python): returns a context manager; iterate text_stream for raw text chunks.
.stream() (Node/TS): returns an async iterable; use for await to consume chunks.
Event types: message_start, content_block_delta, message_delta, message_stop.
Usage stats arrive in the final message_stop event, not up front.

Key points

Streaming sends tokens as they are generated, not after completion.
Server-Sent Events (SSE) keep one HTTP connection open for all chunks.
Each chunk carries a delta: the new text fragment to append.
Final token usage counts arrive only in the last event.

Prompt caching API

Every API call re-processes every token you send. Prompt caching lets you mark stable sections of your request so Anthropic stores a compiled version on their servers. Subsequent calls that hit the same prefix skip re-processing and pay a much lower rate: roughly 10 percent of the normal input cost for cache hits, versus 125 percent for the initial write that populates the cache.

You mark a cacheable boundary by adding "cache_control": {"type": "breakpoint"} inside a content block. Claude reads your prompt top-to-bottom and caches everything up to that marker. You can place up to four breakpoints per request. The most common pattern is one breakpoint after a long system prompt or a large document you reuse across many calls.

A few rules govern when the cache is actually used:

The prefix must be at least 1024 tokens (approximately 750 words) to qualify for caching.
Cache entries expire after five minutes of inactivity; each hit resets the timer.
The model, version, and all content before the breakpoint must be byte-identical across calls.
Supported models (June 2026): claude-opus-4-8, claude-sonnet-4-6, claude-haiku-4-5.

The API response includes a usage object with cache_creation_input_tokens and cache_read_input_tokens so you can verify hits and measure savings in real time.

Key points

Add cache_control breakpoint to stable content blocks
Prefix must be 1024+ tokens to qualify
Cache hit costs ~10% of normal input price
Check usage.cache_read_input_tokens to confirm hits

The Batch API

The Batch API lets you submit up to 10,000 requests in a single call and get all the results back asynchronously (meaning you do not wait for a live reply; you check back later). In exchange for this flexibility, Anthropic charges 50% less per token than the standard real-time API.

You send a JSON file containing a list of requests, each with its own unique custom_id so you can match results to inputs. Claude processes them in the background, typically within a few minutes for hundreds of requests, though the SLA (Service Level Agreement, the official time guarantee) allows up to 24 hours.

The Batch API has its own independent rate limit, separate from the real-time API. This means heavy batch work does not eat into your interactive quota. It is ideal for any offline task: generating datasets, running evaluations, translating large catalogs, or classifying thousands of records.

Supported models: claude-opus-4-8, claude-sonnet-4-6, claude-haiku-4-5
Max requests per batch: 10,000
Discount: 50% off input and output tokens vs. real-time pricing
Result format: one JSONL line per request, matched by custom_id
Cancellation: you can cancel a batch in flight with a single API call

Key points

Batch API cuts token costs by 50% for asynchronous workloads
Each request in a batch carries a custom_id for result matching
Batch rate limits are independent from real-time rate limits
Results arrive as a JSONL file, not a streaming response

Counting tokens

Before you send a request to Claude, you can ask the API to count exactly how many tokens (the chunks of text the model reads and writes) that request will consume. This uses the token counting endpoint: POST /v1/messages/count_tokens. It accepts the same body as a normal messages request but returns only a count, never a response, and costs nothing.

Token counts matter for two reasons. First, every model has a context window (the maximum tokens it can see at once): 200,000 for Opus and Sonnet, 200,000 for Haiku. Second, you are billed per input and output token, so over-sending wastes money and under-sending may truncate your prompt. Counting lets you stay under the limit and forecast cost before committing.

Key things you can count before sending:

System prompt alone, to understand its fixed overhead.
Tool definitions, which often surprise developers by being large.
Conversation history, to decide when to summarize or drop old turns.
Uploaded files or long documents, to verify they fit.

For token budgeting, set a soft ceiling in your code: if input_tokens from the count endpoint exceeds, say, 150,000, truncate or summarize before sending. You can also pair counting with the max_tokens parameter (which caps output length) to control total spend per call precisely.

Key points

Token counting endpoint: POST /v1/messages/count_tokens
Context window: 200,000 tokens for Opus, Sonnet, and Haiku (as of mid-2026)
Count before sending to catch overflows and forecast cost
Use max_tokens to cap output and control spend

Model ids, pricing and migration

Every Claude model has a model id, the exact string you pass to the API to request a specific version. As of June 2026, the three current ids are claude-opus-4-8 (most capable, highest cost), claude-sonnet-4-6 (balanced performance and cost), and claude-haiku-4-5 (fastest, lowest cost). Always use the full versioned id in production code, never an alias like "claude-opus" without a version suffix, because Anthropic may silently reroute aliases to newer models and change your costs or behavior.

Choosing the right model is a cost-performance trade-off. A practical rule of thumb:

Opus (claude-opus-4-8): architecture decisions, complex reasoning, long document analysis, agentic loops where quality matters most.
Sonnet (claude-sonnet-4-6): everyday coding tasks, summarization, drafting, multi-step workflows where speed and cost matter.
Haiku (claude-haiku-4-5): classification, routing, quick lookups, high-volume batch jobs where latency is critical.

Migration means switching your codebase from an old model id to a newer one. The safe pattern is: update the model id string, run your existing eval or test suite against the new model, compare outputs on a sample of real prompts, then ship. Because newer models may refuse differently or format output differently, never migrate without a comparison step. Anthropic publishes a migration guide for each generation; check it for breaking changes in tool-call formats or context-window sizes before you switch.

Pricing is per-token (a token is roughly four characters of English text). You pay separately for input tokens (what you send) and output tokens (what the model returns). Output tokens cost more. Use prompt caching to reuse a large system prompt across calls and cut input costs by up to 90 percent on the cached portion. The Anthropic Batch API gives a 50 percent discount on both input and output at the cost of higher latency, ideal for offline dataset generation.

Update, July 2026: the current lineup is Fable 5 (claude-fable-5, $10/$50 per million tokens), Opus 4.8 (claude-opus-4-8, $5/$25), Sonnet 5 (claude-sonnet-5, $3/$15, introductory $2/$10 through August 31, 2026) and Haiku 4.5 (claude-haiku-4-5-20251001, $1/$5). Opus 4.1 retires on August 5, 2026; Opus 4.7/4.6/4.5 and Sonnet 4.6/4.5 are legacy. Since the 4.6 generation, dateless model ids are pinned snapshots, not evergreen pointers. Two lessons at the end of this module cover Sonnet 5's breaking changes and Fable 5's refusal contract.

Key points

Use full versioned model ids, never bare aliases, in production.
Opus for quality, Sonnet for balance, Haiku for speed and volume.
Always run an eval comparison before migrating to a new model id.
Prompt caching and the Batch API are the two main cost levers.

Vision and PDF inputs

The Claude API accepts more than text. You can send images and PDF files directly in the messages array, alongside or instead of a text prompt. The model reads the visual content and reasons over it just as it would over written words. This capability is called multimodal input (multi-format, not text-only).

Images are passed as base64-encoded strings (a way of representing binary file data as plain ASCII text) inside a content block with "type": "image". You specify the media type such as image/jpeg, image/png, image/gif, or image/webp. Alternatively you can pass a public URL using "type": "image" with a "url" source instead of base64.

PDFs use "type": "document" with "media_type": "application/pdf" and the file content as base64. Claude reads the full text layer of the PDF and, when pages contain diagrams or charts, also interprets those visually. PDFs are capped at 100 pages and roughly 32 MB per file.

Supported image formats: JPEG, PNG, GIF, WebP.
Max image size per request: 20 MB (base64 encoded weight is about 33 percent larger than the raw file).
Up to 20 images per request on current models.
PDFs: max 100 pages, 32 MB raw. Text and visual content both parsed.
Vision is available on claude-opus-4-8, claude-sonnet-4-6, and claude-haiku-4-5.

Key points

Pass images via base64 or URL in a content block with type:image
PDFs use type:document and media_type:application/pdf
Limits: 20 images per request, PDFs up to 100 pages and 32 MB
Vision works on all three current Claude model tiers

Sonnet 5 on the API: what breaks, what wins

On June 30, 2026, Anthropic launched claude-sonnet-5, replacing Sonnet 4.6 as the mid-tier model in the Claude family. It is also the new default model on claude.ai Free and Pro plans, and in Claude Code since version 2.1.197. If you build against the API, this is the model most of your production traffic will run on unless you deliberately pin an older one.

The specs are a real step up. Sonnet 5 ships with a 1M-token context window (roughly one million tokens of combined input and conversation history) as the only size offered, no smaller variant to choose. Maximum output per request is 128,000 tokens (the max_tokens parameter, which caps how much text a single response can generate). Its knowledge cutoff (the date after which it has no training data about the world) is January 2026. Anthropic describes it as the most agentic Sonnet yet, meaning it plans and executes multi-step tool-using tasks with less hand-holding, and it now approaches Opus 4.8 quality on many coding and agentic benchmarks at a much lower price.

Pricing during the introductory period, which runs through August 31, 2026, is $2 per million tokens (MTok) input and $10 per MTok output. After that date it reverts to the standard Sonnet-tier rate of $3 input / $15 output per MTok. If you are forecasting cost for a project that spans that date, budget for the higher rate on anything running past September 1, 2026.

Three changes will break existing API code that was written for Sonnet 4.6 or earlier models, so treat this as a checklist before you flip the model string. First, adaptive thinking is on by default: unlike Opus 4.7 and 4.8, where omitting the thinking parameter runs the model without reasoning, on Sonnet 5 simply not setting thinking still triggers adaptive thinking (the model deciding on its own when and how much to reason before answering). Second, manual extended thinking is removed: sending thinking: {type: "enabled", budget_tokens: N}, the old way of giving the model a fixed reasoning token budget, now returns a 400 error (a request-rejected response) instead of being silently accepted or deprecated. Third, non-default sampling parameters are rejected: setting temperature, top_p, or top_k to anything other than their defaults returns a 400 error. These sampling knobs, which used to let you tune randomness in the model's output, are gone entirely on Sonnet-class models as of this release.

There is also a quieter but costly change: the tokenizer (the algorithm that splits text into the units the model actually counts and bills). Sonnet 5's tokenizer breaks the same input text into roughly 30% more tokens than Sonnet 4.6's tokenizer did. That means a prompt that cost you 10,000 tokens on Sonnet 4.6 might cost around 13,000 tokens on Sonnet 5, even though nothing about the text changed. Any cost estimate, context-window budget, or rate-limit calculation you built against Sonnet 4.6 needs to be re-run against Sonnet 5 rather than reused.

Alongside the launch, Anthropic tightened up the model lineup. Opus 4.1 is deprecated and will retire on August 5, 2026, so any code still targeting it needs a migration plan before that date. Opus 4.7, Opus 4.6, Opus 4.5, Sonnet 4.6, and Sonnet 4.5 are now considered legacy, meaning they remain callable but are no longer the recommended choice for new work. One naming detail worth knowing: since the 4.6 generation of models, a dateless model id (a name like claude-sonnet-5 with no date suffix) is a pinned snapshot, not an evergreen pointer that silently updates to a newer model over time. That id will keep returning the same model version indefinitely, which is good for reproducibility but means you have to actively change the string yourself to pick up a future release.

To migrate existing code to Sonnet 5, work through this checklist in order. Swap the model id string to claude-sonnet-5. Delete any temperature, top_p, top_k, and budget_tokens parameters from your request payloads, since all of them now cause errors. Re-run your token counts using the API's token-counting endpoint against real prompts, because the 30% tokenizer shift changes every estimate you had. Re-price your workload using the new $2/$10 introductory rate (or $3/$15 after August 31, 2026) rather than reusing old Sonnet 4.6 cost figures. Finally, re-test your guardrails, meaning any content filters, output-length checks, or safety logic you had tuned against the old model's behavior, since a more agentic, differently-tokenizing model can shift response patterns in ways that slip past checks calibrated on the previous version.

Key points

Sonnet 5 launched June 30, 2026: 1M context, 128K max output, Jan 2026 knowledge cutoff, intro pricing $2/$10 per MTok through August 31, 2026, then $3/$15.
Three breaking API changes: adaptive thinking defaults on, manual budget_tokens thinking is removed (400 error), and non-default temperature/top_p/top_k are rejected (400 error).
The new tokenizer produces roughly 30% more tokens for the same text than Sonnet 4.6, so re-run cost and context-budget estimates rather than reusing old numbers.
Opus 4.1 retires August 5, 2026; Opus 4.7/4.6/4.5 and Sonnet 4.6/4.5 are now legacy; dateless model ids are pinned snapshots since the 4.6 generation, not evergreen pointers.

Refusals, fallbacks and the Fable 5 contract

Claude Fable 5 launched on June 9, 2026 and was updated again at the July 1, 2026 redeployment. It changed how the Claude API handles safety declines. When one of Fable 5's dual-use safety classifiers (automated filters that check a request against policy categories like cyber or bio risk before or during generation) fires, the API does not return an error code. It returns a normal HTTP 200 with stop_reason set to "refusal", and it reports which classifier triggered. This matters for billing: a request refused before any output was produced is not billed at all. If the refusal happens mid-stream, after some tokens were already generated, those streamed tokens are billed as usual. A builder who only checks for HTTP error codes will miss every one of these events, because the request technically succeeded.

Once a refusal happens, you have three retry paths, and Anthropic recommends trying them in order. The first and preferred path is the fallbacks parameter, a beta feature on the Claude API and on Claude Platform on AWS (Anthropic's own AWS-hosted offering, distinct from Amazon Bedrock). You declare one or more fallback models in the request, and if the primary model refuses, the platform automatically retries the request against the fallback model for you, inside the same call. No extra round trip, no client-side logic. The second path is SDK middleware, available for TypeScript, Python, Go, Java, and C#. This is client-side code that intercepts a refusal and re-issues the request itself, useful when the server-side parameter is not available on your deployment target. The third path is manual handling in your own code: catching the refusal, deciding what to do, and calling the API again yourself. Each path trades convenience for control, and you should default to the first one unless you have a specific reason not to.

A detail that removes a common objection to retrying: there is a fallback credit. When you switch from one model to another mid-conversation, you normally lose your prompt cache (the discounted-rate reuse of a previously-processed prompt prefix) and pay full price to rebuild it on the new model. With the fallback credit, the cost of that cache-switch is refunded when the fallback is triggered by a refusal. This means opting into fallbacks is close to free from a cost perspective, which is why Anthropic recommends treating it as a default rather than an opt-in you have to justify.

Thinking on Fable 5 works differently from earlier Claude models. Adaptive thinking is the only mode: Fable 5 decides on its own when and how much to reason before answering, and you cannot pass a parameter to disable thinking entirely. This is a change from older extended-thinking setups where a developer set a fixed token budget for reasoning. A second change: the raw chain of thought is never returned to the caller, regardless of settings. What you can control is thinking.display, which takes two values: "summarized" gives you a readable, shortened version of the reasoning, and "omitted" (the default) gives you nothing. If your product shows users a live view of Claude's reasoning process, you must explicitly request "summarized" or the field will simply be empty.

Data handling on Fable 5 has one constraint that compliance-sensitive builders need to flag early. Fable 5, along with its sibling model Mythos 5, is classified as a Covered Model, which comes with mandatory 30-day data retention and, critically, no zero-data-retention (ZDR) option. Organizations that require ZDR for regulatory or contractual reasons (finance, healthcare, government work) cannot currently deploy Fable 5 in that mode. This is not a configuration you can flip; it is a property of the model at this stage of its rollout. Anyone architecting a Fable 5 integration for a regulated client should surface this constraint before writing a single line of integration code, not after.

On the feature side, Fable 5 supports a fairly complete set of API capabilities at launch: the effort parameter (controls how much computational effort the model spends on a task), task budgets (a beta feature for capping spend on long agentic runs), the memory tool (lets Claude persist notes across sessions), code execution, programmatic tool calling (Claude writes code that calls your tools directly instead of going through a full round trip each time), context editing, compaction (summarizing old conversation turns to save context space), and vision. Pricing is $10 per million input tokens and $50 per million output tokens, which is a premium tier compared to earlier Claude models, so builders should weigh whether a task genuinely needs Fable 5's reasoning depth or would run acceptably on a cheaper model.

The defensive pattern to build into every Fable 5 integration: always branch explicitly on stop_reason rather than assuming content is populated. Check for "end_turn" (normal completion), "max_tokens" (output was cut off because it hit the length limit), and "refusal" (a classifier blocked the request) as distinct cases, each needing different handling. Log the classifier name whenever a refusal occurs, since that is your audit trail if a customer complains their legitimate request was blocked. Finally, decide per use case whether falling back to Opus 4.8 is an acceptable outcome to serve silently to the user, or whether the request should instead surface a visible error so a human can review it. A financial-compliance tool and a casual chatbot should probably make opposite choices here.

Key points

Refusals return HTTP 200 with stop_reason "refusal" and a named classifier, not an error code; unbilled if no output was produced
Prefer the server-side fallbacks parameter first, then SDK middleware, then manual handling; a fallback credit refunds the prompt-cache cost of switching models
Fable 5 always thinks (adaptive only, cannot disable) and never returns raw chain of thought; thinking.display defaults to "omitted"
Fable 5 and Mythos 5 require 30-day data retention with no zero-data-retention option, a hard constraint for regulated builders

Work with me

Need this level of execution on your project?

I am Pierre Bottazzi. I built this entire course solo, end to end: 237 lessons in 3 languages, the app, the design, the SEO, the accounts system. That is what I do for clients too: web apps, mobile apps, AI automation, SEO/GEO. First call is free, no strings attached.

Contact me on LinkedIn See sept-tools.com (industry)See totemsauvage.com (art gallery)

Inspiration

Inspired by 0xloucash

One of my inspirations. Loucash (0xloucash) has a gift for always digging up the sharpest AI tips and tricks, then turning them into setups that actually work. With InstallClaw he configures your own OpenClaw AI agent, at your place, in 48 hours.

His Instagram InstallClaw