The Claude Bible
Home / The Claude API for builders
Level: Expert · 9 lessons

The Claude API for builders

Calling Claude directly: messages, tools, streaming, batch.

Open the interactive course212 lessons, quizzes, exercises, 3 languages, free.

The Messages API

The Messages API is the core HTTP endpoint that lets any program talk to Claude. Instead of opening a chat window, your code sends a structured JSON request and receives a structured JSON response. JSON (JavaScript Object Notation) is a standard text format for exchanging data.

Every request must include three things: the model id (which Claude version to use), max_tokens (the maximum number of tokens, or word-pieces, Claude may generate in the reply), and a messages array (the conversation history as a list of role/content pairs).

The official Anthropic SDK (Software Development Kit) for Node.js wraps this HTTP call in a simple JavaScript function. Install it with npm, then write a few lines:

The model ids for June 2026 are claude-opus-4-8 (most capable), claude-sonnet-4-6 (balanced), and claude-haiku-4-5 (fastest). Start with Haiku while learning: it is cheap and instant.

Key points
  • Messages API: the HTTP endpoint your code calls to reach Claude
  • max_tokens controls the maximum length of Claude's reply
  • messages array holds the conversation as role/content pairs
  • ANTHROPIC_API_KEY must be set before any call will succeed

System, user, assistant over the API

Every call to the Claude API is built from a sequence of messages, each tagged with a role. The three roles are system, user, and assistant. The system role is a special top-level parameter (not part of the messages array) that sets persistent instructions for the entire conversation. Think of it as the briefing you give Claude before the conversation begins.

The messages array then alternates between user (the human turn) and assistant (Claude's reply). You can pre-fill this array with past turns to simulate a multi-turn conversation, or inject a partial assistant turn to steer the very first word of the response.

Why does role order matter? Claude is trained to respect the hierarchy: system instructions carry the highest weight, then the conversation history. If a user message conflicts with the system prompt, Claude follows the system prompt. This makes the system parameter the right place for rules, personas, output formats, and safety guardrails.

Key points
  • system parameter sets persistent rules for the whole request
  • messages array must alternate user and assistant roles
  • prefilling the assistant turn constrains Claude's first token
  • system outranks user when instructions conflict

Tool use over the API

The Claude API lets you give the model a list of tools (also called function definitions) it can invoke. Each tool is a JSON object describing a name, a description, and an input schema (a JSON Schema object that tells Claude what parameters the tool accepts). Claude never runs the tool itself; it returns a structured tool_use block that your code must handle.

A typical round-trip works like this:

  1. You send a messages request that includes a tools array.
  2. If Claude decides to call a tool, the response stop_reason is "tool_use" and the content contains a tool_use block with an id, the tool name, and the input object.
  3. Your code executes the real action (database query, API call, calculation), then appends a tool_result block to the conversation using the same tool_use_id.
  4. You send the updated conversation back to Claude, which reads the result and produces its final answer.

Two key design choices affect reliability. First, write the tool description as if you are explaining the function to a junior colleague: Claude uses it to decide when and whether to call the tool. Second, keep your input schema strict: mark required fields, use enum where values are fixed, and avoid vague string fields when a number or boolean is correct. Vague schemas produce vague inputs.

When you need Claude to call exactly one specific tool, set tool_choice to {"type": "tool", "name": "your_tool_name"}. The default "auto" lets Claude decide. Use "any" to force at least one tool call without specifying which one.

Key points
  • Declare tools as JSON Schema objects in the <code>tools</code> array
  • Claude returns a <code>tool_use</code> block; your code runs the action
  • Send the result back as a <code>tool_result</code> block to continue
  • Use <code>tool_choice</code> to control whether Claude must call a tool

Streaming responses

By default, the Anthropic API waits until the model finishes generating before sending anything back. Streaming changes that: the API sends each token (a word fragment, roughly 3 to 4 characters) to your client the moment it is produced, so the user sees text appearing word by word instead of waiting for the full reply.

Streaming uses the Server-Sent Events (SSE) protocol. The server keeps the HTTP connection open and pushes small event chunks down the wire. Each chunk carries a delta, which is the incremental new text to append. Your client accumulates deltas to reconstruct the full message.

To enable streaming with the Anthropic Python or Node SDK, pass stream=True (Python) or use the .stream() method (Node). The SDK exposes an async iterator so you process one chunk at a time without buffering the whole response in memory. That matters for long outputs: a 4000-token reply can start rendering in under a second instead of waiting several seconds for completion.

Key points
  • Streaming sends tokens as they are generated, not after completion.
  • Server-Sent Events (SSE) keep one HTTP connection open for all chunks.
  • Each chunk carries a delta: the new text fragment to append.
  • Final token usage counts arrive only in the last event.

Prompt caching API

Every API call re-processes every token you send. Prompt caching lets you mark stable sections of your request so Anthropic stores a compiled version on their servers. Subsequent calls that hit the same prefix skip re-processing and pay a much lower rate: roughly 10 percent of the normal input cost for cache hits, versus 125 percent for the initial write that populates the cache.

You mark a cacheable boundary by adding "cache_control": {"type": "breakpoint"} inside a content block. Claude reads your prompt top-to-bottom and caches everything up to that marker. You can place up to four breakpoints per request. The most common pattern is one breakpoint after a long system prompt or a large document you reuse across many calls.

A few rules govern when the cache is actually used:

The API response includes a usage object with cache_creation_input_tokens and cache_read_input_tokens so you can verify hits and measure savings in real time.

Key points
  • Add cache_control breakpoint to stable content blocks
  • Prefix must be 1024+ tokens to qualify
  • Cache hit costs ~10% of normal input price
  • Check usage.cache_read_input_tokens to confirm hits

The Batch API

The Batch API lets you submit up to 10,000 requests in a single call and get all the results back asynchronously (meaning you do not wait for a live reply; you check back later). In exchange for this flexibility, Anthropic charges 50% less per token than the standard real-time API.

You send a JSON file containing a list of requests, each with its own unique custom_id so you can match results to inputs. Claude processes them in the background, typically within a few minutes for hundreds of requests, though the SLA (Service Level Agreement, the official time guarantee) allows up to 24 hours.

The Batch API has its own independent rate limit, separate from the real-time API. This means heavy batch work does not eat into your interactive quota. It is ideal for any offline task: generating datasets, running evaluations, translating large catalogs, or classifying thousands of records.

Key points
  • Batch API cuts token costs by 50% for asynchronous workloads
  • Each request in a batch carries a custom_id for result matching
  • Batch rate limits are independent from real-time rate limits
  • Results arrive as a JSONL file, not a streaming response

Counting tokens

Before you send a request to Claude, you can ask the API to count exactly how many tokens (the chunks of text the model reads and writes) that request will consume. This uses the token counting endpoint: POST /v1/messages/count_tokens. It accepts the same body as a normal messages request but returns only a count, never a response, and costs nothing.

Token counts matter for two reasons. First, every model has a context window (the maximum tokens it can see at once): 200,000 for Opus and Sonnet, 200,000 for Haiku. Second, you are billed per input and output token, so over-sending wastes money and under-sending may truncate your prompt. Counting lets you stay under the limit and forecast cost before committing.

Key things you can count before sending:

For token budgeting, set a soft ceiling in your code: if input_tokens from the count endpoint exceeds, say, 150,000, truncate or summarize before sending. You can also pair counting with the max_tokens parameter (which caps output length) to control total spend per call precisely.

Key points
  • Token counting endpoint: POST /v1/messages/count_tokens
  • Context window: 200,000 tokens for Opus, Sonnet, and Haiku (as of mid-2026)
  • Count before sending to catch overflows and forecast cost
  • Use max_tokens to cap output and control spend

Model ids, pricing and migration

Every Claude model has a model id, the exact string you pass to the API to request a specific version. As of June 2026, the three current ids are claude-opus-4-8 (most capable, highest cost), claude-sonnet-4-6 (balanced performance and cost), and claude-haiku-4-5 (fastest, lowest cost). Always use the full versioned id in production code, never an alias like "claude-opus" without a version suffix, because Anthropic may silently reroute aliases to newer models and change your costs or behavior.

Choosing the right model is a cost-performance trade-off. A practical rule of thumb:

Migration means switching your codebase from an old model id to a newer one. The safe pattern is: update the model id string, run your existing eval or test suite against the new model, compare outputs on a sample of real prompts, then ship. Because newer models may refuse differently or format output differently, never migrate without a comparison step. Anthropic publishes a migration guide for each generation; check it for breaking changes in tool-call formats or context-window sizes before you switch.

Pricing is per-token (a token is roughly four characters of English text). You pay separately for input tokens (what you send) and output tokens (what the model returns). Output tokens cost more. Use prompt caching to reuse a large system prompt across calls and cut input costs by up to 90 percent on the cached portion. The Anthropic Batch API gives a 50 percent discount on both input and output at the cost of higher latency, ideal for offline dataset generation.

Key points
  • Use full versioned model ids, never bare aliases, in production.
  • Opus for quality, Sonnet for balance, Haiku for speed and volume.
  • Always run an eval comparison before migrating to a new model id.
  • Prompt caching and the Batch API are the two main cost levers.

Vision and PDF inputs

The Claude API accepts more than text. You can send images and PDF files directly in the messages array, alongside or instead of a text prompt. The model reads the visual content and reasons over it just as it would over written words. This capability is called multimodal input (multi-format, not text-only).

Images are passed as base64-encoded strings (a way of representing binary file data as plain ASCII text) inside a content block with "type": "image". You specify the media type such as image/jpeg, image/png, image/gif, or image/webp. Alternatively you can pass a public URL using "type": "image" with a "url" source instead of base64.

PDFs use "type": "document" with "media_type": "application/pdf" and the file content as base64. Claude reads the full text layer of the PDF and, when pages contain diagrams or charts, also interprets those visually. PDFs are capped at 100 pages and roughly 32 MB per file.

Key points
  • Pass images via base64 or URL in a content block with type:image
  • PDFs use type:document and media_type:application/pdf
  • Limits: 20 images per request, PDFs up to 100 pages and 32 MB
  • Vision works on all three current Claude model tiers
Work with me

Master Claude, Claude Code and LLMs, from your first prompt to multi-agent orchestration.

Like this course? I built it end to end. Need a web app, mobile app, AI automation or SEO/GEO? Let us talk.

Contact me on LinkedInSee a site I built