The Messages API is the core HTTP endpoint that lets any program talk to Claude. Instead of opening a chat window, your code sends a structured JSON request and receives a structured JSON response. JSON (JavaScript Object Notation) is a standard text format for exchanging data.
Every request must include three things: the model id (which Claude version to use), max_tokens (the maximum number of tokens, or word-pieces, Claude may generate in the reply), and a messages array (the conversation history as a list of role/content pairs).
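To make the three required fields concrete, here is a rough sketch of the raw HTTP request, assembled but not sent. The endpoint URL and the x-api-key and anthropic-version headers follow Anthropic's public HTTP API as best understood here; the helper name buildMessagesRequest, the placeholder key, and the prompt are illustrative assumptions.

```javascript
// Sketch: assemble (but do not send) a raw Messages API request.
// buildMessagesRequest is a hypothetical helper name, not part of any SDK.
function buildMessagesRequest(apiKey, prompt) {
  return {
    url: "https://api.anthropic.com/v1/messages",
    options: {
      method: "POST",
      headers: {
        "x-api-key": apiKey,               // your secret API key
        "anthropic-version": "2023-06-01", // API version header
        "content-type": "application/json",
      },
      body: JSON.stringify({
        model: "claude-haiku-4-5", // which Claude version to use
        max_tokens: 256,           // upper bound on generated tokens
        messages: [{ role: "user", content: prompt }], // conversation history
      }),
    },
  };
}

// The key below is a placeholder, not a working credential.
const request = buildMessagesRequest("sk-ant-placeholder", "Say hello in French.");
console.log(request.options.body);
```

Passing request.url and request.options to fetch would send the call; the SDK described next does this wiring for you.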
The official Anthropic SDK (Software Development Kit) for Node.js wraps this HTTP call in a simple JavaScript function. Install it with npm (npm install @anthropic-ai/sdk), then write a few lines:
Set your API key as an environment variable called ANTHROPIC_API_KEY.
Create a client: const Anthropic = require("@anthropic-ai/sdk"); const client = new Anthropic();
Call client.messages.create({ ... }) with your model, max_tokens, and messages.
Read the reply from response.content[0].text.
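The steps above can be sketched as follows. This is a minimal illustration, not the only way to structure the call: the helper names buildParams, extractText, and ask are invented here, and the client is passed in as an argument so the sketch can run offline against a stand-in object.

```javascript
// In real code you would create the client as described above:
//   const Anthropic = require("@anthropic-ai/sdk");
//   const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Build the three required fields for a single-turn request.
function buildParams(prompt) {
  return {
    model: "claude-haiku-4-5",
    max_tokens: 300,
    messages: [{ role: "user", content: prompt }],
  };
}

// Pull the reply text out of the first content block, if it is a text block.
function extractText(response) {
  const first = response.content[0];
  return first && first.type === "text" ? first.text : "";
}

// ask is a hypothetical helper: send one prompt, return the reply text.
async function ask(client, prompt) {
  const response = await client.messages.create(buildParams(prompt));
  return extractText(response);
}

// Stand-in client so the sketch runs without network access.
// It mimics only the response fields used above.
const fakeClient = {
  messages: {
    create: async (params) => ({
      content: [{ type: "text", text: `echo: ${params.messages[0].content}` }],
    }),
  },
};

ask(fakeClient, "ping").then((text) => console.log(text));
```

Swapping fakeClient for a real client is the only change needed to reach the API.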
The model ids for June 2026 are claude-opus-4-8 (most capable), claude-sonnet-4-6 (balanced), and claude-haiku-4-5 (fastest). Start with Haiku while learning: it is the cheapest of the three and responds quickly.
Key points
Messages API: the HTTP endpoint your code calls to reach Claude
max_tokens controls the maximum length of Claude's reply
messages array holds the conversation as role/content pairs
ANTHROPIC_API_KEY must be set before any call will succeed
System, user, assistant over the API
Every call to the Claude API is built from a sequence of messages, each tagged with a role. The three roles are system, user, and assistant. The system role is a special top-level parameter (not part of the messages array) that sets persistent instructions for the entire conversation. Think of it as the briefing you give Claude before the conversation begins.
The messages array then alternates between user (the human turn) and assistant (Claude's reply). You can pre-fill this array with past turns to simulate a multi-turn conversation, or inject a partial assistant turn to steer the very first word of the response.
Why does role order matter? Claude is trained to respect a hierarchy: system instructions carry the highest weight, then the conversation history. If a user message conflicts with the system prompt, Claude generally follows the system prompt, although this is a strong tendency rather than a hard guarantee. This makes the system parameter the right place for rules, personas, output formats, and safety guardrails.
system: top-level string, set once per request, never shown as a message bubble.
user: a human turn, required at least once; it is the final message unless you end the array with an assistant prefill.
assistant: Claude's previous replies, or a prefill string to constrain the next reply.
Messages should alternate user/assistant; consecutive turns with the same role are combined into a single turn by the API rather than kept separate.
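A small sketch of how the three roles fit into one request, assuming the chosen model accepts a prefilled assistant turn. The helper name buildRequest, its argument names, and the sample conversation are all invented for illustration.

```javascript
// Sketch: assemble a request that uses system, user, and assistant.
// buildRequest and its argument names are hypothetical.
function buildRequest({ system, history, userMessage, prefill }) {
  const messages = [...history, { role: "user", content: userMessage }];
  if (prefill) {
    // A trailing assistant turn is a prefill: Claude continues from this text.
    messages.push({ role: "assistant", content: prefill });
  }
  return {
    model: "claude-haiku-4-5",
    max_tokens: 200,
    system,   // top-level parameter, not an entry in messages
    messages, // past turns, the new user turn, and the optional prefill
  };
}

const req = buildRequest({
  system: "You are a terse assistant. Reply in JSON only.",
  history: [
    { role: "user", content: "Name a primary color." },
    { role: "assistant", content: '{"color": "red"}' },
  ],
  userMessage: "Name another one.",
  prefill: "{",
});

console.log(req.messages.map((m) => m.role).join(" > "));
// prints "user > assistant > user > assistant"
```

Because the prefill is "{", the reply is steered toward continuing a JSON object; your code would prepend that character when parsing the result.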
Key points
system parameter sets persistent rules for the whole request
messages array should alternate user and assistant roles
prefilling the assistant turn constrains Claude's first token
system outranks user when instructions conflict
Tool use over the API
The Claude API lets you give the model a list of tools (also called function definitions) it can invoke. Each tool is a JSON object describing a name, a description, and an input schema (a JSON Schema object that tells Claude what parameters the tool accepts). Claude never runs the tool itself; it returns a structured tool_use block that your code must handle.
A typical round-trip works like this:
You send a messages request that includes a tools array.
If Claude decides to call a tool, the response stop_reason is "tool_use" and the content contains a tool_use block with an id, the tool name, and the input object.
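Your code executes the real action (database query, API call, calculation), then appends two messages to the conversation: Claude's assistant turn containing the tool_use block, followed by a user turn containing a tool_result block whose tool_use_id matches the id of that tool_use block.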
You send the updated conversation back to Claude, which reads the result and produces its final answer.
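The middle of that round-trip can be sketched as a pure function that turns Claude's tool_use blocks into the messages to append before the next call. This is an illustration under assumptions: buildToolFollowUp, the handlers map, the add tool, and the sample response values are all invented, and real handlers would usually be asynchronous.

```javascript
// Sketch: convert a "tool_use" response into the follow-up messages.
// handlers maps tool names to local functions (hypothetical structure).
function buildToolFollowUp(response, handlers) {
  const toolUses = response.content.filter((block) => block.type === "tool_use");
  const results = toolUses.map((block) => {
    const handler = handlers[block.name];
    if (!handler) {
      // Report unknown tools back to Claude instead of throwing.
      return {
        type: "tool_result",
        tool_use_id: block.id,
        content: `Unknown tool: ${block.name}`,
        is_error: true,
      };
    }
    return {
      type: "tool_result",
      tool_use_id: block.id, // must match the id from the tool_use block
      content: String(handler(block.input)),
    };
  });
  return [
    { role: "assistant", content: response.content }, // echo Claude's turn
    { role: "user", content: results },               // results travel in a user turn
  ];
}

// A response shaped like a stop_reason "tool_use" reply (values invented).
const sampleResponse = {
  stop_reason: "tool_use",
  content: [
    { type: "text", text: "Let me add those." },
    { type: "tool_use", id: "toolu_01", name: "add", input: { a: 2, b: 3 } },
  ],
};

const followUp = buildToolFollowUp(sampleResponse, { add: ({ a, b }) => a + b });
console.log(JSON.stringify(followUp[1]));
```

Appending followUp to the existing messages array and calling the API again completes the loop.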
Two key design choices affect reliability. First, write the tool description as if you are explaining the function to a junior colleague: Claude uses it to decide when and whether to call the tool. Second, keep your input schema strict: mark required fields, use enum where values are fixed, and avoid vague string fields when a number or boolean is correct. Vague schemas produce vague inputs.
When you need Claude to call exactly one specific tool, set tool_choice to {"type": "tool", "name": "your_tool_name"}. The default, {"type": "auto"}, lets Claude decide. Use {"type": "any"} to force at least one tool call without specifying which one.
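As an illustration of both design choices, here is a sketch of a strict tool definition together with the tool_choice forms. The get_weather tool, its fields, and the prompt are invented for this example.

```javascript
// Sketch: a strict tool definition. get_weather is a made-up tool.
const getWeatherTool = {
  name: "get_weather",
  // Written like an explanation to a colleague: what it does, when to use it.
  description:
    "Look up the current weather for a city. Use this whenever the user asks " +
    "about present conditions; do not use it for forecasts or historical data.",
  input_schema: {
    type: "object",
    properties: {
      city: { type: "string", description: "City name, e.g. Lyon" },
      unit: { type: "string", enum: ["celsius", "fahrenheit"] }, // fixed values
    },
    required: ["city", "unit"], // both fields are mandatory
  },
};

// The tool_choice forms described above.
const toolChoices = {
  auto: { type: "auto" },                              // Claude decides (default)
  any: { type: "any" },                                // must call some tool
  forced: { type: "tool", name: getWeatherTool.name }, // must call this tool
};

const params = {
  model: "claude-haiku-4-5",
  max_tokens: 300,
  tools: [getWeatherTool],
  tool_choice: toolChoices.forced,
  messages: [{ role: "user", content: "What is the weather in Lyon, in celsius?" }],
};

console.log(JSON.stringify(params.tool_choice));
```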
Key points
Declare tools as JSON Schema objects in the tools array
Claude returns a tool_use block; your code runs the action
Send the result back as a tool_result block to continue
Use tool_choice to control whether Claude must call a tool
Streaming responses
By default, the Anthropic API waits until the model finishes generating before sending anything back. Streaming changes that: the API sends each token (a word fragment, roughly 3 to 4 characters) to your client the moment it is produced, so the user sees text appearing word by word instead of waiting for the full reply.
Streaming uses the Server-Sent Events (SSE) protocol. The server keeps the HTTP connection open and pushes small event chunks down the wire. Each chunk carries a delta, which is the incremental new text to append. Your client accumulates deltas to reconstruct the full message.
To enable streaming with the Anthropic Python or Node SDK, either pass stream=True (Python) or stream: true (Node) to messages.create, or use the messages.stream() helper that both SDKs provide. Either way, you iterate over chunks as they arrive and process one at a time instead of waiting for the whole response. That matters for long outputs: a 4000-token reply can start rendering in under a second instead of waiting several seconds for completion.
messages.stream() (Python): returns a context manager; iterate text_stream for raw text chunks. Passing stream=True to messages.create returns an iterator of raw events instead.
.stream() (Node/TS): returns an async iterable; use for await to consume chunks.
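Usage stats are split across events: message_start reports input tokens, and the final output token count arrives in the last message_delta event, just before message_stop.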
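The accumulation step can be sketched without a network connection by feeding a function a list of already-parsed events. The event type names follow the streaming format as best understood here; the accumulate helper and the sample values are invented. In this sketch, input usage is read from message_start and the output count from message_delta.

```javascript
// Sketch: rebuild a message from a sequence of parsed stream events.
function accumulate(events) {
  let text = "";
  const usage = { input_tokens: 0, output_tokens: 0 };
  for (const event of events) {
    if (event.type === "message_start") {
      // Input token count is known before generation starts.
      usage.input_tokens = event.message.usage.input_tokens;
    } else if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
      // Each delta carries only the new fragment to append.
      text += event.delta.text;
    } else if (event.type === "message_delta" && event.usage) {
      // Output token count is reported near the end of the stream.
      usage.output_tokens = event.usage.output_tokens;
    }
  }
  return { text, usage };
}

// Invented sample events in the order a stream would deliver them.
const sampleEvents = [
  { type: "message_start", message: { usage: { input_tokens: 12, output_tokens: 1 } } },
  { type: "content_block_start", index: 0, content_block: { type: "text", text: "" } },
  { type: "content_block_delta", index: 0, delta: { type: "text_delta", text: "Hello" } },
  { type: "content_block_delta", index: 0, delta: { type: "text_delta", text: ", world" } },
  { type: "content_block_stop", index: 0 },
  { type: "message_delta", delta: { stop_reason: "end_turn" }, usage: { output_tokens: 4 } },
  { type: "message_stop" },
];

const rebuilt = accumulate(sampleEvents);
console.log(rebuilt.text, rebuilt.usage);
```

With the Node SDK, the same branching could sit inside a for await loop over the stream; that wiring is left out so the sketch runs offline.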
Key points
Streaming sends tokens as they are generated, not after completion.
Server-Sent Events (SSE) keep one HTTP connection open for all chunks.
Each chunk carries a delta: the new text fragment to append.
Final output token counts arrive near the end of the stream, in the last message_delta event.
Prompt caching API
Every API call re-processes every token you send. Prompt caching lets you mark stable sections of your request so Anthropic stores a compiled version on their servers. Subsequent calls that hit the same prefix skip re-processing and pay a much lower rate: roughly 10 percent of the normal input cost for cache hits, versus 125 percent for the initial write that populates the cache.
You mark a cacheable boundary by adding "cache_control": {"type": "ephemeral"} inside a content block; this marker is called a cache breakpoint. The API processes your request in order (tools, then system, then messages) and caches everything up to that marker. You can place up to four breakpoints per request. The most common pattern is one breakpoint after a long system prompt or a large document you reuse across many calls.
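A minimal sketch of that common pattern, assuming the "ephemeral" cache type that the API uses for these markers. The helper name buildCachedRequest and the repeated filler text standing in for a long document are illustrative only.

```javascript
// Sketch: mark a long, stable system prompt as cacheable.
// longReference stands in for a large document reused across calls.
const longReference = "Product manual section. ".repeat(400);

function buildCachedRequest(question) {
  return {
    model: "claude-sonnet-4-6",
    max_tokens: 500,
    system: [
      { type: "text", text: "Answer using only the manual below." },
      {
        type: "text",
        text: longReference,
        // Breakpoint: everything up to and including this block is the cached prefix.
        cache_control: { type: "ephemeral" },
      },
    ],
    // The question changes on every call, so it stays after the breakpoint.
    messages: [{ role: "user", content: question }],
  };
}

const first = buildCachedRequest("How do I reset the device?");
const second = buildCachedRequest("What is the warranty period?");

// The cached prefix is identical across both requests; only the question differs.
console.log(JSON.stringify(first.system) === JSON.stringify(second.system));
```

Keeping the variable part of the request after the marker is what lets the second call reuse the prefix written by the first.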
A few rules govern when the cache is actually used:
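The prefix must meet a minimum length to qualify for caching: 1024 tokens (approximately 750 words) on many models, with a higher minimum on some others.
Cache entries expire after five minutes of inactivity by default; each hit resets the timer. A one-hour lifetime is available at a higher write price.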
The model, version, and all content before the breakpoint must be byte-identical across calls.
The API response includes a usage object with cache_creation_input_tokens and cache_read_input_tokens so you can verify hits and measure savings in real time.
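As a worked example of reading those fields, the sketch below checks for a hit and estimates savings using the rates quoted earlier (125 percent for cache writes, 10 percent for cache reads). The helper name summarizeCache and the usage numbers are invented; the sketch assumes input_tokens counts only the uncached part of the prompt.

```javascript
// Sketch: summarize cache behavior from a response's usage object.
function summarizeCache(usage) {
  const written = usage.cache_creation_input_tokens || 0;
  const read = usage.cache_read_input_tokens || 0;
  const uncached = usage.input_tokens || 0;
  const total = written + read + uncached;
  // Cost expressed in units of "one normally priced input token".
  const relativeCost = uncached + written * 1.25 + read * 0.1;
  return {
    hit: read > 0,
    hitRatio: total === 0 ? 0 : read / total,
    // Positive means cheaper than sending everything uncached; negative means dearer.
    savings: total === 0 ? 0 : 1 - relativeCost / total,
  };
}

// Invented usage objects: the first call writes the cache, the second reads it.
const firstCall = summarizeCache({
  input_tokens: 50,
  cache_creation_input_tokens: 2000,
  cache_read_input_tokens: 0,
});
const secondCall = summarizeCache({
  input_tokens: 50,
  cache_creation_input_tokens: 0,
  cache_read_input_tokens: 2000,
});

console.log(firstCall, secondCall);
```

The first call comes out slightly more expensive than an uncached request because of the write premium; the saving only appears from the second call onward.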
Key points
Add cache_control breakpoint to stable content blocks
Prefix must meet the model's minimum length (often 1024 tokens) to qualify
Cache hit costs ~10% of normal input price
Check usage.cache_read_input_tokens to confirm hits
The Batch API
The Batch API lets you submit up to 100,000 requests (or 256 MB of data, whichever limit is reached first) in a single batch and get all the results back asynchronously (meaning you do not wait for a live reply; you check back later). In exchange for this flexibility, Anthropic charges 50% less per token than the standard real-time API.
You send a requests array in which each entry has its own unique custom_id, so you can match results to inputs, and a params object holding an ordinary Messages request. Claude processes them in the background, often within minutes for a few hundred requests, though the documented processing window allows up to 24 hours, after which unfinished requests expire.
The Batch API has its own independent rate limit, separate from the real-time API. This means heavy batch work does not eat into your interactive quota. It is ideal for any offline task: generating datasets, running evaluations, translating large catalogs, or classifying thousands of records.
Discount: 50% off input and output tokens vs. real-time pricing
Result format: one JSONL line per request, matched by custom_id
Cancellation: you can cancel a batch in flight with a single API call
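The custom_id matching described above can be sketched with two small helpers: one builds the batch entries, the other indexes a JSONL results file. The helper names, the sentiment task, and the sample result lines are invented; the result shape (a result object with a type such as "succeeded") follows the batch results format as best understood here.

```javascript
// Sketch: build batch entries, each with a unique custom_id.
function buildBatchRequests(texts) {
  return texts.map((text, i) => ({
    custom_id: `item-${i}`,
    params: {
      model: "claude-haiku-4-5",
      max_tokens: 50,
      messages: [{ role: "user", content: `Classify the sentiment: ${text}` }],
    },
  }));
}

// Sketch: index a JSONL results file by custom_id.
// Failed entries map to null so callers can retry them.
function indexResults(jsonl) {
  const byId = new Map();
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue; // skip blank lines
    const entry = JSON.parse(line);
    const ok = entry.result.type === "succeeded";
    byId.set(entry.custom_id, ok ? entry.result.message.content[0].text : null);
  }
  return byId;
}

const batch = buildBatchRequests(["I love it", "Terrible service"]);

// Invented results: one success, one failure (error details are placeholders).
const sampleJsonl = [
  { custom_id: "item-0", result: { type: "succeeded", message: { content: [{ type: "text", text: "positive" }] } } },
  { custom_id: "item-1", result: { type: "errored", error: { message: "example failure" } } },
].map((entry) => JSON.stringify(entry)).join("\n");

const results = indexResults(sampleJsonl);
for (const request of batch) {
  console.log(request.custom_id, "=>", results.get(request.custom_id));
}
```

Looking results up by custom_id, rather than by position, keeps the matching correct even if the output lines arrive in a different order from the inputs.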
Key points
Batch API cuts token costs by 50% for asynchronous workloads
Each request in a batch carries a custom_id for result matching
Batch rate limits are independent from real-time rate limits
Results arrive as a JSONL file, not a streaming response
Counting tokens
Before you send a request to Claude, you can ask the API how many input tokens (the chunks of text the model reads and writes) that request will consume. This uses the token counting endpoint: POST /v1/messages/count_tokens. It accepts the same core fields as a normal messages request (model, messages, and optionally system and tools) but returns only a count, never a generated reply, and costs nothing. Treat the number as a close estimate rather than an exact bill.
Token counts matter for two reasons. First, every model has a context window (the maximum tokens it can see at once): 200,000 for Opus, Sonnet, and Haiku. Second, you are billed per input and output token, so over-sending wastes money, and a request that exceeds the context window is rejected outright. Counting lets you stay under the limit and forecast cost before committing.
Key things you can count before sending:
System prompt alone, to understand its fixed overhead.
Tool definitions, which often surprise developers by being large.
Conversation history, to decide when to summarize or drop old turns.
Uploaded files or long documents, to verify they fit.
For token budgeting, set a soft ceiling in your code: if input_tokens from the count endpoint exceeds, say, 150,000, truncate or summarize before sending. You can also pair counting with the max_tokens parameter (which caps output length) to put an upper bound on spend per call.
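A rough sketch of that budgeting logic follows. The thresholds are the illustrative numbers from the text, not API constants; planRequest, trimHistory, and the crude character-based counter are hypothetical, and in real code the counter would be an asynchronous call to the count endpoint.

```javascript
// Illustrative thresholds taken from the text above.
const CONTEXT_WINDOW = 200000;
const SOFT_CEILING = 150000;

// Decide what to do given a counted prompt size and the reply budget.
function planRequest(inputTokens, maxTokens) {
  if (inputTokens + maxTokens > CONTEXT_WINDOW) return "too_large";
  if (inputTokens > SOFT_CEILING) return "summarize";
  return "send";
}

// Drop the oldest user/assistant pair until the count fits under the ceiling.
// Removing turns in pairs keeps a user turn at the front of the array.
function trimHistory(messages, countFn, ceiling) {
  const kept = [...messages];
  while (kept.length > 2 && countFn(kept) > ceiling) {
    kept.splice(0, 2);
  }
  return kept;
}

// Crude stand-in counter (about four characters per token), for offline use only.
const roughCount = (messages) =>
  Math.ceil(messages.reduce((sum, m) => sum + m.content.length, 0) / 4);

const history = [
  { role: "user", content: "a".repeat(40) },
  { role: "assistant", content: "b".repeat(40) },
  { role: "user", content: "c".repeat(40) },
  { role: "assistant", content: "d".repeat(40) },
  { role: "user", content: "e".repeat(40) },
];

console.log(planRequest(160000, 1000));                  // "summarize"
console.log(trimHistory(history, roughCount, 35).length); // 3
```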
Key points
Token counting endpoint: POST /v1/messages/count_tokens
Context window: 200,000 tokens for Opus, Sonnet, and Haiku (as of mid-2026)
Count before sending to catch overflows and forecast cost
Use max_tokens to cap output and control spend
Model ids, pricing and migration
Every Claude model has a model id, the exact string you pass to the API to request a specific version. As of June 2026, the three current ids are claude-opus-4-8 (most capable, highest cost), claude-sonnet-4-6 (balanced performance and cost), and claude-haiku-4-5 (fastest, lowest cost). Always use the full versioned id in production code, never an alias like "claude-opus" without a version suffix, because Anthropic may silently reroute aliases to newer models and change your costs or behavior.
Choosing the right model is a cost-performance trade-off. A practical rule of thumb:
Opus (claude-opus-4-8): architecture decisions, complex reasoning, long document analysis, agentic loops where quality matters most.
Sonnet (claude-sonnet-4-6): everyday coding tasks, summarization, drafting, multi-step workflows where speed and cost matter.
Haiku (claude-haiku-4-5): classification, routing, quick lookups, and high-volume jobs where latency or cost per call is critical.
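The rule of thumb above could be encoded as a small lookup, sketched here. The task labels and the choice of Sonnet as the fallback are assumptions made for illustration, not an official routing scheme.

```javascript
// Sketch: route a task label to a model tier. Task labels are invented.
const MODEL_BY_TASK = {
  architecture: "claude-opus-4-8",
  "complex-reasoning": "claude-opus-4-8",
  coding: "claude-sonnet-4-6",
  summarization: "claude-sonnet-4-6",
  classification: "claude-haiku-4-5",
  routing: "claude-haiku-4-5",
};

// Fall back to the balanced tier for unknown tasks (an assumption of this sketch).
function pickModel(task) {
  return MODEL_BY_TASK[task] || "claude-sonnet-4-6";
}

console.log(pickModel("classification")); // "claude-haiku-4-5"
console.log(pickModel("architecture"));   // "claude-opus-4-8"
console.log(pickModel("something-else")); // "claude-sonnet-4-6"
```

Keeping the mapping in one place also makes a later migration a one-line change per tier.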
Migration means switching your codebase from an old model id to a newer one. The safe pattern is: update the model id string, run your existing eval or test suite against the new model, compare outputs on a sample of real prompts, then ship. Because newer models may refuse differently or format output differently, never migrate without a comparison step. Anthropic publishes a migration guide for each generation; check it for breaking changes in tool-call formats or context-window sizes before you switch.
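The comparison step might look something like the sketch below, which assumes you have already collected outputs from both models for the same prompts. The compareRuns helper, the equivalence check, and the sample outputs are all invented; a real eval would use a task-specific notion of "same enough".

```javascript
// Sketch: compare outputs from the old and new model on the same prompts.
// sameEnough is a caller-supplied check (exact match, JSON equality, a score...).
function compareRuns(prompts, oldOutputs, newOutputs, sameEnough) {
  const changes = [];
  prompts.forEach((prompt, i) => {
    if (!sameEnough(oldOutputs[i], newOutputs[i])) {
      changes.push({ prompt, before: oldOutputs[i], after: newOutputs[i] });
    }
  });
  return { total: prompts.length, changed: changes.length, changes };
}

// Invented sample data: the second output changes format after migration.
const prompts = ["Classify: great product", "Classify: broke in a day"];
const oldOutputs = ["positive", "negative"];
const newOutputs = ["positive", "Sentiment: negative"];

// Here "same enough" means identical after trimming and lowercasing.
const normalize = (s) => s.trim().toLowerCase();
const report = compareRuns(prompts, oldOutputs, newOutputs,
  (a, b) => normalize(a) === normalize(b));

console.log(`${report.changed} of ${report.total} outputs changed`);
```

Reviewing the changes list by hand before shipping is what catches formatting or refusal differences that a test suite alone may miss.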
Pricing is per-token (a token is roughly four characters of English text). You pay separately for input tokens (what you send) and output tokens (what the model returns). Output tokens cost more. Use prompt caching to reuse a large system prompt across calls and cut input costs by up to 90 percent on the cached portion. The Anthropic Batch API gives a 50 percent discount on both input and output at the cost of higher latency, ideal for offline dataset generation.
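To see how these levers combine, here is a worked arithmetic sketch. The prices are placeholders in dollars per million tokens, NOT real Anthropic rates; only the multipliers (cache reads at about 10 percent of input price, batch at 50 percent off) come from the text. The estimateCost helper is hypothetical, ignores the cache write premium, and assumes the two discounts can be stacked.

```javascript
// PLACEHOLDER prices in dollars per million tokens. Not real rates.
const PRICE = { inputPerMTok: 1.0, outputPerMTok: 5.0 };

// Estimate the cost of one workload.
// cachedReadTokens is the part of inputTokens served from the cache.
function estimateCost({ inputTokens, outputTokens, cachedReadTokens = 0 }, { batch = false } = {}) {
  const uncached = inputTokens - cachedReadTokens;
  let cost =
    (uncached * PRICE.inputPerMTok +
      cachedReadTokens * PRICE.inputPerMTok * 0.1 + // cache hits at ~10% of input price
      outputTokens * PRICE.outputPerMTok) / 1e6;
  if (batch) cost *= 0.5; // Batch API: 50% off input and output
  return cost;
}

const workload = { inputTokens: 1000000, outputTokens: 200000 };

const plain = estimateCost(workload);
const cached = estimateCost({ ...workload, cachedReadTokens: 900000 });
const batched = estimateCost(workload, { batch: true });

console.log({ plain, cached, batched });
```

With these placeholder prices the plain run costs about 2.00, caching 90 percent of the input brings it to about 1.19, and batching the plain run halves it to about 1.00.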
Key points
Use full versioned model ids, never bare aliases, in production.
Opus for quality, Sonnet for balance, Haiku for speed and volume.
Always run an eval comparison before migrating to a new model id.
Prompt caching and the Batch API are the two main cost levers.
Vision and PDF inputs
The Claude API accepts more than text. You can send images and PDF files directly in the messages array, alongside or instead of a text prompt. The model reads the visual content and reasons over it just as it would over written words. This capability is called multimodal input (multi-format, not text-only).
Images are passed as base64-encoded strings (a way of representing binary file data as plain ASCII text) inside a content block with "type": "image". The block's source object has "type": "base64", a media_type such as image/jpeg, image/png, image/gif, or image/webp, and the encoded data. Alternatively you can reference a public URL by giving the same image block a source of "type": "url" with a "url" field instead of base64 data.
PDFs use "type": "document" with "media_type": "application/pdf" and the file content as base64. Claude reads the full text layer of the PDF and, when pages contain diagrams or charts, also interprets those visually. PDFs are capped at 100 pages and roughly 32 MB per file.
Supported image formats: JPEG, PNG, GIF, WebP.
Max image size over the API: 5 MB per image (base64 encoding makes the payload about 33 percent larger than the raw file).
Up to 100 images per API request on most models; the 20-image limit applies to the claude.ai chat interface.
PDFs: max 100 pages, 32 MB raw. Text and visual content both parsed.
Vision is available on claude-opus-4-8, claude-sonnet-4-6, and claude-haiku-4-5.
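The content block shapes described above can be sketched as three small builders. The helper names, the example URL, and the four sample bytes (the start of a PNG signature, not a viewable image) are illustrative; in real code the bytes would come from reading a file.

```javascript
// Sketch: build an image block from raw bytes (base64 source).
function imageBlockFromBytes(bytes, mediaType) {
  return {
    type: "image",
    source: {
      type: "base64",
      media_type: mediaType, // e.g. image/png
      data: Buffer.from(bytes).toString("base64"),
    },
  };
}

// Sketch: build an image block that points at a public URL.
function imageBlockFromUrl(url) {
  return { type: "image", source: { type: "url", url } };
}

// Sketch: build a PDF document block from raw bytes.
function pdfBlockFromBytes(bytes) {
  return {
    type: "document",
    source: {
      type: "base64",
      media_type: "application/pdf",
      data: Buffer.from(bytes).toString("base64"),
    },
  };
}

// Four sample bytes stand in for file contents.
const sampleBytes = [0x89, 0x50, 0x4e, 0x47];

const visionMessage = {
  role: "user",
  content: [
    imageBlockFromBytes(sampleBytes, "image/png"),
    imageBlockFromUrl("https://example.com/chart.png"), // placeholder URL
    { type: "text", text: "Compare these two charts." },
  ],
};

console.log(visionMessage.content[0].source.data); // prints "iVBORw=="
```

The message then goes into the messages array like any text-only turn.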
Key points
Pass images via base64 or URL in a content block with type:image
PDFs use type:document and media_type:application/pdf
Limits: up to 100 images per API request at 5 MB each, PDFs up to 100 pages and 32 MB
Vision works on all three current Claude model tiers