The Claude Bible
Home / Other LLMs and routing
Level: Expert · 10 lessons

Other LLMs and routing

The model landscape, the refusal map, picking the right tool for each task.

Open the interactive course212 lessons, quizzes, exercises, 3 languages, free.

The landscape: Claude, GPT, Gemini, and the rest

Claude is not alone. Knowing the landscape lets you pick the right model per task and transfer techniques (a good prompt stays good everywhere). The big families:

Pierre's CL4R1T4S corpus collects the real system prompts of all these vendors (and of the Cursor, v0, Lovable, Devin, Perplexity tools...). His Pattern Bank extracts 75+ reusable patterns from them, sorted into 13 categories. The meta-lesson: the best prompting practices are the same everywhere, because all these models share the next-token-prediction nature. Learning Claude is learning to talk to all of them.

Key points
  • Families: Claude (code/reasoning/safety), GPT (ecosystem), Gemini (giant context), open-weights (local), Grok (permissive)
  • Prompting techniques transfer: same next-token-prediction nature
  • CL4R1T4S: a collection of real system prompts + a Pattern Bank of 75+ patterns
  • Learning Claude = learning to talk to all LLMs

Refusal map and tactical routing

Each model has its refusal policy declared in its system prompt. Pierre's Refusal Map is a matrix of 14 vendors x 14 categories (copyright, cyber offense/defense, PII, medical, legal, financial, etc.) that summarizes, per model, what is strict, soft, or open.

The use is tactical routing: for a legitimate task that hits an overly broad refusal, pick the model whose policy leaves the appropriate latitude. Documented examples:

Two honest warnings Pierre notes himself:

The real transferable lesson: giving legitimate context gets a request through. "Security audit of my own site, here is the authorization" is treated very differently from the same request without a frame. Honest framing is the first key, well before the choice of model.

Key points
  • Refusal Map = matrix of vendors x refusal categories (strict/soft/open)
  • Tactical routing: for a legitimate task, pick the model with the right latitude
  • Declarative map, not behavioral: test for real; hard lines uncrossable everywhere
  • Legitimate, honest framing unblocks more than the choice of model

GPT-5 and the OpenAI family

OpenAI's GPT-5 (released 2025) is the current flagship of the GPT series. It competes directly with Claude Opus in reasoning, long-form writing, and multimodal tasks (meaning it can process both text and images). GPT-5 is notably strong at instruction-following for mass consumer use cases and is deeply integrated into Microsoft products via Azure OpenAI Service.

The OpenAI family is organised in tiers, much like Claude's Opus / Sonnet / Haiku ladder:

Compared to Claude, GPT-5 tends to be more permissive on edge-case content and is optimised for breadth of user tasks. Claude (especially Opus, model id claude-opus-4-8) is generally preferred for nuanced long documents, strict instruction chains, and agentic coding workflows where refusals and hallucinations (invented facts) carry high cost. The two model families differ most visibly in context window handling: Claude 3.x and 4.x support up to 200 000 tokens of context, while GPT-5 supports 128 000 tokens in most API configurations.

When routing tasks across models, the practical question is not "which is smarter" but "which is more reliable for this specific task at this cost." GPT-5 via the OpenAI API and Claude via the Anthropic API are both callable from the same orchestration code, so real-world systems often use both, assigning tasks by strength.

Key points
  • GPT-5 is OpenAI's flagship, competitive with Claude Opus on reasoning and multimodal tasks.
  • The OpenAI tier ladder: GPT-5, GPT-4o, o3/o4-mini (reasoning), GPT-4o mini.
  • Claude supports up to 200k token context; GPT-5 API caps at 128k in most configurations.
  • Route by task fit and cost, not by a single 'best model' verdict.

Gemini and very long context

Google's Gemini family (Ultra, Pro, Flash) is the main rival to Claude and GPT-4 class models. Its defining feature is an enormous context window (the maximum amount of text, code, or data a model can read in one request). As of mid-2026, Gemini 1.5 Pro supports up to 1 million tokens, and Gemini 1.5 Flash up to 1 million tokens at lower cost. For reference, one token is roughly 3 to 4 characters of English text, so 1 million tokens fits several large novels or an entire mid-size codebase.

When does a long context window actually matter? It matters when you cannot break your input into smaller chunks without losing meaning. Common cases include:

Claude models (Opus claude-opus-4-8, Sonnet claude-sonnet-4-6) offer up to 200 000 tokens of context, which covers most professional tasks. Gemini's edge is the cases where even 200 000 tokens is not enough. The practical tradeoff: quality of reasoning tends to be higher in Claude and GPT-4 class models on complex multi-step tasks, while Gemini Flash trades some reasoning depth for speed and price at scale.

Key points
  • Gemini Pro and Flash: up to 1 million token context window
  • Context window size matters most when input cannot be chunked
  • Claude tops out near 200 000 tokens, strong reasoning quality
  • Choose the model by task shape, not brand loyalty

Open models: Llama, Mistral

Open-weights models are AI models whose trained parameters (the numerical values that define the model's behavior) are released publicly, so anyone can download and run them locally or on their own servers. The two most prominent families are Meta Llama (Llama 3, Llama 4) and Mistral (Mistral 7B, Mixtral, Mistral Large). Unlike Claude or GPT, no API key or monthly subscription is required to run them once downloaded.

The core trade-off is control versus capability. Open-weights models give you full data privacy (nothing leaves your machine), zero per-token cost at inference time, and the ability to fine-tune (re-train on your own data) for a specific domain. The cost is that you supply the hardware, manage updates, and accept that frontier capability still lags behind the top proprietary models like Claude Opus or GPT-4o as of mid-2026.

When routing a workload, prefer open-weights models when one or more of these conditions apply:

A practical stack: run Ollama (a local model server, free) to serve Llama or Mistral on your laptop or a rented GPU, then point your code at http://localhost:11434 using the same OpenAI-compatible API shape. For production, quantized (compressed) 4-bit versions of Llama 3 8B run on a single consumer GPU with 8 GB VRAM.

Key points
  • Open-weights: parameters are public, self-hostable
  • Best for: privacy, high volume, fine-tuning, offline
  • Ollama serves Llama/Mistral locally via REST API
  • Trade-off: control and cost vs. frontier capability

Running a model locally with Ollama

Local inference means running an AI model entirely on your own machine, so no data ever leaves your hardware. Ollama is the most popular tool for this: it downloads open-weight models (models whose weights are publicly released), manages them like Docker images, and exposes a local REST API on port 11434.

The core tradeoff is capability versus control. Cloud models like Claude Opus or GPT-4 run on provider servers and give you the best reasoning at the cost of sending your text to a third party. Local models run on your CPU or GPU with zero network calls, but they are smaller and less capable for complex reasoning tasks.

Key use cases for local inference:

The main models available through Ollama include Llama 3 (Meta), Mistral, Gemma (Google), Phi-3 (Microsoft), and many fine-tuned variants. None of them match Claude Opus on hard reasoning today, but they are entirely adequate for classification, summarization, templated extraction, and code completion on familiar patterns.

Key points
  • Local inference: model runs on your hardware, no data sent out
  • Ollama manages open-weight models and serves a local API
  • Tradeoff: privacy and zero cost per call vs. lower capability
  • Best for sensitive data, offline use, or very high call volumes

Composing a multi-vendor system prompt

No single vendor's default behavior is optimal for every task. Multi-vendor prompt composition means reading the published or reverse-engineered system prompts of several AI products, extracting the rules that matter for your use case, and merging them into one coherent system prompt you control.

Each vendor has solved a different problem well. Cursor (an AI code editor) enforces strict file-edit discipline: it never rewrites a file it has not read first and always shows a diff before applying changes. Perplexity enforces inline citation: every factual claim carries a numbered source reference. GPT-4o's system prompt enforces anti-hedging: it forbids phrases like "I think" or "I'm not sure" when the model has enough context to be direct. Cline and Devin enforce autonomous loop discipline: the model must declare a plan, execute it step by step, and halt only on ambiguity or cost gates.

When you combine these into one system prompt for Claude (using claude-opus-4-8 for complex reasoning or claude-sonnet-4-6 for speed), you get a single agent that cites sources, edits files safely, stays direct, and runs autonomously without constant confirmation prompts. The technique is sometimes called a Frankenstein prompt because it stitches rules from multiple sources into one body.

Key points
  • Extract the strongest rule from each vendor prompt
  • Merge rules into one system prompt without contradiction
  • Test the composed prompt against a real task before deploying
  • Owner rules always override vendor defaults

Tactical routing per task

Not every task deserves the same model. Tactical routing means choosing the model whose strengths match the job at hand, so you spend compute where it pays off and avoid paying a premium for tasks that need no deep reasoning.

The three tiers in June 2026 are: Opus (claude-opus-4-8) for complex reasoning, architecture, and judgment calls; Sonnet (claude-sonnet-4-6) for the broad middle ground of coding, drafting, and analysis; Haiku (claude-haiku-4-5) for fast, high-volume, simple tasks such as classification or extraction. Routing wrong in either direction costs you: using Opus to rename a variable wastes budget, using Haiku to design a distributed system risks shallow output.

A practical routing heuristic covers four signals:

In Claude Code (the CLI and IDE coding agent) you switch model with the --model flag or the /model command inside a session. Agents and pipelines built on the Anthropic API can route programmatically by passing the model parameter per request, so a single pipeline can use Haiku for pre-filtering and Opus only for the final judgment step.

Key points
  • Tactical routing: matching model tier to task requirements
  • Opus for judgment, Sonnet for breadth, Haiku for speed
  • Four routing signals: stakes, novelty, output length, latency
  • Claude Code --model flag lets you switch per session or per call

Refusal differences across vendors

Every major LLM vendor trains its model with a refusal policy: a set of rules that cause the model to decline certain requests. These policies differ in scope, tone, and consistency. Knowing the differences lets you route tasks to the model most likely to complete them without friction.

The main dimensions where vendors diverge are listed below. A hard refusal means the model will not comply regardless of how the prompt is phrased. A soft refusal means the model resists by default but can be unlocked with system-prompt context, role assignment, or explicit permission from the API caller.

The practical routing strategy is: use a system prompt to establish context before the refusal happens. If a model still refuses after context-setting, switch vendors rather than trying to trick the model with prompt injection (a technique that tries to override instructions by hiding commands in the input), which is unreliable and violates terms of service.

Key points
  • Hard vs. soft refusals depend on vendor policy and operator context
  • System-prompt context is the legitimate unlock mechanism
  • Open-weight models have no enforced refusal policy
  • Route by task type: choose the vendor whose policy fits the use case

Fine-tuning vs prompting

A large language model (LLM) like Claude can follow instructions written in plain text, a technique called prompting. Fine-tuning is different: you take an existing model and continue training it on your own dataset so the weights themselves change. Both approaches can make a model behave the way you want, but they solve different problems.

Prompting wins in most cases because it is fast, cheap, and reversible. You iterate in minutes, pay only for inference (the compute used when the model answers), and switch models without losing anything. Fine-tuning requires collecting hundreds or thousands of labeled examples, paying for GPU training time, hosting the resulting model, and repeating the whole process whenever your needs change.

Fine-tuning earns its cost in a narrow set of situations:

A practical rule: exhaust prompting first. Use system prompts, few-shot examples (a handful of input/output pairs included in the prompt), and retrieval-augmented generation (RAG) (fetching relevant documents at runtime) before touching fine-tuning. Fine-tuning fixes behavior; prompting shapes it. If the gap between what the model does and what you need is a matter of knowledge or style that fits in a context window, prompting is almost always the right answer.

Key points
  • Prompting is fast and reversible; prefer it by default
  • Fine-tune only for scale, strict output format, domain vocabulary, or data privacy
  • Few-shot examples and RAG can replace fine-tuning in many cases
  • Fine-tuned small models cut cost at high request volume
Work with me

Master Claude, Claude Code and LLMs, from your first prompt to multi-agent orchestration.

Like this course? I built it end to end. Need a web app, mobile app, AI automation or SEO/GEO? Let us talk.

Contact me on LinkedInSee a site I built