The landscape: Claude, GPT, Gemini, and the rest

Claude is not alone. Knowing the landscape lets you pick the right model per task and transfer techniques (a good prompt stays good everywhere). The big families:

Anthropic Claude: the reference for code, long reasoning, instruction-following and safety. Strong at agentics (Claude Code, Cowork).
OpenAI GPT: a wide ecosystem, multimodal, mature function calling, the Atlas/computer-use agent.
Google Gemini: very large context windows, Google integration, native multimodal.
Open-weights (Llama, Mistral, Kimi, DeepSeek): open weights, locally deployable, fine-tunable. The playground of sovereignty and zero marginal cost.
xAI Grok: more permissive on certain content, real-time.

Pierre's CL4R1T4S corpus collects the real system prompts of all these vendors (and of the Cursor, v0, Lovable, Devin, Perplexity tools...). His Pattern Bank extracts 75+ reusable patterns from them, sorted into 13 categories. The meta-lesson: the best prompting practices are the same everywhere, because all these models share the next-token-prediction nature. Learning Claude is learning to talk to all of them.

Update, July 2026: Anthropic's ladder gained a rung. The Claude 5 family (Fable 5, and its unrestricted sibling Mythos 5 reserved for approved organizations) now sits above Opus 4.8, and Sonnet 5 replaced Sonnet 4.6 as the mid-tier default. The foundations module has the full story.

Key points

Families: Claude (code/reasoning/safety), GPT (ecosystem), Gemini (giant context), open-weights (local), Grok (permissive)
Prompting techniques transfer: same next-token-prediction nature
CL4R1T4S: a collection of real system prompts + a Pattern Bank of 75+ patterns
Learning Claude = learning to talk to all LLMs

Refusal map and tactical routing

Each model has its refusal policy declared in its system prompt. Pierre's Refusal Map is a matrix of 14 vendors x 14 categories (copyright, cyber offense/defense, PII, medical, legal, financial, etc.) that summarizes, per model, what is strict, soft, or open.

The use is tactical routing: for a legitimate task that hits an overly broad refusal, pick the model whose policy leaves the appropriate latitude. Documented examples:

Defensive audit / hardening of your own site: Claude is very comfortable.
Legal or financial decision: Claude answers but with a caveat; you keep the human in the loop.

Two honest warnings Pierre notes himself:

The map is declarative (what the prompt says), not behavioral: classifiers may block beyond it. Test for real before cataloguing.
It only serves to unblock the legitimate. The hard lines (real malware, CBRN weapons, sexual content involving minors) are uncrossable everywhere, and rightly so.

The real transferable lesson: giving legitimate context gets a request through. "Security audit of my own site, here is the authorization" is treated very differently from the same request without a frame. Honest framing is the first key, well before the choice of model.

Key points

Refusal Map = matrix of vendors x refusal categories (strict/soft/open)
Tactical routing: for a legitimate task, pick the model with the right latitude
Declarative map, not behavioral: test for real; hard lines uncrossable everywhere
Legitimate, honest framing unblocks more than the choice of model

GPT-5 and the OpenAI family

OpenAI's GPT-5 (released 2025) is the current flagship of the GPT series. It competes directly with Claude Opus in reasoning, long-form writing, and multimodal tasks (meaning it can process both text and images). GPT-5 is notably strong at instruction-following for mass consumer use cases and is deeply integrated into Microsoft products via Azure OpenAI Service.

The OpenAI family is organised in tiers, much like Claude's Opus / Sonnet / Haiku ladder:

GPT-5: flagship, highest capability, highest cost per token (a token is roughly 0.75 words).
GPT-4o ("omni"): fast multimodal model, default in ChatGPT, good cost-to-quality balance.
o3 / o4-mini: OpenAI's "reasoning" models that think step-by-step before answering, similar in concept to Claude's extended thinking mode.
GPT-4o mini: low-cost, high-speed, comparable to Haiku tier.

Compared to Claude, GPT-5 tends to be more permissive on edge-case content and is optimised for breadth of user tasks. Claude (especially Opus, model id claude-opus-4-8) is generally preferred for nuanced long documents, strict instruction chains, and agentic coding workflows where refusals and hallucinations (invented facts) carry high cost. The two model families differ most visibly in context window handling: Claude 3.x and 4.x support up to 200 000 tokens of context, while GPT-5 supports 128 000 tokens in most API configurations.

When routing tasks across models, the practical question is not "which is smarter" but "which is more reliable for this specific task at this cost." GPT-5 via the OpenAI API and Claude via the Anthropic API are both callable from the same orchestration code, so real-world systems often use both, assigning tasks by strength.

Key points

GPT-5 is OpenAI's flagship, competitive with Claude Opus on reasoning and multimodal tasks.
The OpenAI tier ladder: GPT-5, GPT-4o, o3/o4-mini (reasoning), GPT-4o mini.
Claude supports up to 200k token context; GPT-5 API caps at 128k in most configurations.
Route by task fit and cost, not by a single 'best model' verdict.

Gemini and very long context

Google's Gemini family (Ultra, Pro, Flash) is the main rival to Claude and GPT-4 class models. Its defining feature is an enormous context window (the maximum amount of text, code, or data a model can read in one request). As of mid-2026, Gemini 1.5 Pro supports up to 1 million tokens, and Gemini 1.5 Flash up to 1 million tokens at lower cost. For reference, one token is roughly 3 to 4 characters of English text, so 1 million tokens fits several large novels or an entire mid-size codebase.

When does a long context window actually matter? It matters when you cannot break your input into smaller chunks without losing meaning. Common cases include:

Analyzing a full legal contract or research paper without summarizing first
Debugging a large codebase by feeding all files at once
Searching an entire conversation log or transcript for a specific detail
Processing hour-long video or audio transcripts in a single call

Claude models (Opus claude-opus-4-8, Sonnet claude-sonnet-4-6) offer up to 200 000 tokens of context, which covers most professional tasks. Gemini's edge is the cases where even 200 000 tokens is not enough. The practical tradeoff: quality of reasoning tends to be higher in Claude and GPT-4 class models on complex multi-step tasks, while Gemini Flash trades some reasoning depth for speed and price at scale.

Update, July 2026: Google discontinued the Gemini CLI on June 18, 2026 for free, Pro and Ultra tiers, replacing it with the closed-source Antigravity CLI (no feature parity at launch). Anywhere this course mentioned the gemini command as a CLI alternative, read Antigravity CLI, with that caveat.

Key points

Gemini Pro and Flash: up to 1 million token context window
Context window size matters most when input cannot be chunked
Claude tops out near 200 000 tokens, strong reasoning quality
Choose the model by task shape, not brand loyalty

Open models: Llama, Mistral

Open-weights models are AI models whose trained parameters (the numerical values that define the model's behavior) are released publicly, so anyone can download and run them locally or on their own servers. The two most prominent families are Meta Llama (Llama 3, Llama 4) and Mistral (Mistral 7B, Mixtral, Mistral Large). Unlike Claude or GPT, no API key or monthly subscription is required to run them once downloaded.

The core trade-off is control versus capability. Open-weights models give you full data privacy (nothing leaves your machine), zero per-token cost at inference time, and the ability to fine-tune (re-train on your own data) for a specific domain. The cost is that you supply the hardware, manage updates, and accept that frontier capability still lags behind the top proprietary models like Claude Opus or GPT-4o as of mid-2026.

When routing a workload, prefer open-weights models when one or more of these conditions apply:

Data sensitivity: legal, medical, or internal documents that must not leave your infrastructure.
High volume, low complexity: classification, extraction, or summarization tasks where a 7B or 8B model is accurate enough and cost per call matters.
Fine-tuning required: you need domain vocabulary or a house style that prompt engineering alone cannot deliver reliably.
Offline or edge deployment: no reliable internet connection, or latency constraints that a remote API cannot meet.

A practical stack: run Ollama (a local model server, free) to serve Llama or Mistral on your laptop or a rented GPU, then point your code at http://localhost:11434 using the same OpenAI-compatible API shape. For production, quantized (compressed) 4-bit versions of Llama 3 8B run on a single consumer GPU with 8 GB VRAM.

Key points

Open-weights: parameters are public, self-hostable
Best for: privacy, high volume, fine-tuning, offline
Ollama serves Llama/Mistral locally via REST API
Trade-off: control and cost vs. frontier capability

Running a model locally with Ollama

Local inference means running an AI model entirely on your own machine, so no data ever leaves your hardware. Ollama is the most popular tool for this: it downloads open-weight models (models whose weights are publicly released), manages them like Docker images, and exposes a local REST API on port 11434.

The core tradeoff is capability versus control. Cloud models like Claude Opus or GPT-4 run on provider servers and give you the best reasoning at the cost of sending your text to a third party. Local models run on your CPU or GPU with zero network calls, but they are smaller and less capable for complex reasoning tasks.

Key use cases for local inference:

Privacy-sensitive data: medical records, legal documents, internal code you cannot send to an external API.
Offline or air-gapped environments: factories, field devices, or secure networks with no internet.
Cost at high volume: once the model is downloaded, each call is free, making it attractive for millions of short completions.
Low-latency loops: a model running locally can respond in under a second on a modern GPU, avoiding round-trip network time.

The main models available through Ollama include Llama 3 (Meta), Mistral, Gemma (Google), Phi-3 (Microsoft), and many fine-tuned variants. None of them match Claude Opus on hard reasoning today, but they are entirely adequate for classification, summarization, templated extraction, and code completion on familiar patterns.

Key points

Local inference: model runs on your hardware, no data sent out
Ollama manages open-weight models and serves a local API
Tradeoff: privacy and zero cost per call vs. lower capability
Best for sensitive data, offline use, or very high call volumes

Composing a multi-vendor system prompt

No single vendor's default behavior is optimal for every task. Multi-vendor prompt composition means reading the published or reverse-engineered system prompts of several AI products, extracting the rules that matter for your use case, and merging them into one coherent system prompt you control.

Each vendor has solved a different problem well. Cursor (an AI code editor) enforces strict file-edit discipline: it never rewrites a file it has not read first and always shows a diff before applying changes. Perplexity enforces inline citation: every factual claim carries a numbered source reference. GPT-4o's system prompt enforces anti-hedging: it forbids phrases like "I think" or "I'm not sure" when the model has enough context to be direct. Cline and Devin enforce autonomous loop discipline: the model must declare a plan, execute it step by step, and halt only on ambiguity or cost gates.

When you combine these into one system prompt for Claude (using claude-opus-4-8 for complex reasoning or claude-sonnet-4-6 for speed), you get a single agent that cites sources, edits files safely, stays direct, and runs autonomously without constant confirmation prompts. The technique is sometimes called a Frankenstein prompt because it stitches rules from multiple sources into one body.

Read before write (Cursor): always read a file before editing it; show a diff summary.
Cite every claim (Perplexity): append [source: ...] or a numbered footnote to factual statements.
No hedging (GPT-4o): ban filler phrases; be direct when context is sufficient.
Plan then execute (Cline/Devin): declare steps before acting; stop only on ambiguity or cost gate.
Archiver, jamais supprimer (owner rule): never delete, always move to _ARCHIVES/.

Key points

Extract the strongest rule from each vendor prompt
Merge rules into one system prompt without contradiction
Test the composed prompt against a real task before deploying
Owner rules always override vendor defaults

Tactical routing per task

Not every task deserves the same model. Tactical routing means choosing the model whose strengths match the job at hand, so you spend compute where it pays off and avoid paying a premium for tasks that need no deep reasoning.

The three tiers in June 2026 are: Opus (claude-opus-4-8) for complex reasoning, architecture, and judgment calls; Sonnet (claude-sonnet-4-6) for the broad middle ground of coding, drafting, and analysis; Haiku (claude-haiku-4-5) for fast, high-volume, simple tasks such as classification or extraction. Routing wrong in either direction costs you: using Opus to rename a variable wastes budget, using Haiku to design a distributed system risks shallow output.

A practical routing heuristic covers four signals:

Stakes: will a wrong answer cause real harm or rework? Bias toward Opus.
Novelty: is the problem well-defined and repetitive? Haiku or Sonnet suffices.
Output length: long structured documents benefit from Sonnet or Opus reasoning over many tokens.
Latency budget: if a user is waiting under two seconds, Haiku wins on speed regardless of task complexity.

In Claude Code (the CLI and IDE coding agent) you switch model with the --model flag or the /model command inside a session. Agents and pipelines built on the Anthropic API can route programmatically by passing the model parameter per request, so a single pipeline can use Haiku for pre-filtering and Opus only for the final judgment step.

Key points

Tactical routing: matching model tier to task requirements
Opus for judgment, Sonnet for breadth, Haiku for speed
Four routing signals: stakes, novelty, output length, latency
Claude Code --model flag lets you switch per session or per call

Refusal differences across vendors

Every major LLM vendor trains its model with a refusal policy: a set of rules that cause the model to decline certain requests. These policies differ in scope, tone, and consistency. Knowing the differences lets you route tasks to the model most likely to complete them without friction.

The main dimensions where vendors diverge are listed below. A hard refusal means the model will not comply regardless of how the prompt is phrased. A soft refusal means the model resists by default but can be unlocked with system-prompt context, role assignment, or explicit permission from the API caller.

Medical and legal detail: Claude (Anthropic) tends to add disclaimers but will go further with a system prompt that establishes a professional context. GPT-4o (OpenAI) is similar. Gemini (Google) is more conservative on clinical specifics.
Security and offensive content: All major vendors hard-refuse step-by-step weapon synthesis. For dual-use security topics (penetration testing, exploit analysis), Claude with an operator system prompt is generally the most permissive among the big three.
Creative fiction with dark themes: Claude allows mature literary content when the operator enables it. GPT-4o is stricter by default on violence and explicit content. Open-weight models (Llama, Mistral) running locally have no enforced policy at all.
Political and controversial opinion: Claude declines to express personal opinions on contested political topics. GPT-4o behaves similarly. Open-weight models will often state an opinion if asked directly.

The practical routing strategy is: use a system prompt to establish context before the refusal happens. If a model still refuses after context-setting, switch vendors rather than trying to trick the model with prompt injection (a technique that tries to override instructions by hiding commands in the input), which is unreliable and violates terms of service.

Key points

Hard vs. soft refusals depend on vendor policy and operator context
System-prompt context is the legitimate unlock mechanism
Open-weight models have no enforced refusal policy
Route by task type: choose the vendor whose policy fits the use case

Fine-tuning vs prompting

A large language model (LLM) like Claude can follow instructions written in plain text, a technique called prompting. Fine-tuning is different: you take an existing model and continue training it on your own dataset so the weights themselves change. Both approaches can make a model behave the way you want, but they solve different problems.

Prompting wins in most cases because it is fast, cheap, and reversible. You iterate in minutes, pay only for inference (the compute used when the model answers), and switch models without losing anything. Fine-tuning requires collecting hundreds or thousands of labeled examples, paying for GPU training time, hosting the resulting model, and repeating the whole process whenever your needs change.

Fine-tuning earns its cost in a narrow set of situations:

Latency and cost at massive scale: a small fine-tuned model (7B or 8B parameters) answering millions of requests per day is far cheaper than routing every call to a frontier model.
Highly structured output: if you need the model to always emit valid JSON in a fixed schema, fine-tuning enforces the format more reliably than a prompt.
Domain vocabulary or style: medical, legal, or industry-specific text where the base model consistently uses wrong terminology.
Data cannot leave your servers: a locally hosted fine-tuned model avoids sending sensitive records to a third-party API.

A practical rule: exhaust prompting first. Use system prompts, few-shot examples (a handful of input/output pairs included in the prompt), and retrieval-augmented generation (RAG) (fetching relevant documents at runtime) before touching fine-tuning. Fine-tuning fixes behavior; prompting shapes it. If the gap between what the model does and what you need is a matter of knowledge or style that fits in a context window, prompting is almost always the right answer.

Key points

Prompting is fast and reversible; prefer it by default
Fine-tune only for scale, strict output format, domain vocabulary, or data privacy
Few-shot examples and RAG can replace fine-tuning in many cases
Fine-tuned small models cut cost at high request volume

Other LLMs and routing

The landscape: Claude, GPT, Gemini, and the rest

Refusal map and tactical routing

GPT-5 and the OpenAI family

Gemini and very long context

Open models: Llama, Mistral

Running a model locally with Ollama

Composing a multi-vendor system prompt

Tactical routing per task

Refusal differences across vendors

Fine-tuning vs prompting

Need this level of execution on your project?

Inspired by 0xloucash