Home / Advanced prompt engineering

Level: Advanced · 12 lessons

Advanced prompt engineering

Chaining, tool use, multi-vendor composition (Frankenstein), citation, anti-hedging.

Open the interactive course237 lessons, quizzes, exercises, a final exam with a diploma, 3 languages, free.

Chaining prompts

A complex task run in a single monolithic prompt fails more often than a chain of simple prompts, each with one responsibility. The principle: split, and feed the output of one as the input of the next.

Example: "write a complete SEO article" gives something lukewarm. A chain gives something solid:

Prompt 1: research and list the angles + keywords.
Prompt 2: from that list, produce a detailed outline.
Prompt 3: write section by section from the outline.
Prompt 4: proofread and fix (a critic agent).

Each link is verifiable and fixable independently. It is also the mental basis of multi-agent workflows: a pipeline where each step is a specialized agent. Pierre's rule "Opus for architecture, Haiku for the repetitive" plays out here: you chain by assigning the right model to each link.

Key points

Split a complex task into a chain of simple prompts
Output of one link = input of the next; each link verifiable
Basis of workflows: a pipeline of specialized agents
Assign the right model to each link

Tool use: giving the model hands

Tool use (tool calling, or function calling) lets Claude call functions you define: query a database, hit an API, do a calculation, read a file. You describe the tool (name, description, parameter schema), Claude decides when and how to call it, you execute and return the result.

It is the engine of every agent (Claude Code, Cowork, MCP). Best practices for defining a tool, drawn from the best harnesses:

Clear, usage-oriented description: it is what drives the right tool choice, exactly like a skill description.
Strict parameter schema: types and required fields, for valid calls.
Do not expose the tool name to the user in the conversation (the Cursor/Cline rule of the Frankenstein): say "I'm editing your file", not "I'm calling edit_file".
Parallel calls if independent, sequential only when there is a dependency.

Structured output (forcing Claude to call a tool that validates a JSON schema) is the most reliable way to get program-usable data, better than parsing free text.

Key points

Tool use = Claude calls functions you define (the engine of agents)
Clear description + strict parameter schema
Don't name the tool to the user; parallelize if independent
Schema-validated structured output > parsing free text

Composing a system prompt: the Frankenstein case

Pierre's most advanced technique: composing a system prompt by assembling the best rules from several elite system prompts. His "Frankenstein" fuses eight sources (the Fable 5 roleplay, plus the disciplines of Cursor, GPT-5, Perplexity, Lovable, v0, Cline, Devin) and layers his own absolute rules on top, with priority.

It is not fine-tuning, it is pure prompt engineering: you don't change the model's weights, you change its behavior by instruction. Structure of the document, in descending priority:

Identity and absolute user rules (override everything).
Tool-use discipline (don't name tools, read before editing, max 3 attempts).
Anti-hedging: an explicit list of forbidden openings and closings.
Style, code rules, UI/UX directives, search, citation, refusal, error recovery, workspace safety.

Transferable lessons, even without copying his setup:

Put the priority rules at the top and declare it explicitly ("priority over everything else").
Forbidding by list is more effective than asking vaguely: "forbidden openings: Great, Certainly, Sure" beats "be direct".
Encode the scar lessons: every lived incident becomes a rule (nginx 410, PowerShell Unicode, the underscore path).

Key points

Composing a system prompt = assembling the best rules from several sources
Pure prompt engineering, no fine-tuning
Priority rules at the top, declared as priority
Forbid by explicit list > ask vaguely; encode each incident into a rule

Citation and anti-hedging in daily use

Two style disciplines that change the perceived quality, drawn straight from the Frankenstein.

Citation (Perplexity discipline), whenever you do research:

Inline brackets right after the sentence, no space: text.[1][2]
One source per bracket, maximum 3 sources per sentence, the most relevant.
No final "References" section; sources are attached to the claims.

Anti-hedging (GPT-5 + Cline): ban empty openings ("Sure", "Of course") and opt-in closers ("would you like me to..."). At most one clarification question at the start if necessary, never at the end. If the next step is obvious, execute it rather than propose it.

Why it matters: hedging dilutes the signal and slows the reader. An answer that acts (or explains) directly respects the user's time. It is exactly the tone of this Bible, by construction.

Key points

Citation: inline brackets, one source per bracket, max 3, no References section
Anti-hedging: no empty opening or opt-in closer
At most one clarification question, at the start, never at the end
If the next step is obvious, execute it

Prompt chaining patterns

A single prompt has limits. When a task is complex, breaking it into a prompt chain (a sequence of linked prompts where each output feeds the next) produces far better results than cramming everything into one giant instruction. Each link in the chain has a narrow, well-defined job, which makes it easier to spot and fix mistakes.

The core technique is result passing: you take the output of step N and paste it (or inject it programmatically) as context into step N+1. In Claude Code, you can build chains inside scripts, using shell variables or files as the bridge between calls. In a chat session you simply copy the relevant part of the answer into your next message.

Common chain patterns worth knowing:

Extract then transform: first extract raw data or facts, then reformat or analyse them in a second call.
Draft then critique: generate a first draft, then run a separate prompt that reviews it against a checklist and returns improvements.
Decompose then solve: ask Claude to break a problem into sub-tasks, then solve each sub-task individually and assemble the results.
Translate then localise: translate text first, then run a localisation pass that adapts idioms for the target culture.

Keep each prompt in the chain atomic (doing one thing only) and include a brief context header at the top of each subsequent prompt so Claude is not starting cold. A chain of three focused prompts consistently beats one sprawling mega-prompt for accuracy and editability.

Key points

Prompt chain: a sequence of prompts where each output feeds the next step
Result passing: injecting a previous answer as context into the following prompt
Atomic prompt: a prompt with exactly one clearly bounded task
Draft-critique pattern: generate first, then review in a separate call

The tool use loop in depth

When you give Claude a tool (a function it can call, such as a web search or a code executor), the model does not just answer once. It enters a tool use loop: it decides to call a tool, reads the result, then decides what to do next, repeating until it can give a final answer. Each round trip is called a turn.

The sequence inside one loop iteration is always the same:

Tool call: Claude emits a structured request naming the tool and its arguments (for example, {"name": "search", "input": {"query": "Claude Opus 4 release date"}}).
Execution: Your code (or the host environment) runs the tool and returns a tool result block containing the output.
Continuation: Claude reads the result as part of the conversation context and either calls another tool or produces a final text response.

Three things control how the loop behaves. The system prompt tells Claude what tools exist and when to use them. The tool definition (name, description, JSON schema for inputs) shapes whether Claude picks the right tool with the right arguments. The tool result you return must be clear and complete, because Claude cannot ask the tool a follow-up question: it can only call it again with different arguments.

Common failure modes: vague tool descriptions cause Claude to skip the tool or pass wrong arguments; truncated or error-free-looking results (when the real call failed) cause Claude to hallucinate the next step; and loops that never terminate happen when the tool keeps returning ambiguous output. A well-designed tool description is often more important than prompt length.

Key points

Tool use loop: call, execute, read result, repeat or finish
Tool definition quality controls argument accuracy
Tool result clarity prevents hallucinated follow-up steps
The host environment runs the tool, not the model itself

Structured outputs with a schema

A schema is a formal description of the shape you want your data to take: which fields exist, what type each field holds (string, number, boolean, array), and which fields are required. When you attach a schema to a Claude prompt, you are telling the model exactly what JSON (JavaScript Object Notation, a lightweight text format for structured data) to return, and nothing else.

Claude supports schema enforcement in two ways. First, you can describe the schema inside your prompt as a plain JSON object and instruct Claude to follow it. Second, when calling the Anthropic API directly, you can use tool use (also called function calling): you define a tool whose input schema matches the object you want, then instruct Claude to call that tool. The API guarantees the response fits the schema, so you get machine-readable output without parsing free text.

Even with schema enforcement, outputs can still fail validation in edge cases: a required field may be null, a number may arrive as a string, or an enum value may be misspelled. A robust pipeline therefore adds a validation and retry loop: parse the JSON, run a validator (such as a JSON Schema library), and if it fails, send the error message back to Claude in a follow-up turn so it can correct only the broken fields.

Key principles for reliable structured output:

Keep schemas flat and small. Deeply nested schemas increase error rates.
Provide an example object in the prompt alongside the schema. Claude treats the example as a ground truth reference.
For enum fields, list every allowed value explicitly. Claude will not invent values it did not see.
On retry, quote the exact validation error and ask Claude to fix only that field, not rewrite everything.

Key points

A schema defines the exact shape (fields, types, required status) of the JSON you want back.
Tool use (function calling) enforces schema compliance at the API level.
Always validate the output programmatically and retry with the error message if it fails.
Flat schemas with explicit enum values produce the fewest errors.

Writing evals

An eval (short for evaluation) is a small, structured test you run against your prompt to measure whether it reliably produces the output you want. Without evals, you are guessing: a prompt that works on one example might silently break on ten others.

The core idea is to build a test set: a fixed collection of inputs paired with the expected outputs (or a scoring rule). You run every test case through your prompt and track the pass rate. When you revise the prompt, you run the test set again and compare scores. This turns prompt improvement from intuition into measurement.

A minimal eval has three parts:

Cases: 10 to 30 representative inputs covering normal use, edge cases, and likely failure modes.
Expected outputs: either exact strings, keywords that must appear, a rubric (1-5 scale), or a second LLM call acting as a judge.
A runner: a script or spreadsheet that applies the prompt to every case and records pass or fail.

Even a five-row spreadsheet beats zero structure. Start small, add cases each time a real user finds a bug, and never remove a case once it catches a regression (a previously working behavior that breaks after a prompt change).

Key points

Eval: a repeatable test set that scores prompt quality
Test set: fixed inputs paired with expected outputs or scoring rules
Pass rate: fraction of cases where the output meets the criteria
Regression: a behavior that worked before and silently breaks after a prompt change

Meta-prompting

Meta-prompting means using a language model (LLM) to write, critique, or improve a prompt, rather than writing that prompt entirely by hand. The idea is recursive: the model becomes a collaborator in shaping the instructions it will later follow.

This technique is useful when you are stuck on phrasing, when a prompt works but you suspect it could work better, or when you need to generate many prompt variants quickly. The model has seen enormous amounts of text about how models respond, so it can often spot weaknesses you would miss.

A basic meta-prompt has three parts:

Context: tell the model what the downstream task is and who the end-user will be.
The draft prompt: paste the prompt you want improved, or describe the goal if you are starting from scratch.
A specific instruction: ask for a rewrite, a list of weaknesses, alternative phrasings, or a scoring rubric.

You can go further by chaining steps: first ask the model to critique, then ask it to rewrite based on its own critique, then ask it to generate three variations ranked by expected clarity. Each step costs tokens but narrows in on a stronger prompt without you having to guess what is wrong.

Key points

Meta-prompting: a prompt whose job is to improve another prompt.
Include context, the draft, and a specific instruction.
Chain critique then rewrite for sharper results.
Treat the output as a draft, not a final answer, and test it.

Guardrails and validation

A model can produce fluent, confident output that is factually wrong, structurally broken, or unsafe to act on. Guardrails are checks you add between the model's raw output and any action that consumes it. They turn a blind trust in the model into a controlled pipeline.

The simplest guardrail is a format check: verify that the output is the shape you asked for (valid JSON, a specific number of items, no forbidden strings) before passing it downstream. A second layer is semantic validation: ask a second, cheaper model call to judge whether the answer is coherent, on-topic, or safe. This is sometimes called an LLM-as-judge pattern.

In Claude Code (the CLI and IDE coding agent), you can chain validation steps in a shell pipeline or a script. Common approaches include:

Schema assertion: parse JSON output with JSON.parse() and throw if required keys are missing.
Regex fence: reject output that contains patterns like raw API keys or PII (personally identifiable information) before logging or storing it.
Self-critique prompt: send the output back to the model with a prompt such as "List any factual errors or missing steps in the text above." Treat a non-empty list as a failure signal.
Deterministic unit test: when the model writes code, run the test suite and treat a red build as an automatic rejection.

Guardrails add latency and cost, so apply them proportionally. High-stakes actions (sending an email, writing to a database, deploying code) deserve hard checks. Low-stakes actions (drafting a summary for human review) can rely on a lighter touch or none at all.

Key points

Guardrails check model output before it is acted on
Format checks and schema assertions are the first line of defense
LLM-as-judge uses a second model call to validate the first
Apply stricter guardrails to irreversible or high-stakes actions

Self-consistency and voting

Most prompting strategies send one request and trust the first answer. Self-consistency breaks that assumption: you sample the same question several times, let the model reason independently each time, then pick the answer that appears most often. That majority vote is statistically more reliable than any single reply, especially on math, logic, and multi-step reasoning tasks.

The core idea comes from a 2022 paper (Wang et al.) showing that language models do not always land on the same reasoning path twice. Some paths are wrong. If you run the same prompt five times and four runs agree, the odds that all four share the same error are low. Voting (also called majority aggregation) exploits that independence.

When to use it:

Hard math or coding problems where a single chain-of-thought (the step-by-step reasoning trace) can drift.
Classification tasks where you want a confidence signal, not just a label.
Any answer where you suspect the model might be inconsistent across runs.
High-stakes decisions where you can afford a few extra API calls.

Trade-off: cost and latency multiply by the number of samples. Use temperature (the randomness knob, where 0 is deterministic and 1 is creative) above 0 so each sample diverges. A value around 0.7 works well. Then parse the answers programmatically and count the most common one.

Key points

Self-consistency: sample the same prompt N times, take the majority answer
Temperature above 0 is required so each run produces a different reasoning path
Voting filters noise without changing the model or the prompt
Cost scales linearly with sample count, so reserve this for hard or high-stakes questions

Adversarial self-check

An adversarial self-check is a technique where you ask the model to argue against its own answer immediately after it produces one. Instead of treating the first response as final, you prompt the model to act as a critic and find flaws, gaps, or errors in what it just said. This exploits the model's reasoning ability to catch mistakes that a single forward pass often misses.

Why does this work? Language models (LLMs, meaning large language models trained on text) are prone to confirmation bias in generation: once a reasoning chain starts in one direction, each token makes the next token more likely to continue that direction. A separate critic pass resets that momentum and can surface contradictions, missing edge cases, or overconfident claims.

There are two main forms of adversarial self-check:

Inline refutation: add a second instruction in the same prompt, asking the model to follow its answer with a "devil's advocate" section that challenges every major claim.
Separate critic turn: send the model's answer back in a new message with an explicit instruction such as "List every factual error, logical gap, and unsupported assumption in the text above."

After the critic produces objections, you run a synthesis pass: ask the model (or judge yourself) which objections are valid, then request a revised answer that addresses only the valid ones. Three turns total: generate, critique, synthesize.

Key points

Adversarial self-check catches errors that a single response misses
Use an inline devil's advocate section or a separate critic message
Reset confirmation bias by treating the critic as a fresh perspective
Always run a synthesis pass to filter valid objections from noise

Work with me

Need this level of execution on your project?

I am Pierre Bottazzi. I built this entire course solo, end to end: 237 lessons in 3 languages, the app, the design, the SEO, the accounts system. That is what I do for clients too: web apps, mobile apps, AI automation, SEO/GEO. First call is free, no strings attached.

Contact me on LinkedIn See sept-tools.com (industry)See totemsauvage.com (art gallery)

Inspiration

Inspired by 0xloucash

One of my inspirations. Loucash (0xloucash) has a gift for always digging up the sharpest AI tips and tricks, then turning them into setups that actually work. With InstallClaw he configures your own OpenClaw AI agent, at your place, in 48 hours.

His Instagram InstallClaw