Home / LLM foundations

Level: Beginner · 11 lessons

LLM foundations

Tokens, context, temperature, the Claude family. The bedrock.

Open the interactive course237 lessons, quizzes, exercises, a final exam with a diploma, 3 languages, free.

What an LLM really is

A large language model (LLM) does not "understand" like a human and does not "look things up" in a database. It does one thing, billions of times: predict the next chunk of text most likely given everything that came before.

Practical consequences, which explain 90% of the surprises:

It is excellent at patterns (style, structure, idiomatic code) because those are regularities of language.
It can invent with confidence (hallucinate): a false but plausible text is still statistically likely. Hence the rule later: give it the sources, do not trust its memory for facts.
It has no memory between conversations. Everything it "knows" about you lives in the context handed back to it on every call.
Its knowledge stops at a training cutoff date. For current events, you must give it a web search.

Keep the image: a brilliant but amnesiac improviser. Your job is to give it the right set (context) for each scene.

Key points

An LLM predicts the next token, nothing else
Hallucinating = producing plausible-but-false output, a structural risk
No memory between sessions: everything lives in the context
Knowledge frozen at a cutoff date

Tokens and the context window

The model does not see letters or words, but tokens: fragments of text. Roughly, 1 token is about 4 characters, around 0.75 word in English. "interesting" can be 2-3 tokens.

The context window is the maximum number of tokens the model can process at once: your prompt + the history + the files + its answer. Recent Claude models go up to 200K tokens, and some configurations up to 1M tokens (the model writing this runs at 1M).

Why it is central:

Everything is paid in tokens (input + output). More context = more expensive and slower.
When the window fills up, you must summarize or clean (we will see /compact and /clear in Claude Code).
A context overloaded with noise degrades quality: useful signal drowns. "More context" is not "better".

Module 7 is entirely dedicated to mastering context and cost. For now, remember the unit: the token.

Key points

Token = fragment of text, ~4 characters
Context window = input + history + files + output, capped (200K, sometimes 1M)
Everything is billed in tokens, input and output
Too much noisy context degrades quality, not just cost

Temperature, and the myth of the magic setting

Temperature tunes the randomness of the prediction. Low (0 to 0.3): the model almost always picks the most likely token, stable and predictable answers, good for code, extraction, classification. High (0.7 to 1): more diversity, good for brainstorming and creativity.

Two other settings you will meet in the API:

max_tokens: the maximum length of the answer. Too low = cut-off response.
top_p: an alternative to temperature (nucleus sampling). Usually you touch one or the other, not both.

The classic beginner trap: believing you fix a bad result by fiddling with temperature. 95% of the time the problem is the prompt, not the setting. A clear prompt at temperature 0.3 beats a vague prompt at any temperature. We tune temperature last, not first.

Key points

Low temperature = stable/factual; high = creative/diverse
max_tokens caps the output (watch for cut-offs)
A bad result almost always comes from the prompt, not the temperature

The Claude family: Opus, Sonnet, Haiku

Anthropic ships each generation in three sizes, which trade intelligence for speed and cost:

Opus: the most capable. Architecture, hard reasoning, gnarly debugging, brainstorming. The slowest and most expensive.
Sonnet: the balance. The daily workhorse, very good quality/cost ratio.
Haiku: the fastest and cheapest. Repetitive tasks, classification, volume, multi-language.

Model identifiers (useful in the API and in Claude Code) for the current generation:

Opus 4.8: claude-opus-4-8
Sonnet 4.6: claude-sonnet-4-6
Haiku 4.5: claude-haiku-4-5-20251001

Pierre's rule, applied across his practice: Opus for architecture, brainstorming and debugging; delegate the repetitive, the multi-language and the auditing to Sonnet or Haiku via sub-agents. More on this in the multi-agent module. On billing, Claude calls are the cheap resource in his setup: only paid external services really count.

Key points

Opus = power, Sonnet = balance, Haiku = speed/volume
Same ids everywhere: claude-opus-4-8, claude-sonnet-4-6, claude-haiku-4-5-...
Pierre: Opus for architecture/debug, Sonnet/Haiku for delegated repetitive work

Meaning without a dictionary: embeddings

When a language model reads a word or sentence, it does not look it up in a dictionary. Instead it converts the text into a vector, which is a long list of numbers (often hundreds or thousands of values). That list is called an embedding. Each number captures a tiny facet of meaning, so the whole list together represents what the text "means" to the model.

The key insight is that similar meanings produce similar vectors. In the mathematical space where these vectors live (called embedding space), words and phrases cluster by meaning. "Doctor" and "physician" end up close together. "Dog" and "cat" are nearby each other but far from "invoice". The model never needed a rule saying those words are related; it learned the positions by processing billions of sentences.

This geometry of meaning is what lets Claude answer questions, find relevant passages, and understand context. When you ask a question, the question is turned into a vector, and the model finds content whose vector sits nearby in embedding space. That process is called semantic search (search by meaning, not by exact words).

Vector: a list of numbers that encodes a concept.
Embedding: the specific vector a model assigns to a piece of text.
Embedding space: the multi-dimensional map where all those vectors live.
Semantic search: finding text by meaning-distance rather than keyword matching.

Key points

Embeddings convert text into lists of numbers
Similar meanings sit close in embedding space
Semantic search uses vector distance, not keywords
Models learn these positions from data, not rules

Sampling: why the same prompt varies

Every time a language model generates text, it picks words one at a time. After each word it looks at a probability list: thousands of candidate next words, each with a score. The way it picks from that list is called decoding, and it is the main reason two identical prompts can produce different answers.

Greedy decoding always picks the single highest-scoring word. It is fast and fully deterministic (meaning the output is always the same), but it tends to produce flat, repetitive text. Sampled decoding introduces randomness: the model draws from the probability list rather than always taking the top item. The degree of randomness is controlled by temperature (covered in the next lesson) and by two filters applied before sampling:

Top-k filtering: keep only the k highest-scoring candidates and discard the rest. If k is 40, only the 40 most likely words are eligible at each step.
Top-p filtering (nucleus sampling): keep the smallest set of candidates whose combined probability adds up to p. If p is 0.9, words that together account for 90 percent of the probability mass are kept; the long tail of unlikely words is cut off. This adapts dynamically: when the model is very confident, fewer words make the cut.

In practice, top-p and top-k are often applied together before temperature-based sampling. Claude's API exposes both parameters. Raising p or k widens the pool and increases variety; lowering them makes the model more predictable. Setting temperature to 0 collapses back to greedy decoding regardless of top-p or top-k settings.

Key points

Greedy decoding always picks the highest-probability word, giving deterministic output.
Top-k limits candidates to the k most likely words at each step.
Top-p (nucleus sampling) keeps the smallest set of words covering p of the total probability.
Sampled decoding introduces useful variety; temperature 0 removes it.

Three voices: system, user, assistant

Every conversation sent to an LLM (large language model) is made of messages, and each message belongs to one of three roles: system, user, or assistant. Understanding these roles tells you exactly how Claude is instructed, who is speaking, and what Claude is allowed to say.

The system prompt is set by whoever builds the product (a developer, a company, or Claude Code itself). It arrives before the conversation begins and tells Claude how to behave: its persona, its limits, its task. The user never sees it unless the builder chooses to show it.

The user turn is your message: the question, instruction, or file you send. The assistant turn is Claude's reply. These two alternate back and forth to form the conversation history that Claude reads every time it responds.

system: invisible instructions from the builder, sets the rules and persona.
user: your input, the prompt you type or the file you attach.
assistant: Claude's reply, generated from everything above it in the thread.

Key points

The system prompt is invisible to the user but controls Claude's behavior.
User and assistant turns alternate to form the conversation history.
Claude reads the full history on every reply, not just the last message.
Knowing which role holds which text helps you debug unexpected behavior.

How Claude was trained

Claude starts life like every large language model (LLM): it goes through pretraining, where it reads a massive portion of the internet, books, and code. During this phase the model learns grammar, facts, reasoning patterns, and writing styles purely by predicting the next word, billions of times over. No human guidance yet, just statistics at enormous scale.

Next comes RLHF (Reinforcement Learning from Human Feedback). Human trainers rate pairs of model responses, and those ratings are used to train a separate "preference model." Claude is then fine-tuned to produce outputs that score well on that preference model. This is how raw text prediction becomes a helpful assistant that follows instructions and avoids obvious mistakes.

Anthropic adds a third layer called Constitutional AI (CAI). Instead of relying only on human raters, CAI gives the model a written set of principles (a "constitution") and has the model critique and revise its own answers against those principles. This makes the alignment process more scalable and more transparent, because the rules are explicit rather than buried in rater intuitions.

These three phases shape everything you experience when talking to Claude:

Pretraining determines what Claude knows and how it reasons.
RLHF determines how helpful and instruction-following it is.
Constitutional AI determines its safety boundaries and consistent values.
All three together explain why Claude can write code fluently but will decline certain requests without being told to by the user.

Key points

Pretraining: learning language from raw text at scale
RLHF: shaping behavior with human preference ratings
Constitutional AI: self-critique against written principles
Training phases determine knowledge, helpfulness, and safety limits

Attention and why position matters

Every modern LLM (large language model) is built on a mechanism called attention. When the model reads your prompt, it does not treat every word equally. Instead it scores each word (or token) against every other word and decides which ones are most relevant to each step of the answer. Think of it as the model asking: "to write this next word, which earlier words should I lean on most?"

Because attention scores are computed across the entire context window (the total text the model can see at once), the model can in theory connect any two pieces of information, no matter how far apart. In practice, though, researchers have observed a pattern called lost-in-the-middle: models tend to recall information placed at the very beginning or at the very end of a long prompt far better than information buried in the middle.

This has a direct, practical consequence for how you structure prompts and documents you pass to Claude:

Put the task or question first (or at least very early). The model anchors attention on the opening tokens.
Put critical facts or constraints near the end, just before you expect the answer to begin. End-of-prompt content is retrieved reliably.
Avoid burying key rules in the middle of a long block of background text. Those rules are most likely to be ignored or forgotten.
Use structure (headers, bullet lists, explicit labels like "IMPORTANT:") to boost attention on critical passages wherever they live.

The same principle applies when you feed Claude a long document and ask a question about it. Place your question before the document, restate it briefly after, and highlight the relevant section with a label. That sandwich structure fights the lost-in-the-middle effect and consistently produces better answers.

Key points

Attention weights every token against every other token to decide relevance
Lost-in-the-middle: information buried in a long prompt is recalled least reliably
Place tasks early, critical constraints late, and use structure to signal importance
Restating a question before and after a long document improves recall

The knowledge cutoff and grounding

Every large language model (LLM) is trained on a snapshot of text gathered up to a specific date, called the knowledge cutoff. After that date, the model has no awareness of new events, updated prices, revised laws, or anything else that changed. Claude's knowledge cutoff is August 2025, so it cannot answer reliably about things that happened after that point.

This creates a practical problem: the world keeps moving while the model stays frozen. A question about current stock prices, the latest software release, or a recent political event will likely produce an outdated or simply wrong answer, even from a capable model. The model does not know what it does not know, so it may answer with false confidence.

Grounding is the technique used to fix this. It means giving the model access to fresh, reliable information at the moment it answers, rather than relying only on what it memorized during training. The two most common grounding methods are:

Web search integration: the system retrieves live search results and injects them into the model's context before it replies. Claude.ai can do this with its built-in search toggle.
Supplied sources: you paste or attach the relevant text yourself (a document, a webpage excerpt, a data file). The model reasons over what you gave it, not its stale memory.

Grounding does not make the model infallible, but it shifts the bottleneck from frozen training data to the quality of the sources you provide. Always cite or check those sources independently for anything that matters.

Key points

Knowledge cutoff: the date beyond which a model has no training data
Grounding: supplying current sources so the model reasons over fresh facts
Web search integration injects live results into the model context
Pasting or attaching text is the simplest form of manual grounding

The Claude 5 era: Fable and Mythos

On June 9, 2026, Anthropic launched a new model family called Claude 5, introducing a tier above the familiar Opus/Sonnet/Haiku stack. A model tier is a naming band Anthropic uses to signal relative capability and price (Haiku is the fastest and cheapest tier, Sonnet the balanced middle, Opus the previous top). The new tier is called Mythos-class, and it sits above Opus. Two models share this same underlying Mythos-class model: Fable 5 (API id claude-fable-5) and Mythos 5 (API id claude-mythos-5). They have identical capabilities, pricing, and API behavior. The only difference is who can access them and what safety checks run on each.

Fable 5 is the generally available (GA) version, meaning any paying customer can call it through the API or use it inside a Claude app. Mythos 5 is invitation-only, reserved for organizations approved under a program called Project Glasswing. Approved categories include cyberdefenders (security teams protecting infrastructure), infrastructure providers, and organizations with what Anthropic calls "bio trusted-access" (vetted access to biological-research-adjacent capability). For an everyday user or developer, this distinction matters in one practical way: on a paid plan you get Fable 5, and Mythos 5 is simply not available to you unless your employer has been individually approved into Project Glasswing.

The difference between the two models is a set of three classifier-based safety safeguards built into Fable 5. A classifier here is a smaller, automated system that scans a request and flags it if it matches a risky pattern, before or during the model's response. Fable 5's three safeguards target: offensive cyber capability (helping build attack tools), dangerous bio/chem content (helping synthesize weapons-relevant material), and distillation prevention (stopping someone from systematically extracting Fable 5's own reasoning patterns to train a rival model cheaply). Anthropic reports these safeguards trigger in under 5% of sessions, so the overwhelming majority of everyday coding, writing, and analysis work is unaffected. Mythos 5 runs without these dual-use classifiers, which is exactly why it is restricted to vetted organizations rather than opened to everyone: removing the safeguards is only acceptable when the requester's trustworthiness has already been established.

A notable design choice: when one of Fable 5's three safeguards fires, the request does not receive a flat refusal. Instead it falls back to Opus 4.8, Anthropic's next-tier model, which answers the request under its own (less restrictive) safety profile. This means a legitimate security researcher asking a borderline cybersecurity question is more likely to get a useful answer from Opus 4.8 than to hit a dead end. Before release, Anthropic commissioned over 1,000 hours of external red-teaming (independent security researchers professionally trying to break the model's safety measures) and reported no universal jailbreak was found, meaning no single trick reliably bypassed all the safeguards at once.

The launch was not without drama. On June 12, 2026, just three days after release, the United States government applied export controls to the new model tier: legal restrictions on which countries or entities are allowed to access certain advanced technology. Anthropic could not verify, in real time, the nationality of every user making a request through the API. Rather than risk violating the controls, Anthropic suspended both Fable 5 and Mythos 5 for everyone globally, not just users in restricted regions. The controls were lifted on June 30, 2026, and Anthropic redeployed Fable 5 worldwide on July 1, 2026. The redeployed version shipped with an additional anti-jailbreak classifier that Anthropic says blocks a previously known bypass technique in over 99% of cases, an improvement made during the three-week suspension window.

On the API side, Fable 5 costs $10 per million input tokens and $50 per million output tokens, both above Opus 4.8's $5/$25 pricing, reflecting its higher-tier status. It offers a 1 million token context window (the amount of text it can consider at once) and up to 128,000 tokens of maximum output per response. A key technical detail: Fable 5 always runs with adaptive thinking, an internal reasoning mode where the model decides for itself how much to deliberate before answering, and this cannot be turned off through the API, only tuned in depth via an "effort" setting.

For a beginner, the practical takeaway is simple: as of July 2026, if you are a paying Claude user, Fable 5 is the most capable model you can reach, and Mythos 5 exists as an equally powerful sibling that most people will never touch because it requires organizational approval, not personal skill or payment tier. The state of the art, for you, is Fable 5.

Key points

Fable 5 (claude-fable-5) and Mythos 5 (claude-mythos-5) are the same underlying Mythos-class model; Fable 5 is GA, Mythos 5 is invite-only via Project Glasswing.
Fable 5's three safety classifiers (cyber, bio/chem, distillation) fire in under 5% of sessions and fall back to Opus 4.8 instead of a flat refusal.
US export controls forced a global suspension of both models on June 12, 2026; Fable 5 returned worldwide July 1, 2026 with a stronger anti-jailbreak classifier.
Fable 5 pricing: $10/$50 per million input/output tokens, 1M context window, 128K max output, adaptive thinking always on.

Work with me

Need this level of execution on your project?

I am Pierre Bottazzi. I built this entire course solo, end to end: 237 lessons in 3 languages, the app, the design, the SEO, the accounts system. That is what I do for clients too: web apps, mobile apps, AI automation, SEO/GEO. First call is free, no strings attached.

Contact me on LinkedIn See sept-tools.com (industry)See totemsauvage.com (art gallery)

Inspiration

Inspired by 0xloucash

One of my inspirations. Loucash (0xloucash) has a gift for always digging up the sharpest AI tips and tricks, then turning them into setups that actually work. With InstallClaw he configures your own OpenClaw AI agent, at your place, in 48 hours.

His Instagram InstallClaw