A large language model (LLM) does not "understand" like a human and does not "look things up" in a database. It does one thing, billions of times: predict the next chunk of text most likely given everything that came before.
Practical consequences, which explain 90% of the surprises:
It is excellent at patterns (style, structure, idiomatic code) because those are regularities of language.
It can invent with confidence (hallucinate): a false but plausible text is still statistically likely. Hence the rule later: give it the sources, do not trust its memory for facts.
It has no memory between conversations. Everything it "knows" about you lives in the context handed back to it on every call.
Its knowledge stops at a training cutoff date. For current events, you must give it a web search.
Keep the image: a brilliant but amnesiac improviser. Your job is to give it the right set (context) for each scene.
Key points
An LLM predicts the next token, nothing else
Hallucinating = producing plausible-but-false output, a structural risk
No memory between sessions: everything lives in the context
Knowledge frozen at a cutoff date
Tokens and the context window
The model does not see letters or words, but tokens: fragments of text. Roughly, 1 token is about 4 characters, around 0.75 word in English. "interesting" can be 2-3 tokens.
The context window is the maximum number of tokens the model can process at once: your prompt + the history + the files + its answer. Recent Claude models go up to 200K tokens, and some configurations up to 1M tokens (the model writing this runs at 1M).
Why it is central:
Everything is paid in tokens (input + output). More context = more expensive and slower.
When the window fills up, you must summarize or clean (we will see /compact and /clear in Claude Code).
A context overloaded with noise degrades quality: useful signal drowns. "More context" is not "better".
Module 7 is entirely dedicated to mastering context and cost. For now, remember the unit: the token.
Key points
Token = fragment of text, ~4 characters
Context window = input + history + files + output, capped (200K, sometimes 1M)
Everything is billed in tokens, input and output
Too much noisy context degrades quality, not just cost
Temperature, and the myth of the magic setting
Temperature tunes the randomness of the prediction. Low (0 to 0.3): the model almost always picks the most likely token, stable and predictable answers, good for code, extraction, classification. High (0.7 to 1): more diversity, good for brainstorming and creativity.
Two other settings you will meet in the API:
max_tokens: the maximum length of the answer. Too low = cut-off response.
top_p: an alternative to temperature (nucleus sampling). Usually you touch one or the other, not both.
The classic beginner trap: believing you fix a bad result by fiddling with temperature. 95% of the time the problem is the prompt, not the setting. A clear prompt at temperature 0.3 beats a vague prompt at any temperature. We tune temperature last, not first.
Key points
Low temperature = stable/factual; high = creative/diverse
max_tokens caps the output (watch for cut-offs)
A bad result almost always comes from the prompt, not the temperature
The Claude family: Opus, Sonnet, Haiku
Anthropic ships each generation in three sizes, which trade intelligence for speed and cost:
Opus: the most capable. Architecture, hard reasoning, gnarly debugging, brainstorming. The slowest and most expensive.
Sonnet: the balance. The daily workhorse, very good quality/cost ratio.
Haiku: the fastest and cheapest. Repetitive tasks, classification, volume, multi-language.
Model identifiers (useful in the API and in Claude Code) for the current generation:
Opus 4.8: claude-opus-4-8
Sonnet 4.6: claude-sonnet-4-6
Haiku 4.5: claude-haiku-4-5-20251001
Pierre's rule, applied across his practice: Opus for architecture, brainstorming and debugging; delegate the repetitive, the multi-language and the auditing to Sonnet or Haiku via sub-agents. More on this in the multi-agent module. On billing, Claude calls are the cheap resource in his setup: only paid external services really count.
Key points
Opus = power, Sonnet = balance, Haiku = speed/volume
Same ids everywhere: claude-opus-4-8, claude-sonnet-4-6, claude-haiku-4-5-...
Pierre: Opus for architecture/debug, Sonnet/Haiku for delegated repetitive work
Meaning without a dictionary: embeddings
When a language model reads a word or sentence, it does not look it up in a dictionary. Instead it converts the text into a vector, which is a long list of numbers (often hundreds or thousands of values). That list is called an embedding. Each number captures a tiny facet of meaning, so the whole list together represents what the text "means" to the model.
The key insight is that similar meanings produce similar vectors. In the mathematical space where these vectors live (called embedding space), words and phrases cluster by meaning. "Doctor" and "physician" end up close together. "Dog" and "cat" are nearby each other but far from "invoice". The model never needed a rule saying those words are related; it learned the positions by processing billions of sentences.
This geometry of meaning is what lets Claude answer questions, find relevant passages, and understand context. When you ask a question, the question is turned into a vector, and the model finds content whose vector sits nearby in embedding space. That process is called semantic search (search by meaning, not by exact words).
Vector: a list of numbers that encodes a concept.
Embedding: the specific vector a model assigns to a piece of text.
Embedding space: the multi-dimensional map where all those vectors live.
Semantic search: finding text by meaning-distance rather than keyword matching.
Key points
Embeddings convert text into lists of numbers
Similar meanings sit close in embedding space
Semantic search uses vector distance, not keywords
Models learn these positions from data, not rules
Sampling: why the same prompt varies
Every time a language model generates text, it picks words one at a time. After each word it looks at a probability list: thousands of candidate next words, each with a score. The way it picks from that list is called decoding, and it is the main reason two identical prompts can produce different answers.
Greedy decoding always picks the single highest-scoring word. It is fast and fully deterministic (meaning the output is always the same), but it tends to produce flat, repetitive text. Sampled decoding introduces randomness: the model draws from the probability list rather than always taking the top item. The degree of randomness is controlled by temperature (covered in the next lesson) and by two filters applied before sampling:
Top-k filtering: keep only the k highest-scoring candidates and discard the rest. If k is 40, only the 40 most likely words are eligible at each step.
Top-p filtering (nucleus sampling): keep the smallest set of candidates whose combined probability adds up to p. If p is 0.9, words that together account for 90 percent of the probability mass are kept; the long tail of unlikely words is cut off. This adapts dynamically: when the model is very confident, fewer words make the cut.
In practice, top-p and top-k are often applied together before temperature-based sampling. Claude's API exposes both parameters. Raising p or k widens the pool and increases variety; lowering them makes the model more predictable. Setting temperature to 0 collapses back to greedy decoding regardless of top-p or top-k settings.
Key points
Greedy decoding always picks the highest-probability word, giving deterministic output.
Top-k limits candidates to the k most likely words at each step.
Top-p (nucleus sampling) keeps the smallest set of words covering p of the total probability.
Sampled decoding introduces useful variety; temperature 0 removes it.
Three voices: system, user, assistant
Every conversation sent to an LLM (large language model) is made of messages, and each message belongs to one of three roles: system, user, or assistant. Understanding these roles tells you exactly how Claude is instructed, who is speaking, and what Claude is allowed to say.
The system prompt is set by whoever builds the product (a developer, a company, or Claude Code itself). It arrives before the conversation begins and tells Claude how to behave: its persona, its limits, its task. The user never sees it unless the builder chooses to show it.
The user turn is your message: the question, instruction, or file you send. The assistant turn is Claude's reply. These two alternate back and forth to form the conversation history that Claude reads every time it responds.
system: invisible instructions from the builder, sets the rules and persona.
user: your input, the prompt you type or the file you attach.
assistant: Claude's reply, generated from everything above it in the thread.
Key points
The system prompt is invisible to the user but controls Claude's behavior.
User and assistant turns alternate to form the conversation history.
Claude reads the full history on every reply, not just the last message.
Knowing which role holds which text helps you debug unexpected behavior.
How Claude was trained
Claude starts life like every large language model (LLM): it goes through pretraining, where it reads a massive portion of the internet, books, and code. During this phase the model learns grammar, facts, reasoning patterns, and writing styles purely by predicting the next word, billions of times over. No human guidance yet, just statistics at enormous scale.
Next comes RLHF (Reinforcement Learning from Human Feedback). Human trainers rate pairs of model responses, and those ratings are used to train a separate "preference model." Claude is then fine-tuned to produce outputs that score well on that preference model. This is how raw text prediction becomes a helpful assistant that follows instructions and avoids obvious mistakes.
Anthropic adds a third layer called Constitutional AI (CAI). Instead of relying only on human raters, CAI gives the model a written set of principles (a "constitution") and has the model critique and revise its own answers against those principles. This makes the alignment process more scalable and more transparent, because the rules are explicit rather than buried in rater intuitions.
These three phases shape everything you experience when talking to Claude:
Pretraining determines what Claude knows and how it reasons.
RLHF determines how helpful and instruction-following it is.
Constitutional AI determines its safety boundaries and consistent values.
All three together explain why Claude can write code fluently but will decline certain requests without being told to by the user.
Key points
Pretraining: learning language from raw text at scale
RLHF: shaping behavior with human preference ratings
Constitutional AI: self-critique against written principles
Training phases determine knowledge, helpfulness, and safety limits
Attention and why position matters
Every modern LLM (large language model) is built on a mechanism called attention. When the model reads your prompt, it does not treat every word equally. Instead it scores each word (or token) against every other word and decides which ones are most relevant to each step of the answer. Think of it as the model asking: "to write this next word, which earlier words should I lean on most?"
Because attention scores are computed across the entire context window (the total text the model can see at once), the model can in theory connect any two pieces of information, no matter how far apart. In practice, though, researchers have observed a pattern called lost-in-the-middle: models tend to recall information placed at the very beginning or at the very end of a long prompt far better than information buried in the middle.
This has a direct, practical consequence for how you structure prompts and documents you pass to Claude:
Put the task or question first (or at least very early). The model anchors attention on the opening tokens.
Put critical facts or constraints near the end, just before you expect the answer to begin. End-of-prompt content is retrieved reliably.
Avoid burying key rules in the middle of a long block of background text. Those rules are most likely to be ignored or forgotten.
Use structure (headers, bullet lists, explicit labels like "IMPORTANT:") to boost attention on critical passages wherever they live.
The same principle applies when you feed Claude a long document and ask a question about it. Place your question before the document, restate it briefly after, and highlight the relevant section with a label. That sandwich structure fights the lost-in-the-middle effect and consistently produces better answers.
Key points
Attention weights every token against every other token to decide relevance
Lost-in-the-middle: information buried in a long prompt is recalled least reliably
Place tasks early, critical constraints late, and use structure to signal importance
Restating a question before and after a long document improves recall
The knowledge cutoff and grounding
Every large language model (LLM) is trained on a snapshot of text gathered up to a specific date, called the knowledge cutoff. After that date, the model has no awareness of new events, updated prices, revised laws, or anything else that changed. Claude's knowledge cutoff is August 2025, so it cannot answer reliably about things that happened after that point.
This creates a practical problem: the world keeps moving while the model stays frozen. A question about current stock prices, the latest software release, or a recent political event will likely produce an outdated or simply wrong answer, even from a capable model. The model does not know what it does not know, so it may answer with false confidence.
Grounding is the technique used to fix this. It means giving the model access to fresh, reliable information at the moment it answers, rather than relying only on what it memorized during training. The two most common grounding methods are:
Web search integration: the system retrieves live search results and injects them into the model's context before it replies. Claude.ai can do this with its built-in search toggle.
Supplied sources: you paste or attach the relevant text yourself (a document, a webpage excerpt, a data file). The model reasons over what you gave it, not its stale memory.
Grounding does not make the model infallible, but it shifts the bottleneck from frozen training data to the quality of the sources you provide. Always cite or check those sources independently for anything that matters.
Key points
Knowledge cutoff: the date beyond which a model has no training data
Grounding: supplying current sources so the model reasons over fresh facts
Web search integration injects live results into the model context
Pasting or attaching text is the simplest form of manual grounding
Work with me
Master Claude, Claude Code and LLMs, from your first prompt to multi-agent orchestration.
Like this course? I built it end to end. Need a web app, mobile app, AI automation or SEO/GEO? Let us talk.