The Claude Bible
Level: Beginner · 10 lessons

LLM foundations

Tokens, context, temperature, the Claude family. The bedrock.

What an LLM really is

A large language model (LLM) does not "understand" like a human and does not "look things up" in a database. It does one thing, billions of times: predict the most likely next chunk of text, given everything that came before.

Practical consequences, which explain 90% of the surprises:

Keep the image: a brilliant but amnesiac improviser. Your job is to give it the right set (context) for each scene.

Key points
  • An LLM predicts the next token, nothing else
  • Hallucinating = producing plausible-but-false output, a structural risk
  • No memory between sessions: everything lives in the context
  • Knowledge frozen at a cutoff date

Tokens and the context window

The model does not see letters or words, but tokens: fragments of text. Roughly, 1 token is about 4 characters, or around 0.75 words in English. A word like "interesting" can be 2-3 tokens.

The context window is the maximum number of tokens the model can process at once: your prompt + the history + the files + its answer. Recent Claude models go up to 200K tokens, and some configurations up to 1M tokens (the model writing this runs at 1M).
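A rough budgeting sketch, assuming the 4-characters-per-token rule of thumb and the 200K window quoted above. It is not a real tokenizer, so treat the numbers as estimates. The helper names `estimate_tokens` and `fits_in_window` are hypothetical.

```python
import math

CHARS_PER_TOKEN = 4        # rule of thumb from the lesson, not an exact tokenizer
CONTEXT_WINDOW = 200_000   # the 200K figure quoted above


def estimate_tokens(text: str) -> int:
    """Estimate the token count of a text from its length in characters."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_window(
    prompt: str,
    history: list[str],
    files: list[str],
    max_output_tokens: int,
    window: int = CONTEXT_WINDOW,
) -> bool:
    """Check that prompt + history + files + reserved output stay under the cap."""
    used = estimate_tokens(prompt)
    used += sum(estimate_tokens(message) for message in history)
    used += sum(estimate_tokens(content) for content in files)
    return used + max_output_tokens <= window


print(estimate_tokens("interesting"))                  # 3
print(fits_in_window("a" * 400, [], [], 1_000))        # True
print(fits_in_window("a" * 800_000, [], [], 1))        # False
```

Note that the output is reserved up front: the answer counts against the same window as everything you send.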

Why it is central:

Module 7 is entirely dedicated to mastering context and cost. For now, remember the unit: the token.

Key points
  • Token = fragment of text, ~4 characters
  • Context window = input + history + files + output, capped (200K, sometimes 1M)
  • Everything is billed in tokens, input and output
  • Too much noisy context degrades quality, not just cost

Temperature, and the myth of the magic setting

Temperature tunes the randomness of the prediction. At low values (0 to 0.3), the model almost always picks the most likely token, which gives stable, predictable answers: good for code, extraction, and classification. At high values (0.7 to 1), it produces more diverse output: good for brainstorming and creative work.
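One common way to picture temperature is as a divisor applied to the model's raw scores before they are turned into probabilities. The sketch below uses that textbook formulation with made-up scores. How any given provider implements it internally is an assumption, not something this lesson states.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 here; treat 0 as 'always pick the top score'")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)                              # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]                            # made-up scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2)
warm = softmax_with_temperature(logits, 1.0)
print([round(p, 3) for p in cold])                  # the top candidate takes almost everything
print([round(p, 3) for p in warm])                  # probability is spread more evenly
```

At 0.2 the top candidate gets over 99% of the probability; at 1.0 the other two keep a real chance of being picked.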

Two other settings you will meet in the API:

The classic beginner trap: believing you fix a bad result by fiddling with temperature. 95% of the time the problem is the prompt, not the setting. A clear prompt at temperature 0.3 beats a vague prompt at any temperature. We tune temperature last, not first.

Key points
  • Low temperature = stable/factual; high = creative/diverse
  • max_tokens caps the output (watch for cut-offs)
  • A bad result almost always comes from the prompt, not the temperature

The Claude family: Opus, Sonnet, Haiku

Anthropic ships each generation in three sizes, which trade intelligence for speed and cost:

Model identifiers (useful in the API and in Claude Code) for the current generation:

Pierre's rule, applied across his practice: Opus for architecture, brainstorming and debugging; delegate the repetitive, the multi-language and the auditing to Sonnet or Haiku via sub-agents. More on this in the multi-agent module. On billing, Claude calls are the cheap resource in his setup: only paid external services really count.
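Pierre's rule can be written down as a tiny routing table. This is only a sketch: the task labels, the function name `pick_tier`, and the fallback to the balanced tier are assumptions made for the example, and it returns tier names rather than real model identifiers.

```python
# Task kinds taken from the rule above; the labels themselves are illustrative.
OPUS_TASKS = {"architecture", "brainstorming", "debugging"}
DELEGATED_TASKS = {"repetitive", "multi-language", "auditing"}


def pick_tier(task_kind: str) -> str:
    """Map a kind of task to a model tier, following the rule of thumb above."""
    if task_kind in OPUS_TASKS:
        return "opus"
    if task_kind in DELEGATED_TASKS:
        return "sonnet-or-haiku"   # delegated to a sub-agent
    return "sonnet"                # assumption: default to the balanced tier


print(pick_tier("debugging"))      # opus
print(pick_tier("auditing"))       # sonnet-or-haiku
```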

Key points
  • Opus = power, Sonnet = balance, Haiku = speed/volume
  • Same ids everywhere: claude-opus-4-8, claude-sonnet-4-6, claude-haiku-4-5-...
  • Pierre: Opus for architecture/debug, Sonnet/Haiku for delegated repetitive work

Meaning without a dictionary: embeddings

When a language model reads a word or sentence, it does not look it up in a dictionary. Instead it converts the text into a vector, which is a long list of numbers (often hundreds or thousands of values). That list is called an embedding. Each number captures a tiny facet of meaning, so the whole list together represents what the text "means" to the model.

The key insight is that similar meanings produce similar vectors. In the mathematical space where these vectors live (called embedding space), words and phrases cluster by meaning. "Doctor" and "physician" end up close together. "Dog" and "cat" are nearby each other but far from "invoice". The model never needed a rule saying those words are related; it learned the positions by processing billions of sentences.

This geometry of meaning is part of what lets Claude relate a question to its context, and it is what makes it possible to find relevant passages. In a retrieval system, the question is turned into a vector, and the system looks for content whose vector sits nearby in embedding space. That process is called semantic search (search by meaning, not by exact words).
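A minimal sketch of "close in embedding space", using cosine similarity, one common closeness measure. The three-number vectors are invented by hand for the example. Real embeddings come from a model and have hundreds or thousands of values.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return 1.0 for vectors pointing the same way, near 0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors, NOT real embeddings.
TOY_EMBEDDINGS = {
    "doctor":    [0.90, 0.10, 0.00],
    "physician": [0.85, 0.15, 0.05],
    "dog":       [0.10, 0.90, 0.10],
    "cat":       [0.15, 0.85, 0.10],
    "invoice":   [0.00, 0.10, 0.95],
}


def nearest(query: str, k: int = 2) -> list[str]:
    """Semantic search in miniature: rank the other words by similarity to the query."""
    query_vector = TOY_EMBEDDINGS[query]
    scored = [
        (cosine_similarity(query_vector, vector), word)
        for word, vector in TOY_EMBEDDINGS.items()
        if word != query
    ]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]


print(nearest("doctor", k=1))   # ['physician']
print(nearest("dog", k=1))      # ['cat']
```

No keyword matching happens anywhere: "doctor" and "physician" share almost no letters, yet they come out as neighbours because their vectors point the same way.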

Key points
  • Embeddings convert text into lists of numbers
  • Similar meanings sit close in embedding space
  • Semantic search uses vector distance, not keywords
  • Models learn these positions from data, not rules

Sampling: why the same prompt varies

Every time a language model generates text, it picks tokens (loosely, words) one at a time. After each one it looks at a probability list: thousands of candidate next tokens, each with a score. The way it picks from that list is called decoding, and it is the main reason two identical prompts can produce different answers.

Greedy decoding always picks the single highest-scoring word. It is fast and fully deterministic (meaning the output is always the same), but it tends to produce flat, repetitive text. Sampled decoding introduces randomness: the model draws from the probability list rather than always taking the top item. The degree of randomness is controlled by temperature (covered in an earlier lesson) and by two filters applied before sampling:

  • Top-k: keep only the k most likely candidates at each step.
  • Top-p (nucleus sampling): keep the smallest set of candidates whose probabilities add up to at least p.

In practice, top-p and top-k can be applied together: they narrow the candidate pool, and the model then samples from what is left. Claude's API exposes both parameters. Raising p or k widens the pool and increases variety; lowering them makes the model more predictable. Setting temperature to 0 collapses back to greedy decoding in principle, regardless of top-p or top-k settings, although a hosted API may still show small variations between runs.
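The decoding strategies described in this lesson can be sketched on a made-up probability list of four candidates. The function names are illustrative, and a real model works over a vocabulary of many thousands of tokens, but the filtering logic is the same idea.

```python
import random


def renormalize(probs: dict[str, float]) -> dict[str, float]:
    """Rescale the kept probabilities so they sum to 1 again."""
    total = sum(probs.values())
    return {token: prob / total for token, prob in probs.items()}


def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep only the k most likely candidates."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return renormalize(dict(ranked[:k]))


def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of candidates whose probabilities reach p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept: dict[str, float] = {}
    cumulative = 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return renormalize(kept)


def greedy(probs: dict[str, float]) -> str:
    """Always take the single most likely candidate."""
    return max(probs, key=probs.get)


def sample(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one candidate at random, weighted by probability."""
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up probabilities for the next token.
next_token_probs = {"cat": 0.5, "dog": 0.3, "car": 0.15, "sky": 0.05}

print(greedy(next_token_probs))                      # cat, every time
print(sorted(top_k_filter(next_token_probs, 2)))     # ['cat', 'dog']
print(sorted(top_p_filter(next_token_probs, 0.9)))   # ['car', 'cat', 'dog']
print(sample(top_p_filter(next_token_probs, 0.9), random.Random()))  # varies between runs
```

The last line is the only one whose output changes from run to run, which is exactly the variation you see when you send the same prompt twice.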

Key points
  • Greedy decoding always picks the highest-probability word, giving deterministic output.
  • Top-k limits candidates to the k most likely words at each step.
  • Top-p (nucleus sampling) keeps the smallest set of words covering p of the total probability.
  • Sampled decoding introduces useful variety; temperature 0 removes it.

Three voices: system, user, assistant

Every conversation sent to an LLM (large language model) is made of messages, and each message belongs to one of three roles: system, user, or assistant. Understanding these roles tells you exactly how Claude is instructed, who is speaking, and what Claude is allowed to say.

The system prompt is set by whoever builds the product (a developer, a company, or Claude Code itself). It arrives before the conversation begins and tells Claude how to behave: its persona, its limits, its task. The user never sees it unless the builder chooses to show it.

The user turn is your message: the question, instruction, or file you send. The assistant turn is Claude's reply. These two alternate back and forth to form the conversation history that Claude reads every time it responds.
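Here is how the three roles might look as a request body, assuming the shape of Anthropic's Messages API: a top-level `system` field plus a `messages` list of user and assistant turns. The sketch only builds a dictionary and sends nothing. The helper `build_request` is hypothetical, the model id is the one quoted in this course and is not verified here, and the strict alternation check is a simplification that mirrors the description above.

```python
def build_request(
    system_prompt: str,
    turns: list[tuple[str, str]],
    model: str = "claude-sonnet-4-6",   # id as quoted in this course; check current docs
    max_tokens: int = 1024,
) -> dict:
    """Assemble a request body from a system prompt and (role, text) turns."""
    for index, (role, _text) in enumerate(turns):
        expected = "user" if index % 2 == 0 else "assistant"
        if role != expected:
            raise ValueError(f"turn {index} should be '{expected}', got '{role}'")
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": system_prompt,        # set by the builder, sent before any turn
        "messages": [{"role": role, "content": text} for role, text in turns],
    }


request = build_request(
    "You are a concise support assistant.",
    [
        ("user", "How do I reset my password?"),
        ("assistant", "Open Settings, then choose Reset password."),
        ("user", "And if I no longer have access to my email?"),
    ],
)
print([message["role"] for message in request["messages"]])
# ['user', 'assistant', 'user']
```

The whole list is sent on every call. The model has no other record of the conversation.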

Key points
  • The system prompt is invisible to the user but controls Claude's behavior.
  • User and assistant turns alternate to form the conversation history.
  • Claude reads the full history on every reply, not just the last message.
  • Knowing which role holds which text helps you debug unexpected behavior.

How Claude was trained

Claude starts life like every large language model (LLM): it goes through pretraining, where it reads a massive portion of the internet, books, and code. During this phase the model learns grammar, facts, reasoning patterns, and writing styles purely by predicting the next word, billions of times over. No human guidance yet, just statistics at enormous scale.

Next comes RLHF (Reinforcement Learning from Human Feedback). Human trainers rate pairs of model responses, and those ratings are used to train a separate "preference model." Claude is then fine-tuned to produce outputs that score well on that preference model. This is how raw text prediction becomes a helpful assistant that follows instructions and avoids obvious mistakes.
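To give a feel for how pair ratings can train a preference model, the sketch below uses one common textbook formulation (a Bradley-Terry style pairwise loss). This lesson does not say which formulation Anthropic uses, so treat it purely as an illustration. The scores are made up.

```python
import math


def preference_probability(score_a: float, score_b: float) -> float:
    """Probability that response A is preferred over B, given the model's scores."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Low when the response the human chose already has the higher score."""
    return -math.log(preference_probability(score_chosen, score_rejected))


print(round(preference_probability(1.0, 1.0), 2))   # 0.5: equal scores, no preference
print(round(pairwise_loss(2.0, 0.0), 3))            # small loss: model agrees with the rater
print(round(pairwise_loss(0.0, 2.0), 3))            # large loss: model disagrees with the rater
```

Training pushes the scores in the direction that lowers this loss across many rated pairs.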

Anthropic adds a third layer called Constitutional AI (CAI). Instead of relying only on human raters, CAI gives the model a written set of principles (a "constitution") and has the model critique and revise its own answers against those principles. This makes the alignment process more scalable and more transparent, because the rules are explicit rather than buried in rater intuitions.

These three phases shape everything you experience when talking to Claude:

Key points
  • Pretraining: learning language from raw text at scale
  • RLHF: shaping behavior with human preference ratings
  • Constitutional AI: self-critique against written principles
  • Training phases determine knowledge, helpfulness, and safety limits

Attention and why position matters

Every modern LLM (large language model) is built on a mechanism called attention. When the model reads your prompt, it does not treat every word equally. Instead it scores each word (or token) against every other word and decides which ones are most relevant to each step of the answer. Think of it as the model asking: "to write this next word, which earlier words should I lean on most?"
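The scoring step can be sketched with the standard scaled dot-product formula, stripped down to a single query and three earlier tokens. The two-number vectors are invented for the example. Real models use learned vectors, many attention heads, and many layers.

```python
import math


def attention_weights(query: list[float], keys: list[list[float]]) -> list[float]:
    """Score one query against every key, then normalize the scores to sum to 1."""
    dimension = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dimension)
        for key in keys
    ]
    peak = max(scores)                               # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up vectors: the first key points the same way as the query.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
weights = attention_weights(query, keys)
print([round(w, 3) for w in weights])   # the first token gets the largest share
```

The weights answer the question in the text: for this step, lean most on the first earlier token, somewhat on the third, least on the second.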

Because attention scores are computed across the entire context window (the total text the model can see at once), the model can in theory connect any two pieces of information, no matter how far apart. In practice, though, researchers have observed a pattern called lost-in-the-middle: models tend to recall information placed at the very beginning or at the very end of a long prompt far better than information buried in the middle.

This has a direct, practical consequence for how you structure prompts and documents you pass to Claude:

The same principle applies when you feed Claude a long document and ask a question about it. Place your question before the document, restate it briefly after, and highlight the relevant section with a label. That sandwich structure fights the lost-in-the-middle effect and consistently produces better answers.
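The sandwich structure is easy to automate. A minimal sketch follows. The function name, the tag-style label, and the exact wording of the reminder are illustrative choices, not a required format.

```python
def sandwich_prompt(question: str, document: str, label: str = "DOCUMENT") -> str:
    """Put the question before the document and restate it after."""
    return (
        f"Question: {question}\n\n"
        f"<{label}>\n{document}\n</{label}>\n\n"
        f"Reminder of the question: {question}"
    )


prompt = sandwich_prompt(
    "What is the notice period?",
    "(long contract text goes here)",
    label="CONTRACT",
)
print(prompt)
```

The label gives the model a clear boundary between your instructions and the material it is supposed to read.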

Key points
  • Attention weights every token against every other token to decide relevance
  • Lost-in-the-middle: information buried in a long prompt is recalled least reliably
  • Place tasks early, critical constraints late, and use structure to signal importance
  • Restating a question before and after a long document improves recall

The knowledge cutoff and grounding

Every large language model (LLM) is trained on a snapshot of text gathered up to a specific date, called the knowledge cutoff. After that date, the model has no awareness of new events, updated prices, revised laws, or anything else that changed. Claude's knowledge cutoff is August 2025, so it cannot answer reliably about things that happened after that point.

This creates a practical problem: the world keeps moving while the model stays frozen. A question about current stock prices, the latest software release, or a recent political event will likely produce an outdated or simply wrong answer, even from a capable model. The model does not know what it does not know, so it may answer with false confidence.

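Grounding is the technique used to fix this. It means giving the model access to fresh, reliable information at the moment it answers, rather than relying only on what it memorized during training. The two most common grounding methods are:

  • Web search integration: live search results are injected into the model's context before it answers.
  • Pasting or attaching text: you supply the relevant document yourself, the simplest form of manual grounding.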

Grounding does not make the model infallible, but it shifts the bottleneck from frozen training data to the quality of the sources you provide. Always cite or check those sources independently for anything that matters.
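Manual grounding comes down to putting the sources in the prompt and telling the model to stick to them. A minimal sketch, with a hypothetical helper name and invented sample data:

```python
def grounded_prompt(question: str, sources: list[tuple[str, str]]) -> str:
    """Build a prompt that asks the model to answer only from the given sources."""
    if not sources:
        raise ValueError("grounding needs at least one source")
    blocks = [
        f"[{number}] {title}\n{text}"
        for number, (title, text) in enumerate(sources, start=1)
    ]
    return (
        "Answer using only the sources below and cite them by number. "
        "If the sources do not contain the answer, say so.\n\n"
        + "\n\n".join(blocks)
        + f"\n\nQuestion: {question}"
    )


# Invented sample data, for illustration only.
prompt = grounded_prompt(
    "How much does Plan A cost today?",
    [("Pricing page, copied this morning", "Plan A costs 10 EUR per month.")],
)
print(prompt)
```

Numbering the sources makes the answer checkable: you can follow each citation back to the text you supplied.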

Key points
  • Knowledge cutoff: the date beyond which a model has no training data
  • Grounding: supplying current sources so the model reasons over fresh facts
  • Web search integration injects live results into the model context
  • Pasting or attaching text is the simplest form of manual grounding