How Large Language Models (LLMs) Work

Table of Contents

What an LLM Actually Is
Tokens & Tokenization
Training on Text
What "Parameters" Mean
The Transformer & Attention
Next-Token Prediction
Context Window & Temperature
Limitations & Hallucinations

What an LLM Actually Is

A large language model is a program that has been trained to predict the next piece of text given the text that came before it. That is the entire core idea. Everything else — answering questions, writing essays, generating code — is built on top of that single skill, repeated over and over.

It is called large because it contains billions of internal numbers (called parameters) and was trained on enormous amounts of text. It is called a language model because its job is to model the patterns of language: which words tend to follow which, given the surrounding context.

Key mental model: An LLM does not look anything up in a database when it answers. It generates text one chunk at a time by predicting what is statistically most likely to come next, based on patterns it learned during training.

Tokens & Tokenization

LLMs do not read whole words or letters directly. They work with tokens — small pieces of text. A token can be a whole short word, part of a longer word, a space, or a punctuation mark. Breaking text into tokens is called tokenization.

For example, the sentence "Learning to code is fun" might split like this:

Text:    Learning to code is fun

Tokens:  ["Learning", " to", " code", " is", " fun"]
          (notice the leading spaces are part of the tokens)

A longer word splits into pieces:
"tokenization"  ->  ["token", "ization"]

Each token is mapped to a number (an ID) from the model's fixed vocabulary. So the text you type is really turned into a list of integers before the model ever sees it.
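The mapping from text to integer IDs can be sketched with a toy tokenizer. This is a minimal illustration, assuming a tiny hand-made vocabulary and a simple longest-match rule; real tokenizers learn vocabularies with tens of thousands of entries. The names `VOCAB`, `encode`, and `decode` are hypothetical and not taken from any library.

```python
# Toy vocabulary: every token string maps to an integer ID.
# The entries and IDs are invented for illustration.
VOCAB = {"Learning": 0, " to": 1, " code": 2, " is": 3, " fun": 4,
         "token": 5, "ization": 6}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}


def encode(text):
    """Split text into token IDs using greedy longest-match against VOCAB."""
    ids = []
    pos = 0
    while pos < len(text):
        match = None
        # Try longer vocabulary entries first.
        for token in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(token, pos):
                match = token
                break
        if match is None:
            raise ValueError(f"no token covers the text at position {pos}")
        ids.append(VOCAB[match])
        pos += len(match)
    return ids


def decode(ids):
    """Turn a list of token IDs back into text."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in ids)


print(encode("Learning to code is fun"))  # [0, 1, 2, 3, 4]
print(encode("tokenization"))             # [5, 6]
print(decode([5, 6]))                     # tokenization
```

The model itself only ever sees the lists of integers; decoding back to text happens after it has produced its output IDs.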

Rule of thumb: In English, one token is roughly 4 characters, or about ¾ of a word. So 1,000 tokens is around 750 words. This matters because pricing and length limits are measured in tokens, not words.
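The rule of thumb translates into simple arithmetic. The sketch below only applies the approximate ratios stated above (4 characters per token, ¾ of a word per token); the function names are hypothetical, and real token counts depend on the specific tokenizer and language.

```python
def estimate_tokens(text):
    """Rough English-only estimate: about 4 characters per token."""
    return max(1, round(len(text) / 4))


def estimate_words(tokens):
    """Rough estimate: one token is about 3/4 of a word."""
    return tokens * 3 // 4


print(estimate_words(1000))                        # 750
print(estimate_tokens("Learning to code is fun"))  # 23 characters -> 6
```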

Training on Text

Before a model is useful, it has to be trained. Training happens in stages, but the foundational stage is called pretraining.

Pretraining: learning to predict

The model is shown gigantic amounts of text (books, websites, articles, code) and given one simple repeated task: hide the next token and try to guess it. When it guesses wrong, an optimization algorithm nudges its internal numbers slightly so that the correct token would have been a little more likely. Repeat this across billions or even trillions of tokens and the model gradually absorbs grammar, facts, writing styles, and reasoning patterns.

Training loop (conceptually):

  1. Take a sentence:        "The capital of France is ___"
  2. Model predicts:         "London"  (wrong)
  3. Correct answer was:     "Paris"
  4. Adjust parameters so "Paris" becomes more likely next time
  5. Repeat ~trillions of times across all the text
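The loop above can be sketched in a few lines. This is a heavily simplified illustration, assuming the "model" is nothing more than three adjustable scores for one fixed context and that the update rule is plain gradient descent on cross-entropy loss. Real training adjusts billions of parameters across many different contexts; the vocabulary, starting scores, and learning rate here are invented.

```python
import math

VOCAB = ["London", "Paris", "Berlin"]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def training_step(logits, target_index, learning_rate=0.5):
    """One gradient-descent step on cross-entropy loss for one example."""
    probs = softmax(logits)
    # The gradient for each score is (probability - 1) for the correct
    # token and (probability - 0) for every other token.
    return [
        logit - learning_rate * (p - (1.0 if i == target_index else 0.0))
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


# Scores for the context "The capital of France is ___".
# They start out favoring "London" (the wrong answer).
logits = [2.0, 0.5, 0.0]
target = VOCAB.index("Paris")

before = softmax(logits)[target]
for _ in range(20):
    logits = training_step(logits, target)
after = softmax(logits)[target]

print(f"P(Paris) before training: {before:.2f}")
print(f"P(Paris) after 20 steps:  {after:.2f}")
```

Each step raises the score of the correct token and lowers the others in proportion to how much probability they were wrongly given, which is the "nudge" described in step 4.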

Fine-tuning & alignment

After pretraining, the model can complete text but is not yet a helpful assistant. A second phase — using human feedback and curated examples — teaches it to follow instructions, be helpful, and avoid harmful output. This is why a chatbot answers your question instead of just continuing your sentence.

What "Parameters" Mean

You will often see models described by size: "7 billion parameters", "70B", and so on. A parameter is simply one adjustable number inside the model — like a tiny dial. During training, these dials are tuned so the model's predictions get better.

Think of parameters as the model's stored "knowledge", spread across billions of numbers. No single parameter means anything by itself; the behavior emerges from all of them working together.

Term	Plain meaning
Parameter	One adjustable number (weight) inside the model
Billions of parameters	More dials → more capacity to learn complex patterns
Weights	Another word for the trained parameter values
Model size	Roughly how many parameters it has (e.g. 7B, 70B)

More parameters is not automatically "smarter". Data quality, training method, and fine-tuning matter just as much. A smaller, well-trained model can outperform a larger, poorly trained one.
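To make the numbers concrete, here is a small sketch of where parameter counts come from and how much storage they need. It assumes a plain fully connected layer and 16-bit (2-byte) weights; the layer size is an arbitrary example, and real models combine many layers of several kinds.

```python
def dense_layer_params(n_inputs, n_outputs):
    """One weight per input-output pair, plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


def weights_size_gb(n_params, bytes_per_param=2):
    """Storage for the weights alone, assuming 2 bytes per parameter."""
    return n_params * bytes_per_param / 1e9


print(dense_layer_params(4096, 4096))  # 16781312
print(weights_size_gb(7_000_000_000))  # 14.0
```

Under these assumptions a single 4096-by-4096 layer already holds almost 17 million dials, and a 7B model needs roughly 14 GB just to store its weights.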

The Transformer & Attention

Nearly all modern LLMs are built on an architecture called the transformer, introduced in 2017. Its key trick is a mechanism called attention.

Attention in plain language

When the model processes a token, attention lets it look at the other tokens in its context (in the decoder-style models behind most chat assistants, the ones that came before it) and decide which are most relevant to understanding the current one. It "pays attention" to the important context and largely ignores the rest.

Consider the word "it" in this sentence:

"The trophy did not fit in the suitcase because it was too big."

When processing "it", attention helps the model link it to
"trophy" (not "suitcase"), because that is what makes the
sentence make sense. Attention scores tell the model which
earlier words to weight most heavily.

Because attention can connect any word to any other word, the model handles long-range relationships well — something older approaches struggled with. Stacking many attention layers lets the model build up an increasingly rich understanding of the input before it predicts anything.
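The core computation, usually called scaled dot-product attention, can be sketched for a single word. This is a minimal illustration in plain Python: the two-dimensional vectors below are invented so that "it" lines up with "trophy", whereas a real model learns its query, key, and value vectors during training and uses far more dimensions.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    # Score each key by its similarity (dot product) to the query.
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors.
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output


words = ["trophy", "suitcase", "fit"]
keys = [[1.0, 0.0], [0.2, 0.9], [0.0, 0.1]]  # made-up vectors
values = keys                                 # reused for simplicity
query_for_it = [2.0, 0.0]                     # made-up query for "it"

weights, output = attend(query_for_it, keys, values)
for word, weight in zip(words, weights):
    print(f"{word:>9}: {weight:.2f}")
```

With these made-up vectors, "trophy" receives the largest weight, so the output for "it" is dominated by the information stored for "trophy".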

Next-Token Prediction

Here is where everything comes together. When you send a prompt, the model converts it to tokens, runs them through its transformer layers, and produces a probability for every possible next token in its vocabulary.

Prompt:  "The sky is"

Model's predicted probabilities for the next token:

  " blue"    62%
  " clear"   11%
  " falling"  6%
  " grey"     5%
  ...thousands more, each with a tiny probability

It picks one token, appends it, then repeats the whole
process to choose the token after that — one at a time.

This is why responses appear piece by piece: the model is literally generating one token, adding it to the running text, and feeding everything back in to predict the next. It is a loop, not a single answer pulled from storage.
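The generation loop can be sketched as follows. The function `next_token_probs` is a hypothetical stand-in for the real model: it returns invented probabilities from a small table, where an actual LLM would compute them with its transformer layers. The sketch also assumes the simplest selection rule, always taking the most likely token.

```python
def next_token_probs(tokens):
    """Stand-in for a model: returns {token: probability} for the next token.

    The numbers are invented, and the thousands of low-probability
    tokens a real model would also score are omitted.
    """
    table = {
        "The sky is": {" blue": 0.62, " clear": 0.11,
                       " falling": 0.06, " grey": 0.05},
        "The sky is blue": {".": 0.90, " today": 0.10},
    }
    return table.get("".join(tokens), {"<end>": 1.0})


def generate(prompt_tokens, max_new_tokens=10):
    """Repeatedly predict a token, append it, and feed everything back in."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        next_token = max(probs, key=probs.get)  # pick the most likely token
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return "".join(tokens)


print(generate(["The", " sky", " is"]))  # The sky is blue.
```

Every pass through the loop sees the full running text, including the tokens the model itself just produced.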

This is the whole engine. "Reasoning", "writing", and "coding" are all the same underlying loop of next-token prediction — just applied to different kinds of input.

Context Window & Temperature

Context window

The context window is the maximum amount of text (measured in tokens) the model can consider at once: your prompt, its own reply, and any conversation history. If a conversation grows past this limit, something has to go; chat applications typically drop the earliest parts, so the model effectively "forgets" them.

Context window	Roughly equals
8,000 tokens	~6,000 words (a long article)
128,000 tokens	~96,000 words (a short book)
1,000,000 tokens	~750,000 words (several books)
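One common way an application keeps a conversation inside the window is to drop the oldest messages first. The sketch below shows that strategy only; it is an assumption about how a chat application might behave, not something the model does itself. The function names are hypothetical, and the token count uses the rough 4-characters-per-token estimate.

```python
def rough_count(text):
    """Approximate token count: about 4 characters per token."""
    return max(1, len(text) // 4)


def fit_to_window(messages, max_tokens, count_tokens=rough_count):
    """Keep the most recent messages whose total token count fits."""
    kept = []
    used = 0
    # Walk backwards from the newest message and stop at the first
    # one that no longer fits.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    return kept


history = ["a" * 40, "b" * 40, "c" * 40]  # about 10 tokens each
visible = fit_to_window(history, max_tokens=25)
print(len(visible))  # 2 (the oldest message was dropped)
```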

Temperature

Temperature is a setting that controls how random the token selection is. At a low temperature the model almost always picks the highest-probability token, giving focused, predictable output. At a higher temperature it is more willing to pick less likely tokens, producing more varied and creative — but less reliable — text.

Low temperature  (~0.2):  focused, predictable, sometimes repetitive
Medium           (~0.7):  balanced, natural-sounding
High temperature (~1.2):  creative, surprising, more error-prone

Same prompt, low temp:   "The sky is blue."
Same prompt, high temp:  "The sky is a bruised violet at dusk."

Practical tip: Use low temperature for code, data extraction, and factual tasks; use higher temperature for brainstorming, story writing, or generating varied ideas.
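The effect of temperature on the probabilities can be sketched directly. In the usual formulation, the model's raw scores (logits) are divided by the temperature before being turned into probabilities. The scores below are invented for illustration, and the helper names are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Divide scores by the temperature, then convert to probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(tokens, probs, rng):
    """Pick one token at random, weighted by its probability."""
    return rng.choices(tokens, weights=probs, k=1)[0]


tokens = [" blue", " clear", " falling", " grey"]
logits = [2.5, 0.8, 0.2, 0.0]  # made-up raw scores

rng = random.Random(0)  # fixed seed so repeated runs match
for temperature in (0.2, 0.7, 1.2):
    probs = softmax_with_temperature(logits, temperature)
    picked = sample(tokens, probs, rng)
    rounded = [round(p, 3) for p in probs]
    print(f"T={temperature}: {rounded} picked{picked}")
```

At low temperature nearly all the probability piles onto the top token; at higher temperature the distribution flattens, so less likely tokens are chosen more often.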

Limitations & Hallucinations

Understanding how LLMs work makes their weaknesses easy to predict. Because the model generates statistically likely text rather than retrieving verified facts, it can confidently produce information that is simply wrong. This is called a hallucination.

Hallucinations: The model may invent citations, APIs, dates, or facts that look plausible but do not exist — because "plausible-sounding" is exactly what it optimizes for.
Knowledge cutoff: A model only knows what was in its training data. Events after its cutoff date are unknown unless given to it in the prompt.
No true understanding: It models patterns in language, not meaning or the real world. It has no beliefs, intentions, or live access to facts.
Sensitivity to wording: Small changes in how you phrase a prompt can noticeably change the answer.
Math & precise logic: Without tools, step-by-step arithmetic and exact reasoning can be unreliable.

Always verify. Treat an LLM as a fast, knowledgeable, sometimes-mistaken assistant — not as a source of truth. Double-check facts, code, and anything important before relying on it.

Once you internalize that an LLM is a next-token prediction engine trained on huge amounts of text, its strengths and quirks stop feeling mysterious. It is a remarkably useful tool — as long as you understand what it is actually doing.
