How Does AI Think? Part 4 — Under the Hood: Tokenizers, Training & the Art of Making AI Work
📖 This is Part 4 of the "How Does AI Think?" series. Part 1 · Part 2 · ← Part 3: Agents, Reasoning & AGI
Introduction — Lifting the Hood
The first three parts told the story of AI: what it can do, how it evolved, and where it might be going. This part answers a different question: how does it actually work in practice?
When you type "Explain quantum mechanics" into ChatGPT and get a coherent answer in seconds, a remarkable chain of engineering is invisibly at play: your text is split into tokens, mapped into a high-dimensional embedding space, processed through billions of parameters in a transformer, and the output is generated token by token using clever inference optimisations. The model itself was pre-trained on trillions of tokens, fine-tuned with human feedback, and prompted to behave helpfully.
Let's trace that chain, piece by piece.
1. Why Tokenize?
Neural networks operate on numbers, not letters. Tokenization is the process of converting raw text into a sequence of integer IDs drawn from a fixed vocabulary. The choice of tokenizer profoundly affects model quality, speed, and cost.
Early NLP used word-level tokenization — one token per word. The problem: vocabularies explode (English alone has 170,000+ words), rare words get no representation, and the model can't handle misspellings, new coinages, or non-English scripts gracefully.
Character-level tokenization goes the other way: one token per character. The vocabulary is tiny (~256 when working on bytes), but sequences become very long ("hello" = 5 tokens), which makes $O(n^2)$ attention expensive and long-range dependencies harder to learn.
1.1 BPE — Byte Pair Encoding (Sennrich et al., 2016)
The modern solution is subword tokenization, and the dominant algorithm is Byte Pair Encoding (BPE). It starts with individual bytes (or characters) and iteratively merges the most frequent adjacent pair:
- Start with individual characters: l o w e r
- Count all bigrams; the most frequent pair is (say) e r
- Merge into a new token: l o w er
- Repeat until the vocabulary reaches the desired size (e.g., 32K, 100K, 200K)
The result: common words like "the" become a single token, while rare words like "defenestration" get split into subwords ("def" + "en" + "est" + "ration"). This balances vocabulary size, sequence length, and coverage of all possible inputs.
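The merge loop above can be sketched in a few lines. This is a toy, character-level version run on a hypothetical four-word corpus (the words and frequencies are made up for illustration); production tokenizers work on bytes, pre-split text with regular expressions, and learn tens of thousands of merges.

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dict (toy, character-level)."""
    # Represent each word as a tuple of symbols, starting from single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the best pair fused into one symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges, corpus

# Hypothetical toy corpus: word -> how often it appears.
toy_words = {"low": 5, "lower": 2, "newer": 6, "wider": 3}
merges, segmented = train_bpe(toy_words, num_merges=3)
print(merges)      # learned merge rules, in order
print(segmented)   # how each word is now split
```

On this corpus the first learned merge is the pair (e, r), because "er" appears in three of the four words.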
1.2 Tokenizer Comparison
| Model | Tokenizer | Vocab Size | Notes |
|---|---|---|---|
| GPT-2 | BPE | 50,257 | Byte-level BPE, no unknown tokens |
| GPT-3 | BPE (r50k_base) | 50,257 | Same vocabulary as GPT-2 |
| GPT-3.5 Turbo / GPT-4 | BPE (cl100k_base) | 100,277 | Better multilingual, code handling |
| Llama 3 | BPE (tiktoken) | 128,256 | 4× larger vocab than Llama 2's 32K, for efficiency |
| Claude 3 | BPE | ~100K | Anthropic's custom tokenizer |
| T5 / Gemini | SentencePiece (Unigram) | 32–256K | Probabilistic subword model |
| BERT | WordPiece | 30,522 | Similar to BPE with likelihood-based merges |
🔤 Interactive: BPE Tokenizer Visualizer
Type any text and see how a BPE-like tokenizer splits it into subword tokens. Each token gets a unique colour. Observe how common words are single tokens while rare or complex words get broken into pieces.
1.3 Why Tokenization Matters
- Cost: API pricing is per-token. Text that splits into more tokens per word (German compound nouns, for example) costs more than English for the same meaning.
- Context window: GPT-4 Turbo's 128K context means 128K tokens, not characters. Efficient tokenization lets you fit more content.
- Multilingual fairness: early tokenizers were trained primarily on English, so Chinese or Arabic text consumed 2–3× more tokens per character than English. Newer tokenizers with larger vocabularies are more equitable.
- Math and counting failures: "How many r's in strawberry?" is hard because the model never sees individual letters; it sees tokens like "str", "aw", "berry".
2. From Token IDs to Meaning
A token ID like 15339 is meaningless to a neural network. The embedding layer maps each token ID to a dense vector of (typically) 768 to 12,288 dimensions. These vectors are learned during training: the model discovers that similar words should have similar vectors.
Mathematically, if we have vocabulary size $V$ and embedding dimension $d$, the embedding matrix $\mathbf{E} \in \mathbb{R}^{V \times d}$ maps token $t$ to vector $\mathbf{e}_t = \mathbf{E}[t]$.
2.1 The Magic of Vector Arithmetic
The famous result from Word2Vec (Mikolov et al., 2013):
$\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}$
This isn't just a party trick — it shows that the embedding space has semantic structure. Directions in the space correspond to concepts: gender, tense, plurality, geography, and more. This geometric view of meaning underlies everything from search engines (find documents near a query vector) to RAG (Retrieval Augmented Generation).
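To make the arithmetic concrete, here is a minimal sketch with hand-made 3-dimensional vectors. The numbers are hypothetical and chosen so the axes roughly mean (royalty, maleness, femaleness); real embeddings are learned, have hundreds of dimensions, and their axes are not individually interpretable.

```python
import math

# Hypothetical toy embedding table (word -> vector). Not real Word2Vec values.
E = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(E[a], E[b], E[c])]
    candidates = [w for w in E if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, E[w]))

result = analogy("king", "man", "woman")
print(result)
```

The same nearest-neighbour lookup by cosine similarity is the core operation behind vector search and the retrieval step of RAG, just over millions of vectors instead of five.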
🌐 Interactive: 2D Embedding Space Explorer
Explore a simplified 2D projection of word embeddings. Words that are semantically related cluster together. Drag to pan, scroll to zoom. Click a word to see its nearest neighbours.
2.2 Positional Encodings
Tokens enter the transformer as embeddings, but embeddings alone don't encode position — "the cat sat on the mat" and "the mat sat on the cat" would have the same token set. Positional encodings add position information:
- Sinusoidal (original Transformer): fixed sine/cosine waves at different frequencies, $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d})$
- Learned absolute (GPT-2, BERT): a separate embedding table for positions 0..N
- RoPE (Rotary, used by Llama, Mistral): encodes position as a rotation of the query and key vectors, so attention scores depend on relative position; this also makes it amenable to context-length extension techniques
- ALiBi (Attention with Linear Biases): adds a linear penalty to attention scores based on distance, implicitly creating position awareness
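As an illustration of the first scheme in the list, here is a small sketch of the sinusoidal encoding for a single position. It follows the formula from the original Transformer paper (sine on even dimensions, cosine on odd ones); the function name is my own.

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    pe = []
    for even_dim in range(0, d_model, 2):
        # even_dim plays the role of 2i in the formula, so the exponent is 2i/d.
        angle = pos / (10000 ** (even_dim / d_model))
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i + 1
    return pe[:d_model]

# Position 0 is all (sin 0, cos 0) pairs; later positions rotate at
# different speeds in each pair of dimensions.
print(sinusoidal_encoding(0, 4))
print(sinusoidal_encoding(1, 4))
```

The resulting vector is simply added to the token embedding before the first transformer layer.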
3. How Models Are Trained
3.1 The Pipeline
- Data Collection: crawl the internet, including Common Crawl (petabytes), books, code (GitHub), Wikipedia, academic papers
- Data Cleaning & Filtering: deduplication, quality filtering (perplexity scoring), toxic content removal, PII redaction
- Tokenization: train a BPE tokenizer on the corpus, then encode all text into token sequences
- Pre-training (Next Token Prediction): train the transformer on trillions of tokens to predict the next token. Months on thousands of GPUs.
- Supervised Fine-Tuning (SFT): train on curated (prompt, ideal_response) pairs to teach conversational format and instruction-following
- RLHF / DPO Alignment: use human preference data to steer the model toward helpful, harmless, and honest responses
3.2 Pre-training — The Big Computation
The core pre-training objective is deceptively simple: predict the next token. Given a sequence of tokens $t_1, t_2, \dots, t_N$, minimise:
$\mathcal{L} = -\sum_{i=1}^{N} \log P(t_i | t_1, \dots, t_{i-1}; \theta)$
This is the cross-entropy loss; minimising it is the same as maximising the likelihood the model assigns to the training text. The entire model, billions of parameters $\theta$, is optimised with the AdamW optimiser over trillions of tokens. The magic is that to predict the next token well, the model must implicitly learn grammar, facts, reasoning patterns, coding conventions, and more.
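A minimal sketch of that loss, assuming the model has already produced one vector of logits per position (the logits below are made-up numbers over a hypothetical 4-token vocabulary). Real training code computes the same quantity in batched tensor form and usually averages over tokens rather than summing.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, targets):
    """Sum over positions of -log P(correct next token)."""
    loss = 0.0
    for logits, target in zip(logits_per_position, targets):
        loss += -math.log(softmax(logits)[target])
    return loss

# A model that knows nothing spreads probability evenly: loss = N * log(V).
uniform = next_token_loss([[0.0, 0.0, 0.0, 0.0]] * 3, targets=[0, 1, 2])

# A model that puts high scores on the right tokens gets a much lower loss.
confident = next_token_loss(
    [[5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0]],
    targets=[0, 1, 2],
)
print(uniform, confident)
```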
3.3 The Scale of Training
📊 Interactive: Training Cost Breakdown
How much does it cost to train a frontier model? Approximate breakdown for a hypothetical 400B model.
3.4 Distributed Training
No single GPU can hold a frontier model. Training is distributed across thousands of GPUs using parallelism strategies:
- Data Parallelism: each GPU has a copy of the model, processes different data batches, and gradients are averaged across GPUs. Simple but memory-hungry.
- Tensor Parallelism: split individual matrix multiplications across GPUs. A single layer's computation is shared between devices. Requires fast interconnect (NVLink).
- Pipeline Parallelism: different layers live on different GPUs. Data flows through a pipeline, with micro-batching to keep all GPUs busy.
Modern training frameworks such as Megatron-LM and DeepSpeed combine all three, and sharded data parallelism (as in PyTorch FSDP) reduces the memory cost of keeping a full replica on every GPU. Llama 3 405B was trained on up to 16,384 H100 GPUs, an engineering feat as impressive as the model architecture itself.
3.5 Chinchilla Scaling Laws
DeepMind's Chinchilla paper (Hoffmann et al., 2022) showed that many large models were under-trained: given a fixed compute budget, it's better to train a smaller model on more data than a larger model on less data. The optimal ratio:
$\text{Optimal tokens} \approx 20 \times \text{parameters}$
A 70B model should train on ~1.4 trillion tokens. This shifted the industry toward higher-quality, more abundant data rather than just bigger models.
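The arithmetic is easy to check. The sketch below applies the 20-tokens-per-parameter rule and, as an extra assumption not stated above, the widely used estimate of about 6 floating-point operations per parameter per training token to get a rough compute budget.

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Compute-optimal training tokens under the ~20 tokens/parameter rule."""
    return tokens_per_param * n_params

def approx_training_flops(n_params, n_tokens):
    """Rule-of-thumb training compute: ~6 FLOPs per parameter per token.

    This approximation is an assumption of the sketch, not a figure from the article.
    """
    return 6 * n_params * n_tokens

params = 70_000_000_000                      # 70B parameters
tokens = chinchilla_optimal_tokens(params)   # 1.4 trillion tokens
flops = approx_training_flops(params, tokens)
print(f"{tokens:.2e} tokens, ~{flops:.2e} FLOPs")
```

Many recent models train well past this ratio on purpose: a smaller model trained on more tokens costs more to train but is cheaper to serve.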
4. From Base Model to Assistant
A base model (pre-trained only on next-token prediction) is like a brilliant but uncooperative savant. It can complete text — but it won't follow instructions, answer questions, or refuse harmful requests. It just predicts what comes next in the training distribution.
Fine-tuning transforms it into a useful assistant.
4.1 Supervised Fine-Tuning (SFT)
Train the model on high-quality (instruction, response) pairs. Tens of thousands of examples covering diverse tasks: coding, math, creative writing, summarisation, multi-turn conversation. The model learns the format of helpful interaction.
4.2 LoRA — Low-Rank Adaptation
Full fine-tuning updates all parameters — expensive and requires as much memory as pre-training. LoRA (Hu et al., 2021) freezes the pre-trained weights and injects small trainable low-rank matrices into each layer:
$\mathbf{W}' = \mathbf{W} + \Delta\mathbf{W} = \mathbf{W} + \mathbf{B}\mathbf{A}$
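Where $\mathbf{W} \in \mathbb{R}^{d \times d}$ is frozen, and $\mathbf{B} \in \mathbb{R}^{d \times r}$, $\mathbf{A} \in \mathbb{R}^{r \times d}$ with rank $r \ll d$ (typically 4–64). This means updating only $2dr$ parameters per adapted matrix instead of $d^2$. With $d = 4096$ and $r = 8$ that is 65,536 instead of about 16.8 million, roughly 0.4% of that matrix, and often well under 1% of the model's total parameters.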
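Here is a small sketch of both halves of that idea: the parameter count, and a forward pass that never materialises the full $d \times d$ update. It uses plain Python lists and tiny hypothetical matrices; real implementations also apply a scaling factor and operate on GPU tensors.

```python
def lora_param_counts(d, r):
    """Return (full fine-tuning params, LoRA params, ratio) for one d x d matrix."""
    full = d * d
    lora = 2 * d * r
    return full, lora, lora / full

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x):
    """Compute (W + B A) x as W x + B (A x), so the d x d update is never built."""
    base = matvec(W, x)               # frozen pre-trained path
    update = matvec(B, matvec(A, x))  # low-rank trainable path: d -> r -> d
    return [b + u for b, u in zip(base, update)]

print(lora_param_counts(4096, 8))

# Tiny example with d = 2, r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]                      # r x d
x = [2.0, 3.0]
at_init = lora_forward(W, A, [[0.0], [0.0]], x)   # B starts at zero
trained = lora_forward(W, A, [[1.0], [2.0]], x)   # hypothetical trained B
print(at_init, trained)
```

Initialising $\mathbf{B}$ to zero, as the LoRA paper does, means the adapted model starts out identical to the pre-trained one.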
📊 Interactive: LoRA Rank vs Parameters
See how LoRA rank affects the number of trainable parameters. For a model with $d = 4096$ hidden dimension (like Llama 7B), vary the rank to see the tradeoff.
4.3 RLHF and DPO
After SFT, the model follows instructions but may still produce harmful, biased, or unhelpful responses. RLHF (Reinforcement Learning from Human Feedback) aligns it with human preferences:
- Collect comparisons — humans rank multiple model responses to the same prompt
- Train a reward model — a neural network that predicts human preference scores
- Optimise with PPO — use RL to make the LLM generate responses that score high on the reward model, with a KL penalty to prevent drifting too far from the SFT model
DPO (Direct Preference Optimisation, Rafailov et al., 2023) simplifies this: skip the reward model entirely and directly optimise the policy from preference pairs, treating it as a classification problem:
$\mathcal{L}_{\text{DPO}} = -\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)$
Where $y_w$ is the preferred response and $y_l$ the dispreferred one. DPO is simpler and often more stable to train, and it has become a popular alternative to PPO-based RLHF.
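For a single preference pair, the DPO loss reduces to a few lines once you have the total log-probability each model assigns to each response. The sketch below assumes those four numbers are already computed; the values used are hypothetical.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred, dispreferred) pair.

    logp_*     : log pi_theta(y | x) for the policy being trained
    ref_logp_* : log pi_ref(y | x) for the frozen reference model
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written to avoid overflow for large |margin|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: margin is 0, loss is log(2).
start = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy has raised the preferred response and lowered the other: loss drops.
improved = dpo_loss(-3.0, -9.0, -5.0, -7.0)
print(start, improved)
```

The loss only cares about how far the policy has moved relative to the reference, which is what plays the role of the KL penalty in PPO-based RLHF.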
5. The Art of Serving Models
Training a model is expensive but happens once. Inference — running the model to generate responses — happens billions of times. Making inference fast, cheap, and memory-efficient is where much of the engineering effort happens.
5.1 The KV-Cache
Autoregressive generation produces one token at a time. Naively, each new token requires recomputing attention over all previous tokens — $O(n^2)$ per token. The KV-cache stores the Key and Value vectors from previous tokens so they don't need to be recomputed:
- Prefill phase — process the entire prompt in parallel, compute all K/V vectors, store them
- Decode phase — for each new token, compute only the new Q/K/V, attend to cached K/V
This turns generation from $O(n^2)$ to $O(n)$ per token, but the cache consumes memory. For a 70B model with a 128K context, the KV-cache can reach tens of gigabytes per sequence.
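The memory cost can be estimated directly from the model's shape. The configuration below (80 layers, 8 key/value heads of dimension 128, 16-bit values) is an assumption chosen to resemble a 70B-class model with grouped-query attention; other architectures will give different numbers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to cache K and V for every layer and token position."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * seq_len * batch_size

GIB = 2 ** 30

# Assumed 70B-class shape with grouped-query attention (8 KV heads).
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=128 * 1024)

# Same shape with one KV head per query head (64), i.e. no grouped-query attention.
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=128 * 1024)

print(f"GQA: {gqa / GIB:.0f} GiB, full multi-head: {mha / GIB:.0f} GiB")
```

This is per sequence: serving many long conversations at once multiplies the figure, which is why techniques that shrink or share the cache matter so much.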
⚡ Interactive: KV-Cache Visualisation
Watch how the KV-cache grows as tokens are generated. Each row is an attention layer, each cell is a cached (K,V) pair for a token position.
5.2 Quantization
Pre-trained weights are stored as 16-bit floating-point (FP16/BF16). Quantization reduces precision to 8-bit, 4-bit, or even 2-bit integers, dramatically shrinking model size and speeding up inference:
| Precision | Memory (70B model) | Quality Loss | Speed |
|---|---|---|---|
| FP16 (baseline) | ~140 GB | None | 1× |
| INT8 (LLM.int8(), SmoothQuant) | ~70 GB | Minimal | ~1.5× |
| INT4 (GPTQ, AWQ, GGUF Q4) | ~35 GB | Small | ~2× |
| INT2/3 (extreme) | ~18 GB | Noticeable | ~3× |
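The memory column is just parameters times bits. The sketch below reproduces it and also shows the simplest possible quantizer, symmetric round-to-nearest with a single scale. That is a deliberately naive scheme for illustration; the methods named in the table use per-group scales and calibration data to lose far less accuracy.

```python
def model_size_gb(n_params, bits):
    """Weight storage in GB (10^9 bytes), ignoring scales and other overhead."""
    return n_params * bits / 8 / 1e9

def quantize_symmetric(weights, bits):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]

def max_abs_error(weights, bits):
    q, scale = quantize_symmetric(weights, bits)
    return max(abs(w - d) for w, d in zip(weights, dequantize(q, scale)))

weights = [0.5, -1.0, 0.25, 0.0]   # hypothetical weight values
for bits in (8, 4):
    print(bits, quantize_symmetric(weights, bits)[0], max_abs_error(weights, bits))
print([model_size_gb(70e9, b) for b in (16, 8, 4, 2)])
```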
📊 Interactive: Quantization Impact
See how reducing bit-width affects the representation of a weight value. The original FP16 value gets rounded to fewer levels, introducing quantization error.
5.3 Speculative Decoding
Speculative decoding uses a small, fast "draft" model to generate several candidate tokens ahead, then the large model verifies them in one parallel forward pass. When the large model agrees with the draft tokens (it usually does for predictable text), you get several tokens for roughly the cost of one large-model forward pass. Speed-ups of 2–3× are commonly reported.
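A simplified, greedy version of the idea can be sketched with two toy "models" that are just functions from a token list to the next token. This is an illustration only: the published algorithm accepts or rejects draft tokens with a probabilistic rule so that sampling matches the large model's distribution, and a real verifier scores all draft positions in one batched forward pass rather than in a Python loop.

```python
def greedy_decode(next_token, prompt, n_new):
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(next_token(tokens))
    return tokens

def speculative_decode(target_next, draft_next, prompt, n_new, k=4):
    """Greedy speculative decoding; returns (tokens, number of verify passes)."""
    tokens, verify_passes = list(prompt), 0
    while len(tokens) - len(prompt) < n_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target model checks every proposed position (conceptually one pass).
        verify_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            accepted.append(expected)
            if draft[i] != expected:
                break            # first mismatch: keep the target's token, drop the rest
        else:
            accepted.append(target_next(tokens + draft))  # all matched: bonus token
        tokens.extend(accepted)
    return tokens[: len(prompt) + n_new], verify_passes

# Hypothetical toy models: the target counts upward mod 10; the draft agrees
# except after token 2, where it guesses wrong.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10

baseline = greedy_decode(target, [0], 8)
spec_tokens, passes = speculative_decode(target, draft, [0], 8)
print(spec_tokens, passes)
```

The output is identical to decoding with the target model alone; only the number of expensive passes changes (2 here instead of 8).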
5.4 Other Optimisations
- Flash Attention (Dao et al., 2022) — reorders attention computation to minimise GPU memory reads/writes (IO-aware), achieving 2–4× speedup and enabling much longer contexts
- Paged Attention (vLLM) — manages KV-cache memory like OS virtual memory with paging, eliminating fragmentation and enabling higher batch throughput
- Mixture of Experts (MoE) — only activate a subset of parameters per token. Mixtral 8×7B has 47B total params but only 13B active per token — nearly as fast as a 13B model with near-47B quality
- Continuous batching — dynamically add/remove requests from a batch as they finish, maximising GPU utilisation
6. Crafting Effective Prompts
The same model can produce wildly different outputs depending on how you ask. Prompt engineering is the practice of designing inputs that reliably elicit the desired behaviour — without changing the model's weights.
6.1 Key Techniques
- System Prompts: set the model's persona, constraints, and output format upfront. "You are a helpful coding assistant. Always include code examples. Never execute destructive commands."
- Few-Shot Examples: show the model 2–5 input/output examples before your actual question. The model learns the pattern and applies it to new inputs.
- Chain-of-Thought: add "Think step by step" or provide a worked example with reasoning. This often improves performance on math, logic, and other multi-step tasks.
- Structured Output: request JSON, XML, or specific formats. Modern models support JSON mode and function calling for reliable structured output.
🎯 Interactive: Prompt Engineering Playground
Compare how different prompt strategies affect model output quality. Select a task and prompting approach to see simulated results.
6.2 Temperature and Sampling
When the model produces a probability distribution over the next token, the sampling strategy determines which token is actually chosen:
- Temperature ($T$) — divides logits by $T$ before softmax. $T \to 0$ makes output deterministic (always pick the highest-probability token). $T > 1$ makes it more random.
- Top-p (nucleus sampling) — only sample from the smallest set of tokens whose cumulative probability exceeds $p$. E.g., top-p=0.9 ignores the unlikely 10% tail.
- Top-k — only consider the $k$ most probable tokens.
$P(t_i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$
For factual tasks, use low temperature (0.0–0.3). For creative writing, try 0.7–1.0. For brainstorming, go up to 1.2+.
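The sampling steps described in this section fit in a short function. This is a sketch over a hypothetical 4-token vocabulary; it implements temperature and top-p only (top-k would be one more filter of the same shape), and treats temperature 0 as plain argmax.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """P(t_i) = exp(z_i / T) / sum_j exp(z_j / T), computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p):
    """Keep the smallest set of most-likely tokens whose mass reaches p, renormalised."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick the next token ID from raw logits."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy
    nucleus = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids])[0]

logits = [2.0, 1.0, 0.5, -1.0]
print(softmax_with_temperature(logits, 0.3))  # sharply peaked
print(softmax_with_temperature(logits, 1.5))  # flatter
print(sample(logits, temperature=0.8, top_p=0.9, rng=random.Random(0)))
```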
7. Beyond Text
The transformer architecture is remarkably general — it processes sequences of tokens, and those tokens don't have to be text. By converting images, audio, and video into token-like representations, the same architecture handles multiple modalities.
7.1 Vision Transformers (ViT)
ViT (Dosovitskiy et al., 2020) splits an image into fixed-size patches (16×16 pixels in the original paper), flattens each patch into a vector, linearly projects it, and processes the resulting sequence through a standard transformer. The patch embeddings are the image's "tokens."
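The patching step itself is simple enough to sketch with nested lists. The example assumes a 224×224 RGB image, a common ViT input size; the learned linear projection and position embeddings that follow in a real ViT are left out.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            vec = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    vec.extend(image[y][x])   # append this pixel's channel values
            patches.append(vec)
    return patches

# A blank 224 x 224 RGB image stands in for real pixel data.
image = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
patches = patchify(image, patch=16)
print(len(patches), len(patches[0]))   # number of "tokens", values per token
```

A 224×224 image becomes 196 patch tokens of 768 raw values each (16 × 16 pixels × 3 channels), a sequence length a transformer handles easily.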
For multimodal LLMs (GPT-4V, Claude 3, Gemini), a vision encoder (often a CLIP ViT) converts the image into a sequence of embedding vectors that are concatenated with the text token embeddings and fed into the LLM together.
7.2 CLIP — Connecting Vision and Language
CLIP (Contrastive Language-Image Pre-training, OpenAI 2021) trains two encoders — one for images, one for text — so that matching image-text pairs have similar embeddings while non-matching pairs are pushed apart:
$\mathcal{L} = -\log \frac{\exp(\text{sim}(\mathbf{v}_i, \mathbf{t}_i) / \tau)}{\sum_j \exp(\text{sim}(\mathbf{v}_i, \mathbf{t}_j) / \tau)}$
This creates a shared embedding space where "a photo of a dog" and an actual dog photo are nearby. CLIP powers image search, zero-shot classification, and image generation guidance.
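The loss above can be sketched from a precomputed similarity matrix, where entry (i, j) is the similarity between image i and text j and the diagonal holds the matching pairs. The formula in the text is the image-to-text direction; CLIP averages it with the text-to-image direction, which the second function does. The similarity values are hypothetical.

```python
import math

def image_to_text_loss(sim, tau):
    """Mean over i of -log softmax_j(sim[i][j] / tau) evaluated at j = i."""
    total = 0.0
    for i, row in enumerate(sim):
        scaled = [s / tau for s in row]
        m = max(scaled)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += log_denominator - scaled[i]
    return total / len(sim)

def clip_loss(sim, tau):
    """Symmetric contrastive loss: average of both retrieval directions."""
    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (image_to_text_loss(sim, tau) + image_to_text_loss(transposed, tau))

aligned = [[1.0, 0.0], [0.0, 1.0]]      # matching pairs are the most similar
uninformative = [[0.0, 0.0], [0.0, 0.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]      # each image prefers the wrong caption

for name, sim in [("aligned", aligned), ("flat", uninformative), ("swapped", swapped)]:
    print(name, clip_loss(sim, tau=0.1))
```

A small temperature $\tau$ sharpens the softmax, so well-aligned pairs drive the loss close to zero while mismatches are punished heavily.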
7.3 Audio and Speech
- Whisper (OpenAI, 2022) — encoder-decoder transformer trained on 680K hours of multilingual audio. Transcribes speech in 99 languages with near-human accuracy.
- Audio tokens — models like AudioLM and MusicGen convert audio into discrete tokens using neural codecs (SoundStream, EnCodec), then model them as a token sequence.
- GPT-4o real-time voice — processes audio natively (not speech→text→LLM→text→speech), enabling natural conversation with emotions, accents, and singing.
7.4 Video and Beyond
Video is just images over time. Sora (OpenAI, 2024) generates video by operating in a latent space-time patch representation — treating video as a 3D grid of tokens (spatial × temporal). This same approach scales to generate long, physically coherent scenes.
The trajectory is clear: everything becomes tokens. Text, images, audio, video, code, structured data, tool calls, robotic actions — all expressed as sequences that a unified transformer can process. This convergence is what makes "foundation models" truly foundational.
📊 Interactive: Multimodal Model Capabilities Timeline
The rapid expansion of what AI models can process and generate.
8. From Keystroke to Answer
Let's trace the complete journey when you type a question into a model:
- Tokenization — your text is split into subword tokens by BPE. "What is photosynthesis?" → token IDs [3923, 374, 14767, 25247, 30]
- Embedding + Positional Encoding — each token ID → 4096-dimensional vector, plus position information (RoPE rotation)
- Prefill — all prompt tokens processed in parallel through N transformer layers. K/V vectors cached for every layer.
- Decode Loop — for each output token:
  - Compute Q for the new position, attend to cached K/V
  - Pass through MLP layers
  - Project to vocabulary logits (size 128K)
  - Apply temperature and top-p sampling
  - Output the selected token, cache its K/V
  - Check for stop condition (EOS token, max length)
- Detokenization — token IDs → text string, streamed to your screen
For a 70B model generating 100 tokens from a prompt of about 100 tokens: at roughly one multiply-accumulate per parameter per token, that is ~7 trillion multiply-accumulate operations in the prefill phase, then ~70 billion per decode step. The KV-cache holds ~2 GB for a 4K context. The whole thing takes 2–5 seconds on 8 GPUs.
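Those compute figures come from a back-of-the-envelope rule: a dense model does about one multiply-accumulate per parameter for each token it processes. The sketch below applies that rule, assuming a 100-token prompt; it ignores the attention computation itself, which grows with context length.

```python
def prefill_macs(n_params, prompt_tokens):
    """Approximate multiply-accumulates to process the whole prompt."""
    return n_params * prompt_tokens

def decode_macs(n_params, new_tokens):
    """Approximate multiply-accumulates to generate new_tokens, one step each."""
    return n_params * new_tokens

n_params = 70_000_000_000
prefill = prefill_macs(n_params, prompt_tokens=100)   # assumed prompt length
per_step = decode_macs(n_params, new_tokens=1)
total = prefill + decode_macs(n_params, new_tokens=100)
print(f"prefill {prefill:.1e}, per decode step {per_step:.1e}, total {total:.1e}")
```

Prefill and decode cost about the same per token, but prefill runs all prompt tokens in parallel while decode is sequential, which is why decode usually dominates latency.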
That's the engineering stack behind the magic.
9. Wrapping Up — The Four-Part Arc
Over four posts, we've traced the complete story of "How Does AI Think?":
- Part 1 — Logic to Statistics: From Ramon Llull's mechanical reasoning (1275) through Babbage, Turing, LISP, search algorithms, expert systems, the AI winters, and the revival through linear regression and gradient descent.
- Part 2 — Perception to Prediction: The deep learning revolution — perceptrons, backpropagation, CNNs, RNNs, word embeddings, attention, transformers, BERT, GPT, ChatGPT, and the age of foundation models.
- Part 3 — Action to Understanding: Reinforcement learning, game-playing AI, agents, tool use, chain-of-thought reasoning, the open-source revolution, alignment, interpretability, and the road to AGI.
- Part 4 — Under the Hood: Tokenization, embeddings, the training pipeline, fine-tuning with LoRA and RLHF, inference optimisation, prompt engineering, and multimodal AI — the engineering that makes it all work.
The AI field is moving faster than any technology in human history. By the time you finish reading this, there's probably a new model release, a new benchmark record, or a new open-source breakthrough. But the fundamentals — tokens, transformers, training, and the interplay of scale and architecture — will remain relevant far longer than any individual model name.
The best way to understand AI isn't just to read about it. It's to build with it. Run Ollama on your laptop. Fine-tune a small model with LoRA. Write a RAG pipeline. Build an agent. The tools have never been more accessible.
The future of AI is being built right now — and you can be part of it.
Read Part 5: AI in the Wild — Applications & the Developer's Toolkit →
Further Reading
- Sennrich et al. — "Neural Machine Translation of Rare Words with Subword Units" (BPE, 2016)
- Kudo & Richardson — "SentencePiece: A simple and language independent subword tokenizer" (2018)
- Vaswani et al. — "Attention Is All You Need" (Transformer, 2017)
- Hoffmann et al. — "Training Compute-Optimal Large Language Models" (Chinchilla, 2022)
- Hu et al. — "LoRA: Low-Rank Adaptation of Large Language Models" (2021)
- Rafailov et al. — "Direct Preference Optimization" (DPO, 2023)
- Dao et al. — "FlashAttention: Fast and Memory-Efficient Exact Attention" (2022)
- Kwon et al. — "Efficient Memory Management for Large Language Model Serving with PagedAttention" (vLLM, 2023)
- Leviathan et al. — "Fast Inference from Transformers via Speculative Decoding" (2023)
- Radford et al. — "Learning Transferable Visual Models From Natural Language Supervision" (CLIP, 2021)
- Dosovitskiy et al. — "An Image is Worth 16x16 Words" (ViT, 2020)
- Radford et al. — "Robust Speech Recognition via Large-Scale Weak Supervision" (Whisper, 2022)