Ivan Santoso - Blog/how-does-ai-think-part-4

📖 This is Part 4 of the "How Does AI Think?" series. Part 1 · Part 2 · ← Part 3: Agents, Reasoning & AGI

Introduction — Lifting the Hood

The first three parts told the story of AI: what it can do, how it evolved, and where it might be going. This part answers a different question: how does it actually work in practice?

When you type "Explain quantum mechanics" into ChatGPT and get a coherent answer in seconds, a remarkable chain of engineering is invisibly at play: your text is split into tokens, mapped into a high-dimensional embedding space, processed through billions of parameters in a transformer, and the output is generated token by token using clever inference optimisations. The model itself was pre-trained on trillions of tokens, fine-tuned with human feedback, and prompted to behave helpfully.

Let's trace that chain, piece by piece.

1. Why Tokenize?

Neural networks operate on numbers, not letters. Tokenization is the process of converting raw text into a sequence of integer IDs drawn from a fixed vocabulary. The choice of tokenizer profoundly affects model quality, speed, and cost.

Early NLP used word-level tokenization — one token per word. The problem: vocabularies explode (English alone has 170,000+ words), rare words get no representation, and the model can't handle misspellings, new coinages, or non-English scripts gracefully.

Character-level tokenization goes the other way: one token per character. Tiny vocabulary (~256), but sequences become very long ("hello" = 5 tokens), making attention $O(n^2)$ expensive and long-range dependencies harder to learn.

1.1 BPE — Byte Pair Encoding (Sennrich et al., 2016)

The modern solution is subword tokenization, and the dominant algorithm is Byte Pair Encoding (BPE). It starts with individual bytes (or characters) and iteratively merges the most frequent adjacent pair:

Start with individual characters: l o w e r
Count all bigrams; the most frequent pair is (say) e r
Merge into a new token: l o w er
Repeat until the vocabulary reaches the desired size (e.g., 32K, 100K, 200K)

The result: common words like "the" become a single token, while rare words like "defenestration" get split into subwords ("def" + "en" + "est" + "ration"). This balances vocabulary size, sequence length, and coverage of all possible inputs.

1.2 Tokenizer Comparison

Model	Tokenizer	Vocab Size	Notes
GPT-2	BPE	50,257	Byte-level BPE, no unknown tokens
GPT-3/3.5	BPE (tiktoken)	50,257	Same as GPT-2
GPT-4 / o1	BPE (cl100k)	100,277	Better multilingual, code handling
Llama 3	BPE (tiktoken)	128,256	4× larger vocab for efficiency
Claude 3	BPE	~100K	Anthropic's custom tokenizer
T5 / Gemini	SentencePiece (Unigram)	32–256K	Probabilistic subword model
BERT	WordPiece	30,522	Similar to BPE with likelihood-based merges

🔤 Interactive: BPE Tokenizer Visualizer

Type any text and see how a BPE-like tokenizer splits it into subword tokens. Each token gets a unique colour. Observe how common words are single tokens while rare or complex words get broken into pieces.

Characters: 0 · Tokens: 0 · Ratio: — chars/token

1.3 Why Tokenization Matters

Cost — API pricing is per-token. A word-heavy language (German compound nouns!) costs more tokens than concise English for the same meaning.
Context window — GPT-4's 128K context means 128K tokens, not characters. Efficient tokenization lets you fit more content.
Multilingual fairness — early tokenizers were trained primarily on English, so Chinese or Arabic text consumed 2–3× more tokens per character than English. Newer tokenizers with larger vocabularies are more equitable.
Math and counting failures — "How many r's in strawberry?" is hard because the model never sees individual letters — it sees tokens like "str", "aw", "berry".

2. From Token IDs to Meaning

A token ID like 15339 is meaningless to a neural network. The embedding layer maps each token ID to a dense vector of (typically) 768 to 12,288 dimensions. These vectors are learned during training — the model discovers that similar words should have similar vectors.

Mathematically, if we have vocabulary size $V$ and embedding dimension $d$, the embedding matrix $\mathbf{E} \in \mathbb{R}^{V \times d}$ maps token $t$ to vector $\mathbf{e}_t = \mathbf{E}[t]$.

2.1 The Magic of Vector Arithmetic

The famous result from Word2Vec (Mikolov et al., 2013):

$\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}$

This isn't just a party trick — it shows that the embedding space has semantic structure. Directions in the space correspond to concepts: gender, tense, plurality, geography, and more. This geometric view of meaning underlies everything from search engines (find documents near a query vector) to RAG (Retrieval Augmented Generation).

🌐 Interactive: 2D Embedding Space Explorer

Explore a simplified 2D projection of word embeddings. Words that are semantically related cluster together. Drag to pan, scroll to zoom. Click a word to see its nearest neighbours.

2.2 Positional Encodings

Tokens enter the transformer as embeddings, but embeddings alone don't encode position — "the cat sat on the mat" and "the mat sat on the cat" would have the same token set. Positional encodings add position information:

Sinusoidal (original Transformer) — fixed sine/cosine waves at different frequencies: $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d})$
Learned absolute (GPT-2, BERT) — a separate embedding table for positions 0..N
RoPE (Rotary, used by Llama, Mistral) — encodes position as a rotation in embedding space, enabling extrapolation to longer sequences than trained on
ALiBi (Attention with Linear Biases) — adds a linear penalty to attention scores based on distance, implicitly creating position awareness

3. How Models Are Trained

3.1 The Pipeline

Data Collection

Crawl the internet: Common Crawl (petabytes), books, code (GitHub), Wikipedia, academic papers

Data Cleaning & Filtering

Deduplication, quality filtering (perplexity scoring), toxic content removal, PII redaction

Tokenization

Train a BPE tokenizer on the corpus, then encode all text into token sequences

Pre-training (Next Token Prediction)

Train the transformer on trillions of tokens to predict the next token. Months on thousands of GPUs.

Supervised Fine-Tuning (SFT)

Train on curated (prompt, ideal_response) pairs to teach conversational format and instruction-following

RLHF / DPO Alignment

Use human preference data to steer the model toward helpful, harmless, and honest responses

3.2 Pre-training — The Big Computation

The core pre-training objective is deceptively simple: predict the next token. Given a sequence of tokens $t_1, t_2, \dots, t_{n-1}$, maximise:

$\mathcal{L} = -\sum_{i=1}^{N} \log P(t_i | t_1, \dots, t_{i-1}; \theta)$

This is the cross-entropy loss. The entire model — billions of parameters $\theta$ — is optimised with AdamW gradient descent over trillions of tokens. The magic is that to predict the next token well, the model must implicitly learn grammar, facts, reasoning patterns, coding conventions, and more.

3.3 The Scale of Training

📊 Interactive: Training Cost Breakdown

How much does it cost to train a frontier model? Approximate breakdown for a hypothetical 400B model.

3.4 Distributed Training

No single GPU can hold a frontier model. Training is distributed across thousands of GPUs using parallelism strategies:

Data Parallelism

Each GPU has a copy of the model, processes different data batches, and gradients are averaged across GPUs. Simple but memory-hungry.

Tensor Parallelism

Split individual matrix multiplications across GPUs. A single layer's computation is shared between devices. Requires fast interconnect (NVLink).

Pipeline Parallelism

Different layers live on different GPUs. Data flows through a pipeline, with micro-batching to keep all GPUs busy.

Modern training frameworks (Megatron-LM, DeepSpeed, FSDP) combine all three. The Llama 3 405B was trained on 16,384 H100 GPUs simultaneously — an engineering feat as impressive as the model architecture itself.

3.5 Chinchilla Scaling Laws

DeepMind's Chinchilla paper (Hoffmann et al., 2022) showed that many large models were under-trained: given a fixed compute budget, it's better to train a smaller model on more data than a larger model on less data. The optimal ratio:

$\text{Optimal tokens} \approx 20 \times \text{parameters}$

A 70B model should train on ~1.4 trillion tokens. This shifted the industry toward higher-quality, more abundant data rather than just bigger models.

4. From Base Model to Assistant

A base model (pre-trained only on next-token prediction) is like a brilliant but uncooperative savant. It can complete text — but it won't follow instructions, answer questions, or refuse harmful requests. It just predicts what comes next in the training distribution.

Fine-tuning transforms it into a useful assistant.

4.1 Supervised Fine-Tuning (SFT)

Train the model on high-quality (instruction, response) pairs. Tens of thousands of examples covering diverse tasks: coding, math, creative writing, summarisation, multi-turn conversation. The model learns the format of helpful interaction.

"SFT teaches the model the form. RLHF teaches it the substance." — Common saying in the alignment community

4.2 LoRA — Low-Rank Adaptation

Full fine-tuning updates all parameters — expensive and requires as much memory as pre-training. LoRA (Hu et al., 2021) freezes the pre-trained weights and injects small trainable low-rank matrices into each layer:

$\mathbf{W}' = \mathbf{W} + \Delta\mathbf{W} = \mathbf{W} + \mathbf{B}\mathbf{A}$

Where $\mathbf{W} \in \mathbb{R}^{d \times d}$ is frozen, and $\mathbf{B} \in \mathbb{R}^{d \times r}$, $\mathbf{A} \in \mathbb{R}^{r \times d}$ with rank $r \ll d$ (typically 4–64). This means updating only $2dr$ parameters instead of $d^2$ — often 0.1% of the total.

📊 Interactive: LoRA Rank vs Parameters

See how LoRA rank affects the number of trainable parameters. For a model with $d = 4096$ hidden dimension (like Llama 7B), vary the rank to see the tradeoff.

LoRA Rank (r): 16 Hidden dim (d): 4096

4.3 RLHF and DPO

After SFT, the model follows instructions but may still produce harmful, biased, or unhelpful responses. RLHF (Reinforcement Learning from Human Feedback) aligns it with human preferences:

Collect comparisons — humans rank multiple model responses to the same prompt
Train a reward model — a neural network that predicts human preference scores
Optimise with PPO — use RL to make the LLM generate responses that score high on the reward model, with a KL penalty to prevent drifting too far from the SFT model

DPO (Direct Preference Optimisation, Rafailov et al., 2023) simplifies this: skip the reward model entirely and directly optimise the policy from preference pairs, treating it as a classification problem:

$\mathcal{L}_{\text{DPO}} = -\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)$

Where $y_w$ is the preferred response and $y_l$ the dispreferred one. Simpler, more stable, and increasingly the default over PPO-based RLHF.

5. The Art of Serving Models

Training a model is expensive but happens once. Inference — running the model to generate responses — happens billions of times. Making inference fast, cheap, and memory-efficient is where much of the engineering effort happens.

5.1 The KV-Cache

Autoregressive generation produces one token at a time. Naively, each new token requires recomputing attention over all previous tokens — $O(n^2)$ per token. The KV-cache stores the Key and Value vectors from previous tokens so they don't need to be recomputed:

Prefill phase — process the entire prompt in parallel, compute all K/V vectors, store them
Decode phase — for each new token, compute only the new Q/K/V, attend to cached K/V

This turns generation from $O(n^2)$ to $O(n)$ per token — but the cache consumes memory. For a 70B model with 128K context, the KV-cache can be 50+ GB.

⚡ Interactive: KV-Cache Visualisation

Watch how the KV-cache grows as tokens are generated. Each row is an attention layer, each cell is a cached (K,V) pair for a token position.

Layers to show: 4

Tokens generated: 0 · Cache entries: 0 ·

5.2 Quantization

Pre-trained weights are stored as 16-bit floating-point (FP16/BF16). Quantization reduces precision to 8-bit, 4-bit, or even 2-bit integers, dramatically shrinking model size and speeding up inference:

Precision	Memory (70B model)	Quality Loss	Speed
FP16 (baseline)	~140 GB	None	1×
INT8 (GPTQ/AWQ)	~70 GB	Minimal	~1.5×
INT4 (GGUF Q4)	~35 GB	Small	~2×
INT2/3 (extreme)	~18 GB	Noticeable	~3×

📊 Interactive: Quantization Impact

See how reducing bit-width affects the representation of a weight value. The original FP16 value gets rounded to fewer levels, introducing quantization error.

Original value: 1.47

5.3 Speculative Decoding

Speculative decoding uses a small, fast "draft" model to generate several candidate tokens ahead, then the large model verifies them in one parallel forward pass. If the draft tokens are correct (they usually are for common text), you get multiple tokens for the cost of one large-model inference. Speed-ups of 2–3× are common.

5.4 Other Optimisations

Flash Attention (Dao et al., 2022) — reorders attention computation to minimise GPU memory reads/writes (IO-aware), achieving 2–4× speedup and enabling much longer contexts
Paged Attention (vLLM) — manages KV-cache memory like OS virtual memory with paging, eliminating fragmentation and enabling higher batch throughput
Mixture of Experts (MoE) — only activate a subset of parameters per token. Mixtral 8×7B has 47B total params but only 13B active per token — nearly as fast as a 13B model with near-47B quality
Continuous batching — dynamically add/remove requests from a batch as they finish, maximising GPU utilisation

6. Crafting Effective Prompts

The same model can produce wildly different outputs depending on how you ask. Prompt engineering is the practice of designing inputs that reliably elicit the desired behaviour — without changing the model's weights.

6.1 Key Techniques

System Prompts

Set the model's persona, constraints, and output format upfront. "You are a helpful coding assistant. Always include code examples. Never execute destructive commands."

Few-Shot Examples

Show the model 2–5 input/output examples before your actual question. The model learns the pattern and applies it to new inputs.

Chain-of-Thought

Add "Think step by step" or provide a worked example with reasoning. Dramatically improves math, logic, and complex task performance.

Structured Output

Request JSON, XML, or specific formats. Modern models support JSON mode and function calling for reliable structured output.

🎯 Interactive: Prompt Engineering Playground

Compare how different prompt strategies affect model output quality. Select a task and prompting approach to see simulated results.

6.2 Temperature and Sampling

When the model produces a probability distribution over the next token, the sampling strategy determines which token is actually chosen:

Temperature ($T$) — divides logits by $T$ before softmax. $T \to 0$ makes output deterministic (always pick the highest-probability token). $T > 1$ makes it more random.
Top-p (nucleus sampling) — only sample from the smallest set of tokens whose cumulative probability exceeds $p$. E.g., top-p=0.9 ignores the unlikely 10% tail.
Top-k — only consider the $k$ most probable tokens.

$P(t_i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$

For factual tasks, use low temperature (0.0–0.3). For creative writing, try 0.7–1.0. For brainstorming, go up to 1.2+.

7. Beyond Text

The transformer architecture is remarkably general — it processes sequences of tokens, and those tokens don't have to be text. By converting images, audio, and video into token-like representations, the same architecture handles multiple modalities.

7.1 Vision Transformers (ViT)

ViT (Dosovitskiy et al., 2020) splits an image into 16×16 patches, flattens each patch into a vector, and processes them through a standard transformer. The patch embeddings are the image's "tokens."

For multimodal LLMs (GPT-4V, Claude 3, Gemini), a vision encoder (often a CLIP ViT) converts the image into a sequence of embedding vectors that are concatenated with the text token embeddings and fed into the LLM together.

7.2 CLIP — Connecting Vision and Language

CLIP (Contrastive Language-Image Pre-training, OpenAI 2021) trains two encoders — one for images, one for text — so that matching image-text pairs have similar embeddings while non-matching pairs are pushed apart:

$\mathcal{L} = -\log \frac{\exp(\text{sim}(\mathbf{v}_i, \mathbf{t}_i) / \tau)}{\sum_j \exp(\text{sim}(\mathbf{v}_i, \mathbf{t}_j) / \tau)}$

This creates a shared embedding space where "a photo of a dog" and an actual dog photo are nearby. CLIP powers image search, zero-shot classification, and image generation guidance.

7.3 Audio and Speech

Whisper (OpenAI, 2022) — encoder-decoder transformer trained on 680K hours of multilingual audio. Transcribes speech in 99 languages with near-human accuracy.
Audio tokens — models like AudioLM and MusicGen convert audio into discrete tokens using neural codecs (SoundStream, EnCodec), then model them as a token sequence.
GPT-4o real-time voice — processes audio natively (not speech→text→LLM→text→speech), enabling natural conversation with emotions, accents, and singing.

7.4 Video and Beyond

Video is just images over time. Sora (OpenAI, 2024) generates video by operating in a latent space-time patch representation — treating video as a 3D grid of tokens (spatial × temporal). This same approach scales to generate long, physically coherent scenes.

The trajectory is clear: everything becomes tokens. Text, images, audio, video, code, structured data, tool calls, robotic actions — all expressed as sequences that a unified transformer can process. This convergence is what makes "foundation models" truly foundational.

📊 Interactive: Multimodal Model Capabilities Timeline

The rapid expansion of what AI models can process and generate.

8. From Keystroke to Answer

Let's trace the complete journey when you type a question into a model:

Tokenization — your text is split into subword tokens by BPE. "What is photosynthesis?" → token IDs [3923, 374, 14767, 25247, 30]
Embedding + Positional Encoding — each token ID → 4096-dimensional vector, plus position information (RoPE rotation)
Prefill — all prompt tokens processed in parallel through N transformer layers. K/V vectors cached for every layer.
Decode Loop — for each output token:
- Compute Q for the new position, attend to cached K/V
- Pass through MLP layers
- Project to vocabulary logits (size 128K)
- Apply temperature and top-p sampling
- Output the selected token, cache its K/V
- Check for stop condition (EOS token, max length)
Detokenization — token IDs → text string, streamed to your screen

For a 70B model generating 100 tokens: ~7 trillion multiply-accumulate operations in the prefill phase, then ~70 billion per decode step. The KV-cache holds ~2GB for a 4K context. The whole thing takes 2–5 seconds on 8 GPUs.

That's the engineering stack behind the magic.

9. Wrapping Up — The Four-Part Arc

Over four posts, we've traced the complete story of "How Does AI Think?":

Part 1 — Logic to Statistics: From Ramon Llull's mechanical reasoning (1275) through Babbage, Turing, LISP, search algorithms, expert systems, the AI winters, and the revival through linear regression and gradient descent.
Part 2 — Perception to Prediction: The deep learning revolution — perceptrons, backpropagation, CNNs, RNNs, word embeddings, attention, transformers, BERT, GPT, ChatGPT, and the age of foundation models.
Part 3 — Action to Understanding: Reinforcement learning, game-playing AI, agents, tool use, chain-of-thought reasoning, the open-source revolution, alignment, interpretability, and the road to AGI.
Part 4 — Under the Hood: Tokenization, embeddings, the training pipeline, fine-tuning with LoRA and RLHF, inference optimisation, prompt engineering, and multimodal AI — the engineering that makes it all work.

The AI field is moving faster than any technology in human history. By the time you finish reading this, there's probably a new model release, a new benchmark record, or a new open-source breakthrough. But the fundamentals — tokens, transformers, training, and the interplay of scale and architecture — will remain relevant far longer than any individual model name.

The best way to understand AI isn't just to read about it. It's to build with it. Run Ollama on your laptop. Fine-tune a small model with LoRA. Write a RAG pipeline. Build an agent. The tools have never been more accessible.

The future of AI is being built right now — and you can be part of it.

Read Part 5: AI in the Wild — Applications & the Developer's Toolkit →