How Does AI Think? Part 2 — From Neural Nets to Transformers
📖 This is Part 2 of the "How Does AI Think?" series. ← Read Part 1: From Logic Machines to Linear Regression
Introduction — Picking Up Where We Left Off
In Part 1, we traced AI from 1275 to the late 1990s: mechanical reasoning, formal logic, symbolic AI, search algorithms, expert systems, and the statistical turn that introduced linear regression and gradient descent. We ended at the threshold of a revolution.
This post continues the story from 1998 to the present: multi-layer neural networks trained with backpropagation, the GPU-powered breakthrough of deep learning, convolutional nets that see, recurrent nets that remember, attention mechanisms that focus, and transformers — the architecture behind GPT, BERT, and every foundation model shaping the world today.
As before, you'll find interactive demos: visualise how CNN filters extract features, explore attention heatmaps, predict the next token, and watch a neural network learn in real time.
1. The Multi-Layer Perceptron — Stacking Neurons
A single perceptron draws one decision line. Stack them in layers and you get a Multi-Layer Perceptron (MLP) — also called a feedforward neural network. Each layer transforms the input through weights, biases, and a non-linear activation function, passing results forward to the next layer.
1.1 Architecture
A typical MLP has:
- Input layer — one neuron per feature (e.g., 784 for a 28×28 greyscale image)
- Hidden layers — 1 to many; each neuron computes $z = \sum w_i x_i + b$, then applies an activation $a = \sigma(z)$
- Output layer — one neuron per class (for classification), often with softmax: $p_k = \frac{e^{z_k}}{\sum_j e^{z_j}}$
Simple MLP: 4 → 6 → 6 → 3
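As a rough sketch of the forward pass described above, the snippet below builds the 4 → 6 → 6 → 3 network in plain Python. The helper names (`dense`, `init_layer`, `mlp_forward`), the ReLU hidden activation, and the uniform initialisation range are illustrative assumptions rather than anything the architecture prescribes.

```python
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def dense(x, W, b):
    """One layer: z_j = sum_i W[j][i] * x[i] + b[j]."""
    return [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]

def init_layer(n_in, n_out, rng):
    # Small random weights and zero biases (an arbitrary choice for illustration).
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def mlp_forward(x, layers):
    # ReLU on the hidden layers, softmax on the output layer.
    for W, b in layers[:-1]:
        x = relu(dense(x, W, b))
    W, b = layers[-1]
    return softmax(dense(x, W, b))

rng = random.Random(0)
sizes = [4, 6, 6, 3]
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
probs = mlp_forward([0.2, -1.0, 0.5, 0.7], layers)
print(probs)  # three class probabilities that sum to 1
```

With untrained weights the probabilities are meaningless; training (next sections) is what shapes them.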
1.2 Activation Functions — Why Non-linearity Matters
Without non-linear activations, any stack of linear layers collapses to a single linear transformation — you'd just get a fancier linear regression. The activation function introduces non-linearity, allowing the network to learn curves, edges, and complex patterns.
Key activation functions through history:
- Sigmoid (1990s default): $\sigma(z) = \frac{1}{1 + e^{-z}}$ — squashes to (0, 1), but gradients vanish for large |z|
- Tanh: $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$ — zero-centred (better than sigmoid), but same vanishing gradient problem
- ReLU (Nair & Hinton, 2010): $\text{ReLU}(z) = \max(0, z)$ — simple, sparse, and much faster to train than saturating activations (the AlexNet paper reported roughly 6× faster convergence than tanh). The modern default.
- GELU (Hendrycks & Gimpel, 2016): $\text{GELU}(z) = z \cdot \Phi(z)$ — smooth approximation of ReLU, used in BERT and GPT.
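To make the list above concrete, here is a minimal sketch of the four activations and some of their derivatives, using only the standard library. The function names are my own; GELU is written with the exact normal CDF via `math.erf`, which is one of several common formulations.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2

def relu(z):
    return max(0.0, z)

def d_relu(z):
    return 1.0 if z > 0 else 0.0

def gelu(z):
    # Phi(z) is the standard normal CDF, expressed with the error function.
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for z in (-6.0, 0.0, 6.0):
    print(z, d_sigmoid(z), d_tanh(z), d_relu(z), gelu(z))
```

The sigmoid derivative peaks at 0.25 (at z = 0) and is tiny at |z| = 6, while the ReLU derivative stays at 1 for any positive input.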
📐 Interactive: Activation Function Explorer
See how different activation functions transform input values. The plot shows f(z) and its derivative f'(z).
1.3 Backpropagation — The Chain Rule at Scale
Training requires computing how each weight in the network contributes to the overall loss. Backpropagation does this efficiently by applying the chain rule of calculus backwards through the network.
For a network with loss $L$, output $\hat{y}$, hidden activation $h$ and weight $w$:
$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial w}$
This decomposition means we only need to compute each partial once and reuse it: the total cost is $O(n)$, where $n$ is the number of weights, compared to $O(n^2)$ for naive numerical differentiation, which needs a separate $O(n)$ forward pass for every weight. Without backprop, training deep networks would be computationally infeasible.
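The chain rule above can be checked on a deliberately tiny network. This sketch assumes one hidden unit $h = \sigma(w_1 x)$, a linear output $\hat{y} = w_2 h$, and a squared-error loss; those choices, and all names, are illustrative. The analytic gradients are compared against a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w1, w2, x, y):
    h = sigmoid(w1 * x)
    y_hat = w2 * h
    return 0.5 * (y_hat - y) ** 2

def backprop(w1, w2, x, y):
    # Forward pass, keeping intermediates so they can be reused.
    h = sigmoid(w1 * x)
    y_hat = w2 * h
    # Backward pass: one local derivative per step of the chain.
    dL_dyhat = y_hat - y
    dL_dw2 = dL_dyhat * h
    dyhat_dh = w2
    dh_dw1 = h * (1.0 - h) * x
    dL_dw1 = dL_dyhat * dyhat_dh * dh_dw1
    return dL_dw1, dL_dw2

def numeric_grad(f, w, eps=1e-6):
    # Central finite difference: two extra forward passes per weight.
    return (f(w + eps) - f(w - eps)) / (2 * eps)

x, y, w1, w2 = 2.0, 1.0, 0.5, -1.5
g1, g2 = backprop(w1, w2, x, y)
n1 = numeric_grad(lambda w: loss(w, w2, x, y), w1)
n2 = numeric_grad(lambda w: loss(w1, w, x, y), w2)
print(g1, n1)
print(g2, n2)
```

Note that `dL_dyhat` is computed once and shared by both gradients, which is the reuse that keeps backprop cheap.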
1.4 The Vanishing Gradient Problem
With sigmoid and tanh, gradients get multiplied through each layer during backprop. Since sigmoid's derivative peaks at 0.25, after 10 layers the gradient shrinks by a factor of $0.25^{10} \approx 10^{-6}$ — the early layers barely learn at all. This is the vanishing gradient problem, and it kept deep networks stuck at 2–3 effective layers for over a decade.
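The arithmetic is easy to reproduce. The sketch below only multiplies the largest possible sigmoid derivative by itself once per layer; it ignores the weight matrices, so treat it as a simplified upper bound on the activation contribution, not a full gradient calculation.

```python
def gradient_scale(depth, max_derivative=0.25):
    """Upper bound on the product of `depth` activation derivatives."""
    return max_derivative ** depth

for depth in (1, 5, 10, 20):
    print(f"{depth:>2} layers: gradient scaled by at most {gradient_scale(depth):.2e}")

# For comparison: ReLU has derivative 1 on active units, so nothing shrinks.
relu_scale = gradient_scale(10, max_derivative=1.0)
```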
Solutions came from multiple directions:
- ReLU (2010) — gradient is 1 for positive inputs, avoiding the vanishing issue
- Batch Normalisation (Ioffe & Szegedy, 2015) — normalises layer inputs to zero mean and unit variance over each mini-batch, stabilising gradients
- Skip connections / Residual learning (He et al., 2015) — allows gradients to flow directly through identity shortcuts
- Better initialisations — Xavier (2010), Kaiming/He (2015) — set initial weights to preserve variance across layers
2. Convolutional Neural Networks — Teaching Machines to See
2.1 The Inspiration — Hubel & Wiesel (1959)
Neurophysiologists David Hubel and Torsten Wiesel discovered that cat visual cortex neurons respond to specific oriented edges in specific parts of the visual field. "Simple cells" detect edges; "complex cells" respond regardless of exact position. This hierarchy — local features → invariant representations — inspired the design of CNNs.
2.2 The Convolution Operation
Instead of connecting every pixel to every neuron (fully connected), a CNN slides a small filter (e.g. 3×3) across the image, computing a dot product at each position:
$(I * K)[i,j] = \sum_{m}\sum_{n} I[i+m,\, j+n] \cdot K[m,n]$
This has three critical advantages:
- Parameter sharing — the same 3×3 filter is used everywhere, so a layer with a 3×3 filter has only 9 weights (+ bias) instead of millions
- Translation equivariance — a cat in the top-left activates the same filter as a cat in the bottom-right
- Locality — early layers learn local features (edges, textures); deeper layers compose them into higher-level concepts (eyes, faces, objects)
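Here is a minimal sketch of the sliding-window formula above, with no padding and stride 1. The toy image and the choice of a Sobel-style vertical-edge kernel are assumptions made for illustration.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` and take a dot product at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n]
                for m in range(kh) for n in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Toy 5x6 image: dark on the left (0), bright on the right (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]
# Sobel-style kernel that responds to left-to-right brightness increases.
sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

feature_map = conv2d(image, sobel_x)
for row in feature_map:
    print(row)  # each row is [0, 4, 4, 0]
```

The same nine weights are reused at every position, and the response appears only where the window straddles the edge.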
🔍 Interactive: CNN Filter Visualiser
See what common 3×3 convolution filters detect when applied to a sample image. Each filter highlights different features — edges, corners, blurs.
2.3 LeNet-5 — The Pioneer (1998)
Yann LeCun and colleagues (Bell Labs) designed LeNet-5 for handwritten digit recognition (ZIP codes for the US Postal Service). The architecture:
- Input: 32×32 greyscale image
- Conv layer → 6 feature maps (5×5 filters) → average pooling
- Conv layer → 16 feature maps (5×5 filters) → average pooling
- 3 fully connected layers → 10 outputs (digits 0–9)
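The spatial sizes through this stack can be worked out with a few lines of arithmetic. The sketch assumes unpadded stride-1 convolutions, 2×2 pooling with stride 2, and a single input channel; the list above does not state these, so treat them as assumptions.

```python
def conv_out(size, kernel, stride=1, padding=0):
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window=2):
    return size // window

size = 32
size = conv_out(size, 5)   # 28 after the first 5x5 convolution
size = pool_out(size)      # 14 after pooling
size = conv_out(size, 5)   # 10 after the second 5x5 convolution
size = pool_out(size)      # 5 after pooling
flat = 16 * size * size    # values fed into the fully connected layers

# Weights in the first conv layer: 6 filters, each 5x5x1 plus one bias.
c1_params = 6 * (5 * 5 * 1 + 1)
print(size, flat, c1_params)
```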
LeNet-5 achieved 99.2% accuracy on MNIST and was deployed by US banks to read cheques, processing millions per day by the early 2000s. It proved CNNs could work in production, but the limited data and compute of the era prevented scaling to harder tasks.
2.4 AlexNet — The Moment Everything Changed (2012)
In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) with AlexNet and won by more than 10.8 percentage points over the runner-up (top-5 error: 15.3% vs 26.2%).
The secret ingredients:
- GPUs — trained on two GTX 580 GPUs with 3GB VRAM each; 5–6 days of training
- ReLU instead of tanh or sigmoid — roughly 6× faster convergence than an equivalent tanh network in the authors' CIFAR-10 experiment
- Dropout — randomly zeroing 50% of neurons during training to prevent co-adaptation
- Data augmentation — random crops, flips, colour jittering for 1.2 million images
- Scale — 60 million parameters, 5 conv layers + 3 FC layers
AlexNet didn't invent any single technique, but it proved that scale + GPUs + deep CNNs could demolish hand-crafted feature engineering. The entire computer vision field pivoted to deep learning almost overnight.
📊 Interactive: ImageNet Error Rate Over Time
Watch the top-5 error rate plummet as architectures got deeper and more sophisticated. Human-level performance (~5.1%) was surpassed in 2015.
2.5 The Deeper We Go — VGG, Inception, ResNet
VGGNet (2014)
Showed that simply stacking 3×3 convolutions very deep (16–19 layers) with lots of parameters (138M) was competitive. Elegant but computationally expensive.
GoogLeNet / Inception (2014)
Used "Inception modules": parallel convolutions at different scales (1×1, 3×3, 5×5) whose outputs are concatenated. 22 layers, but only about 5M parameters, roughly 12× fewer than AlexNet and far fewer than VGG.
ResNet (2015)
Introduced skip connections: $y = F(x) + x$. Instead of learning a new function, each block learns a residual — the difference from identity. Enabled training of 152-layer (and even 1000+ layer) networks. Won ImageNet 2015 with 3.57% error.
EfficientNet (2019)
Systematically scaled depth, width, and resolution together using a compound coefficient. Achieved better accuracy than prior models with 8.4× fewer parameters.
3. Recurrent Neural Networks — Memory in Networks
3.1 The Basic RNN
CNNs excel at spatial patterns (images), but language and time-series have a sequential structure. A Recurrent Neural Network (RNN) processes one element at a time, maintaining a hidden state that carries information from previous steps:
$h_t = \sigma(W_h h_{t-1} + W_x x_t + b)$
At each step, the hidden state $h_t$ is a function of the current input $x_t$ and the previous hidden state $h_{t-1}$. This creates a "loop" that in principle allows the network to remember information from arbitrarily far in the past.
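A minimal sketch of this recurrence follows. It uses tanh as the non-linearity (a common choice standing in for the generic $\sigma$ above), and the weights are arbitrary hand-picked values, not trained ones.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(h_prev, x_t, W_h, W_x, b):
    # h_t = tanh(W_h h_{t-1} + W_x x_t + b)
    pre = [a + c + d for a, c, d in zip(matvec(W_h, h_prev), matvec(W_x, x_t), b)]
    return [math.tanh(p) for p in pre]

# Arbitrary illustrative weights: hidden size 2, input size 1.
W_h = [[0.5, -0.3], [0.2, 0.8]]
W_x = [[1.0], [-1.0]]
b = [0.0, 0.0]

def run(sequence):
    h = [0.0, 0.0]
    for x in sequence:
        h = rnn_step(h, [x], W_h, W_x, b)
    return h

final_a = run([1.0, 0.0])
final_b = run([0.0, 1.0])
print(final_a, final_b)
```

The two sequences contain the same values in a different order and end in different hidden states, which is the sense in which the state carries history.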
In practice, basic RNNs suffer from the vanishing gradient problem even more acutely than feedforward networks: gradients shrink exponentially over long sequences, making it impossible to learn dependencies beyond ~20 steps.
3.2 LSTM — Long Short-Term Memory (Hochreiter & Schmidhuber, 1997)
The LSTM solves this with a gating mechanism: three gates (forget, input, output) control what information to keep, add, or expose from a cell state $C_t$ that flows through time like a conveyor belt:
- Forget gate: $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$ — what to erase from memory
- Input gate: $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$ — what new info to store
- Cell update: $C_t = f_t \odot C_{t-1} + i_t \odot \tanh(W_c [h_{t-1}, x_t] + b_c)$
- Output gate: $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$; $h_t = o_t \odot \tanh(C_t)$
The key insight: the cell state $C_t$ is updated with addition (not multiplication), so gradients can flow unchanged through many timesteps. This lets LSTMs learn dependencies over hundreds of steps — enabling breakthroughs in machine translation, speech recognition, and text generation.
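The gate equations above translate fairly directly into code. This is a sketch with hidden size 1 and hand-set parameters (forget gate pinned open, input gate pinned shut) chosen only to show the "conveyor belt" effect; the parameter dictionary layout is my own convention.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(h_prev, c_prev, x_t, p):
    hx = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f = [sigmoid(z) for z in affine(p["W_f"], hx, p["b_f"])]    # forget gate
    i = [sigmoid(z) for z in affine(p["W_i"], hx, p["b_i"])]    # input gate
    o = [sigmoid(z) for z in affine(p["W_o"], hx, p["b_o"])]    # output gate
    g = [math.tanh(z) for z in affine(p["W_c"], hx, p["b_c"])]  # candidate values
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

zeros = [[0.0, 0.0]]
params = {
    "W_f": zeros, "b_f": [10.0],   # forget gate close to 1: keep memory
    "W_i": zeros, "b_i": [-10.0],  # input gate close to 0: write nothing
    "W_o": zeros, "b_o": [0.0],
    "W_c": zeros, "b_c": [0.0],
}

h, c = [0.0], [1.0]
for _ in range(100):
    h, c = lstm_step(h, c, [0.0], params)
print(c)  # still close to the initial value of 1.0 after 100 steps
```

In a trained LSTM the gates are learned functions of the input, so the network itself decides when to hold, overwrite, or expose the cell state.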
3.3 GRU — Gated Recurrent Unit (Cho et al., 2014)
A simplified variant that merges the forget and input gates into a single "update gate" and combines the cell and hidden states. Fewer parameters, comparable performance, and faster to train:
$z_t = \sigma(W_z [h_{t-1}, x_t]), \quad r_t = \sigma(W_r [h_{t-1}, x_t])$
$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tanh(W [r_t \odot h_{t-1}, x_t])$
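A sketch of one GRU step, following the two equations above term by term (no bias terms, as written). The weights are set to zero purely so the result is easy to verify by hand; names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(h_prev, x_t, W_z, W_r, W):
    hx = h_prev + x_t                                   # [h_{t-1}, x_t]
    z = [sigmoid(v) for v in matvec(W_z, hx)]           # update gate
    r = [sigmoid(v) for v in matvec(W_r, hx)]           # reset gate
    rhx = [ri * hi for ri, hi in zip(r, h_prev)] + x_t  # [r_t * h_{t-1}, x_t]
    h_tilde = [math.tanh(v) for v in matvec(W, rhx)]    # candidate state
    return [(1 - zi) * hp + zi * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]

# Zero weights: both gates equal 0.5 and the candidate state is 0,
# so the new state is halfway between the old state and 0.
W_zero = [[0.0, 0.0]]
h_new = gru_step([0.8], [1.0], W_zero, W_zero, W_zero)
print(h_new)
```

The update gate $z_t$ interpolates between keeping the old state and adopting the candidate, which is how a single gate covers the roles of the LSTM's forget and input gates.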
4. Word Embeddings — Meaning as Geometry
4.1 Word2Vec (Mikolov et al., 2013)
Before word embeddings, words were represented as one-hot vectors: "cat" = [0,0,...,1,...,0]. Every word was equally distant from every other word — no notion of similarity.
Word2Vec trained a shallow neural net to predict context words given a target (Skip-gram) or vice versa (CBOW). The result: each word gets a dense vector (~300 dimensions) where semantic similarity corresponds to geometric proximity, and analogies become vector arithmetic: king − man + woman ≈ queen.
This was astonishing: the network discovered (without supervision) that "royalty" and "gender" are independent axes in the learned vector space.
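The geometry can be illustrated with a toy example. The 2-D vectors below are hand-made, with one "royalty" axis and one "gender" axis; they are not real Word2Vec output, and real embeddings have hundreds of dimensions with no labelled axes.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: [royalty, gender]. Invented for illustration only.
vectors = {
    "king":   [0.9, 0.9],
    "queen":  [0.9, -0.9],
    "man":    [0.1, 0.9],
    "woman":  [0.1, -0.9],
    "prince": [0.8, 0.9],
    "apple":  [-0.5, 0.1],
}

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman")
print(result)  # queen
```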
4.2 GloVe (Pennington et al., 2014)
GloVe (Global Vectors) took a different approach: instead of predicting context via a neural net, it factorised the word co-occurrence matrix directly. The result was similar quality embeddings based on the insight that the ratio of co-occurrence probabilities encodes meaning.
4.3 The Limits of Fixed Embeddings
Word2Vec and GloVe assign each word one vector regardless of context. But "bank" means something different in "river bank" vs "savings bank." This limitation would be solved by contextual embeddings — which require the attention mechanism we're about to discuss.
5. The Attention Mechanism
5.1 The Bottleneck Problem
In sequence-to-sequence models (e.g. English→French translation), the encoder RNN compresses the entire input sentence into a single fixed-length vector. For long sentences, this bottleneck loses information catastrophically.
5.2 Attention (Bahdanau et al., 2014)
The solution: instead of one summary vector, let the decoder look back at all encoder hidden states and compute a weighted average, focusing on the most relevant parts for each output word.
At each decoding step $t$, we compute attention weights $\alpha_{t,s}$ over all source positions $s$:
$e_{t,s} = a(h_t^{\text{dec}}, h_s^{\text{enc}}), \quad \alpha_{t,s} = \frac{\exp(e_{t,s})}{\sum_{s'}\exp(e_{t,s'})}$
The context vector is then: $c_t = \sum_s \alpha_{t,s} \cdot h_s^{\text{enc}}$
This is the birth of attention: a soft, differentiable address lookup that tells the model "to generate this word, focus on these parts of the input."
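The two formulas above can be sketched in a few lines. Bahdanau et al. used a small learned network for the scoring function $a$; this sketch substitutes a plain dot product to stay short, and the encoder and decoder states are made-up numbers.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(dec_state, enc_states):
    # e_{t,s}: relevance of each source position (dot product as a stand-in for a).
    scores = [sum(d * e for d, e in zip(dec_state, h)) for h in enc_states]
    # alpha_{t,s}: normalised attention weights.
    weights = softmax(scores)
    # c_t: weighted average of the encoder states.
    dim = len(enc_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, enc_states)) for k in range(dim)]
    return weights, context

enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three source positions
dec_state = [2.0, 0.0]                             # current decoder state
weights, context = attend(dec_state, enc_states)
print(weights)
print(context)
```

Positions 0 and 2 align with the decoder state and receive most of the weight; position 1 is nearly ignored but still contributes, which is what makes the lookup "soft".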
🔥 Interactive: Attention Heatmap Visualiser
This shows a simulated attention pattern for an English→French translation. Brighter = higher attention weight. Click a target word to see which source words it attends to.
Source (English) across top · Target (French) down left
6. "Attention Is All You Need" — The Transformer (2017)
6.1 The Key Insight
Vaswani et al. (Google, 2017) asked: what if we get rid of recurrence entirely and build the entire model from attention layers?
RNNs process tokens sequentially — token 50 must wait for tokens 1–49 to finish. This makes them slow on modern parallel hardware (GPUs/TPUs). The Transformer processes all positions simultaneously using self-attention, achieving massive parallelism.
6.2 Self-Attention — Scaled Dot-Product
Each input token is projected into three vectors: Query (Q), Key (K), and Value (V) via learned linear projections. Attention is then:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Intuitively: each query asks "what should I attend to?", the keys say "here's what I contain," the dot product measures relevance, softmax normalises to a probability distribution, and the result is a weighted sum of values.
The $\sqrt{d_k}$ scaling prevents dot products from getting too large (which would push softmax into regions with vanishing gradients).
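Below is a minimal pure-Python sketch of the formula above, with matrices as lists of rows. The example queries, keys, and values are arbitrary; real implementations use batched tensor operations and learned projections.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    out = []
    for q in Q:
        # One row of QK^T / sqrt(d_k): this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

K = [[10.0, 0.0], [0.0, 10.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

focused = attention([[10.0, 0.0]], K, V)  # query matches the first key
uniform = attention([[0.0, 0.0]], K, V)   # query matches nothing in particular
print(focused)
print(uniform)
```

A query that strongly matches one key returns almost exactly that key's value, while an uninformative query returns the average of all values.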
6.3 Multi-Head Attention
Instead of one attention function, the Transformer runs $h$ attention heads in parallel, each with different learned projections. This lets the model attend to different types of relations simultaneously:
- Head 1 might learn syntactic relationships (subject ↔ verb)
- Head 2 might learn positional proximity
- Head 3 might learn coreference (pronouns ↔ referents)
$\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$
6.4 Positional Encoding
Since the Transformer has no recurrence or convolution, it has no inherent notion of order. Position is injected via positional encodings added to the input embeddings:
$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$
These sinusoidal functions allow the model to extrapolate to sequence lengths longer than those seen during training (in theory), and each dimension oscillates at a different frequency, creating a unique "fingerprint" for each position.
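The encoding is straightforward to compute directly from the formula. This sketch assumes an even model dimension `d_model`; the function name is illustrative.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position; d_model is assumed to be even."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i + 1
    return pe

pe0 = positional_encoding(0, 8)
pe1 = positional_encoding(1, 8)
pe2 = positional_encoding(2, 8)
print(pe0)
print(pe1)
```

Position 0 always encodes to alternating 0s and 1s; later positions rotate each sine/cosine pair at its own frequency, fastest in the first dimensions and slowest in the last.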
6.5 The Full Transformer Architecture
The complete model stacks $N$ identical layers (typically 6–96), each containing:
- Multi-head self-attention
- Layer normalisation
- Position-wise feed-forward network (two linear layers with ReLU/GELU)
- Residual connections around each sub-layer
For encoder-decoder models (like the original Transformer for translation), the decoder additionally has cross-attention layers attending to encoder outputs, and causal masking to prevent attending to future tokens during generation.
7. The Foundation Model Era (2018–Present)
7.1 BERT — Bidirectional Representations (2018)
Google's BERT (Bidirectional Encoder Representations from Transformers) used only the encoder half of the Transformer. It was pre-trained with two tasks:
- Masked Language Model (MLM) — randomly mask 15% of tokens and predict them from context: "The [MASK] sat on the mat" → "cat"
- Next Sentence Prediction (NSP) — given two sentences, predict whether the second follows the first
BERT's key insight: by attending in both directions simultaneously, each token's representation is context-dependent — solving the "bank" ambiguity problem. BERT-Large has 340M parameters and was fine-tuned to achieve state-of-the-art on 11 NLP benchmarks simultaneously.
7.2 GPT — Generative Pre-training (2018–2024)
OpenAI's GPT (Generative Pre-trained Transformer) used only the decoder half, trained autoregressively — predict the next token given all previous tokens:
$P(x_t \mid x_1, x_2, \ldots, x_{t-1})$
The GPT series scaled dramatically:
| Model | Year | Parameters | Training Data |
|---|---|---|---|
| GPT-1 | 2018 | 117M | BookCorpus (5GB) |
| GPT-2 | 2019 | 1.5B | WebText (40GB) |
| GPT-3 | 2020 | 175B | Filtered internet (570GB) |
| GPT-4 | 2023 | ~1.8T (rumoured MoE) | Undisclosed (multi-modal) |
7.3 The Scaling Hypothesis
Kaplan et al. (2020) published "Scaling Laws for Neural Language Models," showing that performance improves predictably as a power law of model size, dataset size, and compute — with no sign of saturation. This gave rise to the scaling hypothesis: maybe you don't need new architectures, just more scale.
📈 Interactive: Scaling Laws Visualiser
Drag the sliders to see how model performance (loss) changes with parameters and data. The relationship is a power law: $L \approx a \cdot N^{-\alpha}$ (roughly).
7.4 Instruction Tuning & RLHF
Raw GPT models are trained to predict the next token — they're excellent at completing text but don't naturally follow instructions or avoid harmful outputs. Two techniques bridge this gap:
- Instruction tuning (FLAN, InstructGPT) — fine-tune on (instruction, response) pairs curated by human annotators
- RLHF — Reinforcement Learning from Human Feedback — train a reward model from human preference comparisons, then use PPO (Proximal Policy Optimisation) to fine-tune the language model to maximise the predicted reward. This is what made ChatGPT dramatically more useful and aligned than raw GPT-3.
7.5 ChatGPT & The Mainstream Moment (Nov 2022)
ChatGPT (GPT-3.5 + RLHF) reached 100 million users in 2 months — the fastest-growing consumer application in history. It proved that transformer-based language models, properly aligned, could serve as general-purpose AI assistants for coding, writing, analysis, education, and creative work.
8. Beyond Language — Multi-Modal Models
8.1 Vision Transformers — ViT (2020)
Dosovitskiy et al. showed that you can chop an image into 16×16 patches, flatten them, add positional embeddings, and feed them into a standard Transformer encoder — no convolutions needed. With enough data (JFT-300M), ViT matched or surpassed CNNs on ImageNet. The Transformer had conquered vision too.
8.2 DALL-E, Stable Diffusion & Image Generation
DALL-E (OpenAI, 2021) used a transformer to generate images from text prompts. Diffusion models (DDPM, 2020; Stable Diffusion, 2022) learn to reverse a noise process:
$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta(x_t, t)\right) + \sigma_t z$
Starting from pure noise, the model iteratively denoises to produce photorealistic images. Combined with CLIP-guided text conditioning, this enabled "text-to-image" generation that stunned the world.
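The update rule above can be sanity-checked on a single step. This sketch assumes the standard DDPM forward process $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ (not stated above) and replaces the trained network $\epsilon_\theta$ with the true noise, so it demonstrates the algebra rather than a real model.

```python
import math
import random

def forward_noise(x0, eps, alpha_bar_t):
    # Assumed forward process: mix the clean signal with Gaussian noise.
    return [math.sqrt(alpha_bar_t) * x + math.sqrt(1 - alpha_bar_t) * e
            for x, e in zip(x0, eps)]

def reverse_step(x_t, eps_pred, alpha_t, alpha_bar_t, sigma_t, rng):
    # One denoising step, following the update rule term by term.
    coef = (1 - alpha_t) / math.sqrt(1 - alpha_bar_t)
    return [(x - coef * e) / math.sqrt(alpha_t) + sigma_t * rng.gauss(0.0, 1.0)
            for x, e in zip(x_t, eps_pred)]

rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]
eps = [rng.gauss(0.0, 1.0) for _ in x0]

alpha_1 = 0.9  # at t = 1, alpha_bar_1 equals alpha_1
x1 = forward_noise(x0, eps, alpha_1)
# "Oracle" prediction: pass the true noise in place of a trained eps_theta.
x0_rec = reverse_step(x1, eps, alpha_1, alpha_1, 0.0, rng)
print(x0_rec)
```

With a perfect noise prediction and no added sampling noise, the single step recovers the original values; a real sampler repeats this over many steps with an imperfect learned predictor.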
8.3 Multi-Modal Foundation Models
GPT-4V, Gemini, and Claude can process both text and images (and increasingly audio, video, and code) within a single model. The convergence is clear: the Transformer architecture is becoming a universal computation engine for all modalities of information.
9. Next-Token Prediction — The Surprising Power of Simplicity
The core training objective of GPT is almost absurdly simple: given a sequence of tokens, predict the next one. Minimise cross-entropy loss:
$\mathcal{L} = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_1, \ldots, x_{t-1})$
And yet this objective, scaled to trillions of tokens and hundreds of billions of parameters, produces systems that can write code, solve math, explain concepts, translate languages, and reason about the world. The mystery of emergent capabilities from scale remains one of the deepest open questions in AI.
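To make the objective concrete, here is a toy next-token predictor built from bigram counts, with the cross-entropy loss computed over a short sequence. A bigram model conditions on only one previous token, so it is a drastic simplification of a transformer; the corpus and function names are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_bigram(tokens):
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

def cross_entropy(counts, tokens):
    # Negative log-likelihood of each token given the one before it.
    # Raises KeyError for pairs never seen in training (no smoothing here).
    return -sum(math.log(next_token_probs(counts, p)[n])
                for p, n in zip(tokens, tokens[1:]))

corpus = "the cat sat on the mat the cat ate".split()
model = train_bigram(corpus)

probs = next_token_probs(model, "the")
ce = cross_entropy(model, ["the", "cat"])
print(probs)  # "cat" gets 2/3 of the probability mass, "mat" gets 1/3
print(ce)
```

Training a language model means adjusting parameters so that this loss goes down on real text; here the "parameters" are just counts.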
🎯 Interactive: Next-Token Predictor
Type a prompt and see a simulated next-token prediction with probability bars. This uses a simple n-gram model (not a real transformer) to illustrate the concept.
10. The State of AI Today & What's Next
10.1 Current Frontiers
🤖 AI Agents
Models that can plan, use tools, browse the web, write & execute code, and take multi-step actions autonomously.
🧪 Reasoning
Chain-of-thought prompting, tree-of-thought search, and inference-time compute scaling (o1, DeepSeek-R1) are teaching models to "think step by step."
🌍 Multi-Modal
Models that natively handle text, images, audio, video, and code in a single architecture. The line between "NLP" and "CV" has dissolved.
⚡ Efficiency
Quantisation (4-bit, 2-bit), distillation, mixture-of-experts, and speculative decoding are making frontier models faster and cheaper.
10.2 Open Questions
11. Timeline — The Deep Learning Era
| Year | Milestone | Why It Matters |
|---|---|---|
| 1997 | LSTM | Learned long-range dependencies in sequences |
| 1998 | LeNet-5 | CNN for handwritten digit recognition, deployed commercially |
| 2006 | Deep Belief Networks | Hinton's pre-training unlocked deep networks |
| 2010 | ReLU activation | 6× faster training; became the default activation |
| 2012 | AlexNet wins ImageNet | Deep learning's Big Bang: 10.8pp improvement |
| 2013 | Word2Vec | Words as meaningful vectors; king − man + woman ≈ queen |
| 2014 | GRU / Attention / GANs | Simplified recurrence; attention for translation; generative adversarial nets |
| 2015 | ResNet / Batch Norm | Skip connections enabled 152+ layer networks |
| 2016 | AlphaGo beats Lee Sedol | Deep RL conquers Go (10^170 positions) |
| 2017 | Transformer | "Attention Is All You Need": the architecture that changed everything |
| 2018 | BERT / GPT-1 | Pre-training revolution: encode knowledge, then fine-tune |
| 2019 | GPT-2 / EfficientNet | 1.5B params; "too dangerous to release" (then released) |
| 2020 | GPT-3 / ViT / Scaling Laws | 175B params; transformers for vision; predictable improvement |
| 2021 | DALL-E / Codex / AlphaFold 2 | Text-to-image; code generation; protein folding solved |
| 2022 | ChatGPT / Stable Diffusion | AI goes mainstream: 100M users in 2 months |
| 2023 | GPT-4 / Claude / Gemini | Multi-modal frontier models; AI reasoning |
| 2024 | o1 / Agents / Open-source surge | Inference-time compute; Llama 3; DeepSeek; AI coding agents |
| 2025–26 | Agent era continues | Autonomous software engineering, scientific discovery, and beyond |
12. The Full Arc — From Llull to LLMs
Across two posts we've covered an extraordinary intellectual journey.
The thread connecting it all is a single question: can machines think?
At each era, the answer shifts. Llull would say "they can combine concepts." Babbage would say "they can compute." Turing would say "they can simulate anything computable." The neats would say "they can reason with rules." The scruffies would say "they can learn from data." The transformers would say "give me enough text and I'll learn… everything?"
We still don't have a definitive answer. But the tools are extraordinary, the progress is accelerating, and the story, 750 years and counting, is far from over.
Read Part 3: Agents, Reasoning & the Road to AGI →
Further Reading