Ivan Santoso - Blog/how-does-ai-think-part-2

📖 This is Part 2 of the "How Does AI Think?" series. ← Read Part 1: From Logic Machines to Linear Regression

Introduction — Picking Up Where We Left Off

In Part 1, we traced AI from 1275 to the late 1990s: mechanical reasoning, formal logic, symbolic AI, search algorithms, expert systems, and the statistical turn that introduced linear regression and gradient descent. We ended at the threshold of a revolution.

This post continues the story from 1998 to the present: multi-layer neural networks trained with backpropagation, the GPU-powered breakthrough of deep learning, convolutional nets that see, recurrent nets that remember, attention mechanisms that focus, and transformers, the architecture behind GPT, BERT, and most of the foundation models shaping the world today.

As before, you'll find interactive demos: visualise how CNN filters extract features, explore attention heatmaps, predict the next token, and watch a neural network learn in real time.

1. The Multi-Layer Perceptron — Stacking Neurons

A single perceptron draws one decision line. Stack them in layers and you get a Multi-Layer Perceptron (MLP) — also called a feedforward neural network. Each layer transforms the input through weights, biases, and a non-linear activation function, passing results forward to the next layer.

1.1 Architecture

A typical MLP has:

Input layer — one neuron per feature (e.g., 784 for a 28×28 greyscale image)
Hidden layers — 1 to many; each neuron computes $z = \sum w_i x_i + b$, then applies an activation $a = \sigma(z)$
Output layer — one neuron per class (for classification), often with softmax: $p_k = \frac{e^{z_k}}{\sum_j e^{z_j}}$

Simple MLP: 4 → 6 → 6 → 3
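To make the layer equations concrete, here is a minimal sketch of a forward pass through a 4 → 6 → 6 → 3 network in plain Python. The random initialisation range, the sigmoid hidden activation, and the helper names (`dense`, `init_layer`, `forward`) are illustrative assumptions, not the code behind the figure.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(z):
    """Numerically stable softmax: p_k = exp(z_k) / sum_j exp(z_j)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def dense(x, weights, biases, activation=None):
    """One layer: z_j = sum_i w_ji * x_i + b_j, then an optional activation."""
    z = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [activation(v) for v in z] if activation else z

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative, not a tuned scheme)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
sizes = [4, 6, 6, 3]
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

def forward(x):
    h = x
    for weights, biases in layers[:-1]:       # hidden layers
        h = dense(h, weights, biases, sigmoid)
    logits = dense(h, *layers[-1])            # output layer, no activation yet
    return softmax(logits)

probs = forward([0.2, -1.0, 0.5, 0.3])
print(probs)  # three class probabilities that sum to 1
```

The network is untrained, so the probabilities are arbitrary; the point is only the shape of the computation.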

1.2 Activation Functions — Why Non-linearity Matters

Without non-linear activations, any stack of linear layers collapses to a single linear transformation — you'd just get a fancier linear regression. The activation function introduces non-linearity, allowing the network to learn curves, edges, and complex patterns.

Key activation functions through history:

Sigmoid (1990s default): $\sigma(z) = \frac{1}{1 + e^{-z}}$ — squashes to (0, 1), but gradients vanish for large |z|
Tanh: $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$ — zero-centred (better than sigmoid), but same vanishing gradient problem
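ReLU (Nair & Hinton, 2010): $\text{ReLU}(z) = \max(0, z)$ — simple, sparse, and much faster to train than saturating activations (the AlexNet authors reported reaching the same training error about 6× faster than with tanh in a CIFAR-10 experiment). The modern default.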
GELU (Hendrycks & Gimpel, 2016): $\text{GELU}(z) = z \cdot \Phi(z)$ — smooth approximation of ReLU, used in BERT and GPT.
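The formulas in this list translate directly into code. The sketch below uses only the standard library; the `ACTIVATIONS` table and the sample points are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gelu(z):
    # Phi(z) is the standard normal CDF, written with the error function.
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Each entry maps a name to (function, derivative).
ACTIVATIONS = {
    "sigmoid": (sigmoid, lambda z: sigmoid(z) * (1.0 - sigmoid(z))),
    "tanh": (math.tanh, lambda z: 1.0 - math.tanh(z) ** 2),
    "relu": (lambda z: max(0.0, z), lambda z: 1.0 if z > 0 else 0.0),
}

for name, (f, df) in ACTIVATIONS.items():
    for z in (-5.0, 0.0, 5.0):
        print(f"{name:8s} z={z:+.1f}  f(z)={f(z):+.4f}  f'(z)={df(z):.4f}")
print(f"gelu     z=+1.0  f(z)={gelu(1.0):+.4f}")
```

Running it shows the sigmoid and tanh derivatives collapsing towards zero at |z| = 5, while the ReLU derivative stays at 1 for positive inputs.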

📐 Interactive: Activation Function Explorer

See how different activation functions transform input values. The plot shows f(z) and its derivative f'(z).

1.3 Backpropagation — The Chain Rule at Scale

Training requires computing how each weight in the network contributes to the overall loss. Backpropagation does this efficiently by applying the chain rule of calculus backwards through the network.

For a network with loss $L$, output $\hat{y}$, hidden activation $h$ and weight $w$:

$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial w}$

This decomposition means we only need to compute each partial once and reuse it — the total cost is $O(n)$ where $n$ is the number of weights, compared to $O(n^2)$ for naive numerical differentiation. Without backprop, training deep networks would be computationally infeasible.
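As a sanity check on the chain rule, here is a minimal sketch for a two-weight network, compared against a finite-difference estimate. The specific network ($h = \sigma(w_1 x)$, $\hat{y} = w_2 h$, squared-error loss) and the numeric values are assumptions chosen to keep the example small.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w1, w2, x, y):
    h = sigmoid(w1 * x)          # hidden activation
    y_hat = w2 * h               # output
    return 0.5 * (y_hat - y) ** 2

def backprop(w1, w2, x, y):
    """Analytic gradients via the chain rule, reusing shared partials."""
    h = sigmoid(w1 * x)
    y_hat = w2 * h
    dL_dyhat = y_hat - y                 # dL/dy_hat
    dyhat_dh = w2                        # dy_hat/dh
    dh_dw1 = h * (1.0 - h) * x           # dh/dw1 (sigmoid derivative times input)
    dL_dw1 = dL_dyhat * dyhat_dh * dh_dw1
    dL_dw2 = dL_dyhat * h                # reuses dL/dy_hat computed above
    return dL_dw1, dL_dw2

def numerical_gradient(f, w, eps=1e-6):
    """Central difference: needs two extra loss evaluations per weight."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w1, w2, x, y = 0.5, -1.5, 2.0, 1.0
dw1, dw2 = backprop(w1, w2, x, y)
num_dw1 = numerical_gradient(lambda w: loss(w, w2, x, y), w1)
num_dw2 = numerical_gradient(lambda w: loss(w1, w, x, y), w2)
print(dw1, num_dw1)
print(dw2, num_dw2)
```

The two estimates should agree to several decimal places. The numerical version needs separate loss evaluations for every weight, which is where the quadratic cost comes from.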

1.4 The Vanishing Gradient Problem

With sigmoid and tanh, gradients get multiplied through each layer during backprop. Since sigmoid's derivative peaks at 0.25, after 10 layers the gradient shrinks by a factor of $0.25^{10} \approx 10^{-6}$ — the early layers barely learn at all. This is the vanishing gradient problem, and it kept deep networks stuck at 2–3 effective layers for over a decade.
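The arithmetic is easy to check. This sketch considers only the activation derivative at its peak value and ignores the weight matrices, which is a simplifying assumption.

```python
def max_gradient_factor(depth, peak_derivative=0.25):
    """Upper bound on the factor contributed by `depth` sigmoid activations."""
    return peak_derivative ** depth

for depth in (1, 3, 5, 10, 20):
    print(f"{depth:2d} layers: factor <= {max_gradient_factor(depth):.3e}")
```

At depth 10 the bound is 0.25 ** 10 = 1 / 1,048,576, which is the roughly one-in-a-million shrinkage quoted above.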

Solutions came from multiple directions:

ReLU (2010) — gradient is 1 for positive inputs, avoiding the vanishing issue
Batch Normalisation (Ioffe & Szegedy, 2015) — normalises layer inputs to zero mean and unit variance over each mini-batch, stabilising gradients
Skip connections / Residual learning (He et al., 2015) — allows gradients to flow directly through identity shortcuts
Better initialisations — Xavier (2010), Kaiming/He (2015) — set initial weights to preserve variance across layers

2. Convolutional Neural Networks — Teaching Machines to See

2.1 The Inspiration — Hubel & Wiesel (1959)

Neurophysiologists David Hubel and Torsten Wiesel discovered that cat visual cortex neurons respond to specific oriented edges in specific parts of the visual field. "Simple cells" detect edges; "complex cells" respond regardless of exact position. This hierarchy — local features → invariant representations — inspired the design of CNNs.

2.2 The Convolution Operation

Instead of connecting every pixel to every neuron (fully connected), a CNN slides a small filter (e.g. 3×3) across the image, computing a dot product at each position:

$(I * K)[i,j] = \sum_{m}\sum_{n} I[i+m,\, j+n] \cdot K[m,n]$

This has three critical advantages:

Parameter sharing — the same 3×3 filter is used everywhere, so a layer with a 3×3 filter has only 9 weights (+ bias) instead of millions
Translation equivariance — a cat in the top-left activates the same filter as a cat in the bottom-right
Locality — early layers learn local features (edges, textures); deeper layers compose them into higher-level concepts (eyes, faces, objects)
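The sliding-window formula can be sketched in a few lines. This version assumes a single-channel image, no padding, and stride 1. The Sobel kernel and the toy image are illustrative choices.

```python
def conv2d(image, kernel):
    """(I * K)[i][j] = sum_m sum_n I[i+m][j+n] * K[m][n], with no padding.

    As in most deep learning code, the kernel is not flipped, so this is
    technically a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n] for m in range(kh) for n in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# 5x5 image: dark on the left, bright on the right (a vertical edge).
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Sobel filter that responds to left-to-right brightness increases.
sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

edges = conv2d(image, sobel_x)
for row in edges:
    print(row)  # [4, 4, 0] on every row
```

The same nine weights are reused at every position. The response is strong where the window straddles the edge and zero in the uniformly bright region.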

🔍 Interactive: CNN Filter Visualiser

See what common 3×3 convolution filters detect when applied to a sample image. Each filter highlights different features — edges, corners, blurs.

2.3 LeNet-5 — The Pioneer (1998)

Yann LeCun and colleagues (Bell Labs) designed LeNet-5 for handwritten digit recognition (ZIP codes for the US Postal Service). The architecture:

Input: 32×32 greyscale image
Conv layer → 6 feature maps (5×5 filters) → average pooling
Conv layer → 16 feature maps (5×5 filters) → average pooling
3 fully connected layers → 10 outputs (digits 0–9)
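To see how the feature maps shrink through these layers, the sketch below tracks spatial sizes only. It assumes stride-1 convolutions without padding and 2×2 pooling with stride 2, which are not stated in the list above.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window=2, stride=2):
    """Spatial output size of a pooling layer along one dimension."""
    return (size - window) // stride + 1

size = 32
shapes = [("input", size, 1)]
for name, kernel, maps in (("conv1", 5, 6), ("conv2", 5, 16)):
    size = conv_out(size, kernel)
    shapes.append((name, size, maps))
    size = pool_out(size)
    shapes.append((name + "_pool", size, maps))

for name, side, maps in shapes:
    print(f"{name:11s} {side}x{side}x{maps}")

flattened = shapes[-1][1] ** 2 * shapes[-1][2]
print("values fed to the fully connected layers:", flattened)
```

Under those assumptions the 32×32 input becomes 28×28, 14×14, 10×10, and finally 5×5 across 16 maps, so 400 values reach the fully connected layers.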

LeNet-5 achieved 99.2% accuracy on MNIST and was deployed by US banks to read cheques, processing millions of them per day by the early 2000s. It proved CNNs could work in production, but the limited data and compute of the era prevented scaling to harder tasks.

2.4 AlexNet — The Moment Everything Changed (2012)

In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) with AlexNet and won by a staggering 10.9 percentage points over the runner-up (top-5 error: 15.3% vs 26.2%).

The secret ingredients:

GPUs — trained on two GTX 580 GPUs with 3GB VRAM each; 5–6 days of training
ReLU instead of saturating activations (sigmoid/tanh) — about 6× faster convergence than tanh in the authors' CIFAR-10 experiment
Dropout — randomly zeroing 50% of neurons during training to prevent co-adaptation
Data augmentation — random crops, flips, colour jittering for 1.2 million images
Scale — 60 million parameters, 5 conv layers + 3 FC layers

AlexNet didn't invent any single technique, but it proved that scale + GPUs + deep CNNs could demolish hand-crafted feature engineering. The entire computer vision field pivoted to deep learning almost overnight.

📊 Interactive: ImageNet Error Rate Over Time

Watch the top-5 error rate plummet as architectures got deeper and more sophisticated. Human-level performance (~5.1%) was surpassed in 2015.

2.5 The Deeper We Go — VGG, Inception, ResNet

VGGNet (2014)

Showed that simply stacking 3×3 convolutions very deep (16–19 layers) with lots of parameters (138M) was competitive. Elegant but computationally expensive.

GoogLeNet / Inception (2014)

Used "Inception modules": parallel convolutions at different scales (1×1, 3×3, 5×5) whose outputs are concatenated. 22 layers, but only about 5M parameters, roughly 12× fewer than AlexNet and far fewer than VGG's 138M.

ResNet (2015)

Introduced skip connections: $y = F(x) + x$. Instead of learning a new function, each block learns a residual — the difference from identity. Enabled training of 152-layer (and even 1000+ layer) networks. Won ImageNet 2015 with 3.57% error.
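A scalar toy model shows why the identity shortcut helps gradients. The sketch treats each layer as a one-dimensional function with a fixed local derivative, which is a strong simplification; the function names are hypothetical.

```python
import math

def residual_block(x, f):
    """y = F(x) + x: the block only has to learn the residual F."""
    return f(x) + x

def plain_chain_gradient(local_grads):
    """Gradient through stacked layers y = F(x): product of F'(x)."""
    return math.prod(local_grads)

def residual_chain_gradient(local_grads):
    """Gradient through stacked blocks y = F(x) + x: product of (F'(x) + 1)."""
    return math.prod(g + 1.0 for g in local_grads)

# If F outputs zero, the block is exactly the identity.
print(residual_block(3.0, lambda v: 0.0))

# Twenty layers whose local derivative is 0.1 each.
grads = [0.1] * 20
print(f"plain:    {plain_chain_gradient(grads):.3e}")
print(f"residual: {residual_chain_gradient(grads):.3e}")
```

In the plain stack the gradient all but disappears, while the "+ 1" from each shortcut keeps it from collapsing in the residual stack.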

EfficientNet (2019)

Systematically scaled depth, width, and resolution together using a compound coefficient. Its largest variant achieved better ImageNet accuracy than the previous best model (GPipe) with 8.4× fewer parameters.

3. Recurrent Neural Networks — Memory in Networks

3.1 The Basic RNN

CNNs excel at spatial patterns (images), but language and time-series have a sequential structure. A Recurrent Neural Network (RNN) processes one element at a time, maintaining a hidden state that carries information from previous steps:

$h_t = \sigma(W_h h_{t-1} + W_x x_t + b)$

At each step, the hidden state $h_t$ is a function of the current input $x_t$ and the previous hidden state $h_{t-1}$. This creates a "loop" that in principle allows the network to remember information from arbitrarily far in the past.

In practice, basic RNNs suffer from the vanishing gradient problem even more acutely than feedforward networks: gradients shrink exponentially over long sequences, making it very hard to learn dependencies beyond roughly 20 steps.
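The recurrence can be sketched as a loop that feeds each hidden state back into the next step. Here $\sigma$ is taken to be tanh, and the tiny weight matrices are made-up values for illustration.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_step(h_prev, x, W_h, W_x, b):
    """h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    rec = matvec(W_h, h_prev)
    inp = matvec(W_x, x)
    return [math.tanh(r + i + bi) for r, i, bi in zip(rec, inp, b)]

def run_rnn(sequence, W_h, W_x, b):
    h = [0.0] * len(b)               # initial hidden state
    for x in sequence:
        h = rnn_step(h, x, W_h, W_x, b)
    return h

W_h = [[0.5, -0.3], [0.2, 0.4]]      # hidden-to-hidden weights (2x2)
W_x = [[1.0], [-1.0]]                # input-to-hidden weights (2x1)
b = [0.0, 0.0]

with_signal = run_rnn([[1.0], [0.0], [0.0]], W_h, W_x, b)
without_signal = run_rnn([[0.0], [0.0], [0.0]], W_h, W_x, b)
print(with_signal)
print(without_signal)
```

The only difference between the two runs is the first input, yet the final hidden states differ: the state carries a fading trace of what came earlier.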

3.2 LSTM — Long Short-Term Memory (Hochreiter & Schmidhuber, 1997)

The LSTM solves this with a gating mechanism: three gates (forget, input, output) control what information to keep, add, or expose from a cell state $C_t$ that flows through time like a conveyor belt:

Forget gate: $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$ — what to erase from memory
Input gate: $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$ — what new info to store
Cell update: $C_t = f_t \odot C_{t-1} + i_t \odot \tanh(W_c [h_{t-1}, x_t] + b_c)$
Output gate: $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$; $h_t = o_t \odot \tanh(C_t)$

The key insight: the cell state $C_t$ is updated additively rather than through repeated matrix multiplication, so when the forget gate stays close to 1, gradients can flow through many timesteps almost unchanged. This lets LSTMs learn dependencies over hundreds of steps, enabling breakthroughs in machine translation, speech recognition, and text generation.
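The gate equations map onto code almost line by line. In this sketch the parameter dictionary `p`, the one-unit sizes, and the hand-picked biases (forget gate held open, input gate held shut) are assumptions chosen to show the conveyor-belt behaviour.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(h_prev, c_prev, x, p):
    hx = h_prev + x                                            # [h_{t-1}, x_t]
    f = [sigmoid(v) for v in affine(p["W_f"], hx, p["b_f"])]   # forget gate
    i = [sigmoid(v) for v in affine(p["W_i"], hx, p["b_i"])]   # input gate
    cand = [math.tanh(v) for v in affine(p["W_c"], hx, p["b_c"])]
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, cand)]
    o = [sigmoid(v) for v in affine(p["W_o"], hx, p["b_o"])]   # output gate
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

hidden, inputs = 1, 1
zeros = [[0.0] * (hidden + inputs) for _ in range(hidden)]
params = {
    "W_f": zeros, "b_f": [10.0],    # forget gate close to 1: keep memory
    "W_i": zeros, "b_i": [-10.0],   # input gate close to 0: write nothing
    "W_c": zeros, "b_c": [0.0],
    "W_o": zeros, "b_o": [0.0],
}

h, c = [0.0], [1.0]
for _ in range(100):
    h, c = lstm_step(h, c, [0.5], params)
print(c)  # still close to the initial value of 1.0
```

After 100 steps the cell state has barely decayed, which is the behaviour a plain RNN struggles to reproduce.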

3.3 GRU — Gated Recurrent Unit (Cho et al., 2014)

A simplified variant that merges the forget and input gates into a single "update gate" and combines the cell and hidden states. Fewer parameters, comparable performance, and faster to train:

$z_t = \sigma(W_z [h_{t-1}, x_t]), \quad r_t = \sigma(W_r [h_{t-1}, x_t])$

$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tanh(W [r_t \odot h_{t-1}, x_t])$

4. Word Embeddings — Meaning as Geometry

4.1 Word2Vec (Mikolov et al., 2013)

Before word embeddings, words were represented as one-hot vectors: "cat" = [0,0,...,1,...,0]. Every word was equally distant from every other word — no notion of similarity.

Word2Vec trained a shallow neural net to predict context words given a target (Skip-gram) or vice versa (CBOW). The result: each word gets a dense vector (~300 dimensions) where semantic similarity corresponds to geometric proximity:

$\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}$

This was astonishing: the network discovered (without supervision) that "royalty" and "gender" are independent axes in the learned vector space.
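The analogy arithmetic can be illustrated with hand-made vectors. The three-dimensional embeddings below (axes loosely meaning royalty, gender, and fruit) are invented for the example; real Word2Vec vectors are learned and have hundreds of dimensions.

```python
import math

# Toy embeddings, NOT learned: [royalty, gender, fruit].
VECTORS = {
    "king":   [0.9,  0.8, 0.0],
    "queen":  [0.9, -0.8, 0.0],
    "man":    [0.1,  0.8, 0.0],
    "woman":  [0.1, -0.8, 0.0],
    "prince": [0.7,  0.8, 0.0],
    "apple":  [0.0,  0.0, 1.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman", VECTORS))  # queen
```

Subtracting "man" and adding "woman" flips the gender axis while leaving the royalty axis alone, so the nearest remaining word is "queen".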

4.2 GloVe (Pennington et al., 2014)

GloVe (Global Vectors) took a different approach: instead of predicting context via a neural net, it factorised the word co-occurrence matrix directly. The result was similar quality embeddings based on the insight that the ratio of co-occurrence probabilities encodes meaning.

4.3 The Limits of Fixed Embeddings

Word2Vec and GloVe assign each word one vector regardless of context. But "bank" means something different in "river bank" vs "savings bank." This limitation would be addressed by contextual embeddings, which the Transformer models later in this post produce using the attention mechanism we're about to discuss.

5. The Attention Mechanism

5.1 The Bottleneck Problem

In sequence-to-sequence models (e.g. English→French translation), the encoder RNN compresses the entire input sentence into a single fixed-length vector. For long sentences, this bottleneck loses information catastrophically.

5.2 Attention (Bahdanau et al., 2014)

The solution: instead of one summary vector, let the decoder look back at all encoder hidden states and compute a weighted average, focusing on the most relevant parts for each output word.

At each decoding step $t$, we compute attention weights $\alpha_{t,s}$ over all source positions $s$:

$e_{t,s} = a(h_t^{\text{dec}}, h_s^{\text{enc}}), \quad \alpha_{t,s} = \frac{\exp(e_{t,s})}{\sum_{s'}\exp(e_{t,s'})}$

The context vector is then: $c_t = \sum_s \alpha_{t,s} \cdot h_s^{\text{enc}}$

This is the birth of attention: a soft, differentiable address lookup that tells the model "to generate this word, focus on these parts of the input."
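A minimal sketch of one decoding step follows. For simplicity the alignment function $a$ is taken to be a dot product, which is an assumption: Bahdanau et al. used a small feed-forward network for the score. The vectors are made up.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(dec_state, enc_states, score=dot):
    """Return attention weights alpha_{t,s} and the context vector c_t."""
    e = [score(dec_state, h) for h in enc_states]      # e_{t,s}
    alpha = softmax(e)                                 # alpha_{t,s}
    context = [
        sum(a * h[d] for a, h in zip(alpha, enc_states))
        for d in range(len(enc_states[0]))
    ]
    return alpha, context

enc_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]      # three source positions
alpha, context = attend([0.0, 3.0], enc_states)
print(alpha)
print(context)
```

The second source position aligns best with the decoder state, so it receives the largest weight and dominates the context vector.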

🔥 Interactive: Attention Heatmap Visualiser

This shows a simulated attention pattern for an English→French translation. Brighter = higher attention weight. Click a target word to see which source words it attends to.

Source (English) across top · Target (French) down left

6. "Attention Is All You Need" — The Transformer (2017)

6.1 The Key Insight

Vaswani et al. (Google, 2017) asked: what if we get rid of recurrence entirely and build the entire model from attention layers?

RNNs process tokens sequentially — token 50 must wait for tokens 1–49 to finish. This makes them slow on modern parallel hardware (GPUs/TPUs). The Transformer processes all positions simultaneously using self-attention, achieving massive parallelism.

6.2 Self-Attention — Scaled Dot-Product

Each input token is projected into three vectors: Query (Q), Key (K), and Value (V) via learned linear projections. Attention is then:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Intuitively: each query asks "what should I attend to?", the keys say "here's what I contain," the dot product measures relevance, softmax normalises to a probability distribution, and the result is a weighted sum of values.

The $\sqrt{d_k}$ scaling prevents dot products from getting too large (which would push softmax into regions with vanishing gradients).
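Scaled dot-product attention can be sketched with plain lists. The learned projections that produce Q, K, and V are omitted here, and the small matrices are made-up inputs.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed one query row at a time."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wi * v[d] for wi, v in zip(w, V)) for d in range(len(V[0]))])
    return outputs, weights

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]     # identical keys: the query cannot tell them apart
V = [[2.0, 0.0], [4.0, 2.0]]

out, attn = scaled_dot_product_attention(Q, K, V)
print(attn)  # [[0.5, 0.5]]
print(out)   # [[3.0, 1.0]], the average of the two value vectors
```

When the keys are indistinguishable, the softmax is uniform and the output is simply the mean of the values. Differences between keys are what make attention selective.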

6.3 Multi-Head Attention

Instead of one attention function, the Transformer runs $h$ attention heads in parallel, each with different learned projections. This lets the model attend to different types of relations simultaneously:

Head 1 might learn syntactic relationships (subject ↔ verb)
Head 2 might learn positional proximity
Head 3 might learn coreference (pronouns ↔ referents)

$\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$

6.4 Positional Encoding

Since the Transformer has no recurrence or convolution, it has no inherent notion of order. Position is injected via positional encodings added to the input embeddings:

$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$

These sinusoidal functions allow the model to extrapolate to sequence lengths longer than those seen during training (in theory), and each dimension oscillates at a different frequency, creating a unique "fingerprint" for each position.
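The encoding formula is short enough to write out directly. This sketch assumes an even embedding size `d`; the function name is hypothetical.

```python
import math

def positional_encoding(pos, d):
    """Sinusoidal encoding for one position: sin on even indices, cos on odd."""
    pe = []
    for i in range(d // 2):
        angle = pos / (10000 ** (2 * i / d))
        pe.append(math.sin(angle))   # dimension 2i
        pe.append(math.cos(angle))   # dimension 2i + 1
    return pe

for pos in range(4):
    print(pos, [round(v, 3) for v in positional_encoding(pos, 8)])
```

The first dimensions change quickly from one position to the next while the later ones barely move, which is the multi-frequency "fingerprint" described above.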

6.5 The Full Transformer Architecture

The complete model stacks $N$ identical layers (typically 6–96), each containing:

Multi-head self-attention
Layer normalisation
Position-wise feed-forward network (two linear layers with ReLU/GELU)
Residual connections around each sub-layer

For encoder-decoder models (like the original Transformer for translation), the decoder additionally has cross-attention layers attending to encoder outputs, and causal masking to prevent attending to future tokens during generation.

7. The Foundation Model Era (2018–Present)

7.1 BERT — Bidirectional Representations (2018)

Google's BERT (Bidirectional Encoder Representations from Transformers) used only the encoder half of the Transformer. It was pre-trained with two tasks:

Masked Language Model (MLM) — randomly mask 15% of tokens and predict them from context: "The [MASK] sat on the mat" → "cat"
Next Sentence Prediction (NSP) — given two sentences, predict whether the second follows the first

BERT's key insight: by attending in both directions simultaneously, each token's representation is context-dependent, solving the "bank" ambiguity problem. BERT-Large has 340M parameters and, fine-tuned separately for each task, achieved state-of-the-art results on 11 NLP tasks.
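The masking step of the MLM objective can be sketched as follows. This simplified version always substitutes `[MASK]` and masks at least one token; the whitespace tokenisation and function name are assumptions for illustration.

```python
import random

def mask_tokens(tokens, rng, mask_rate=0.15):
    """Hide about `mask_rate` of the tokens and remember the originals as labels."""
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    labels = {}
    for p in positions:
        labels[p] = tokens[p]     # what the model must predict
        masked[p] = "[MASK]"
    return masked, labels

tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, random.Random(0))
print(masked)
print(labels)
```

The model sees `masked` as input and is trained to recover the entries in `labels` from the surrounding context on both sides.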

7.2 GPT — Generative Pre-training (2018–2024)

OpenAI's GPT (Generative Pre-trained Transformer) used only the decoder half, trained autoregressively — predict the next token given all previous tokens:

$P(x_t \mid x_1, x_2, \ldots, x_{t-1})$

The GPT series scaled dramatically:

Model	Year	Parameters	Training Data
GPT-1	2018	117M	BookCorpus (5GB)
GPT-2	2019	1.5B	WebText (40GB)
GPT-3	2020	175B	Filtered internet (570GB)
GPT-4	2023	~1.8T (rumoured MoE)	Undisclosed (multi-modal)

7.3 The Scaling Hypothesis

Kaplan et al. (2020) published "Scaling Laws for Neural Language Models," showing that performance improves predictably as a power law of model size, dataset size, and compute — with no sign of saturation. This gave rise to the scaling hypothesis: maybe you don't need new architectures, just more scale.

"Performance depends strongly on scale, weakly on model shape." — Scaling Laws paper, OpenAI (2020)

📈 Interactive: Scaling Laws Visualiser

Drag the sliders to see how model performance (loss) changes with parameters and data. The relationship is a power law: $L \approx a \cdot N^{-\alpha}$ (roughly).

Sliders: parameters (log₁₀) and data tokens (log₁₀). Output: estimated cross-entropy loss (nats).
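For readers without the interactive demo, the parameter-only power law can be sketched in a few lines. The constants below are approximate values from the Kaplan et al. fit for model size and should be treated as illustrative assumptions; this is not the formula used by the demo.

```python
ALPHA_N = 0.076   # approximate exponent for model size (assumption)
N_C = 8.8e13      # approximate scale constant, in parameters (assumption)

def loss_from_params(n_params, alpha=ALPHA_N, n_c=N_C):
    """L(N) = (N_c / N) ** alpha, i.e. a * N ** (-alpha) with a = N_c ** alpha."""
    return (n_c / n_params) ** alpha

for exponent in range(6, 13):
    n = 10.0 ** exponent
    print(f"N = 1e{exponent:<2d}  estimated loss = {loss_from_params(n):.3f} nats")
```

Each doubling of the parameter count multiplies the estimated loss by the same factor, 2 ** (-alpha): a straight line on a log-log plot.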

7.4 Instruction Tuning & RLHF

Raw GPT models are trained to predict the next token — they're excellent at completing text but don't naturally follow instructions or avoid harmful outputs. Two techniques bridge this gap:

Instruction tuning (FLAN, InstructGPT) — fine-tune on (instruction, response) pairs curated by human annotators
RLHF — Reinforcement Learning from Human Feedback — train a reward model from human preference comparisons, then use PPO (Proximal Policy Optimisation) to fine-tune the language model to maximise the predicted reward. This is what made ChatGPT dramatically more useful and aligned than raw GPT-3.
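The reward model is typically trained on pairs of responses where annotators preferred one over the other. A common choice of loss, assumed here, is the pairwise logistic loss $-\log \sigma(r_{\text{chosen}} - r_{\text{rejected}})$. The reward values below are made up.

```python
import math

def preference_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written as log(1 + exp(-margin))."""
    margin = r_chosen - r_rejected
    return math.log1p(math.exp(-margin))

print(preference_loss(2.0, 0.0))   # reward model agrees with the annotator: small loss
print(preference_loss(0.0, 0.0))   # no preference expressed: log(2)
print(preference_loss(0.0, 2.0))   # reward model disagrees: large loss
```

Minimising this loss pushes the reward of the preferred response above the rejected one; the resulting scalar reward is what PPO then maximises.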

7.5 ChatGPT & The Mainstream Moment (Nov 2022)

ChatGPT (GPT-3.5 + RLHF) reached an estimated 100 million users in 2 months, at the time the fastest-growing consumer application on record. It showed that transformer-based language models, properly aligned, could serve as general-purpose AI assistants for coding, writing, analysis, education, and creative work.

8. Beyond Language — Multi-Modal Models

8.1 Vision Transformers — ViT (2020)

Dosovitskiy et al. showed that you can chop an image into 16×16 patches, flatten them, add positional embeddings, and feed them into a standard Transformer encoder — no convolutions needed. With enough data (JFT-300M), ViT matched or surpassed CNNs on ImageNet. The Transformer had conquered vision too.
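The patching step is simple enough to sketch without any libraries. The example assumes a single-channel 224×224 image stored as nested lists; the learned linear projection and positional embeddings that follow in a real ViT are omitted.

```python
def image_to_patches(image, patch=16):
    """Split an H x W image into flattened, non-overlapping patch x patch tiles."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return patches

# Dummy image whose pixel value encodes its own position.
image = [[r * 224 + c for c in range(224)] for r in range(224)]
patches = image_to_patches(image)
print(len(patches), len(patches[0]))  # 196 256
```

A 224×224 image becomes a sequence of 196 "tokens", each a vector of 256 pixel values, which is short enough for a standard Transformer encoder to process.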

8.2 DALL-E, Stable Diffusion & Image Generation

DALL-E (OpenAI, 2021) used a transformer to generate images from text prompts. Diffusion models (DDPM, 2020; Stable Diffusion, 2022) learn to reverse a noise process:

$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta(x_t, t)\right) + \sigma_t z$

Starting from pure noise, the model iteratively denoises to produce photorealistic images. Combined with CLIP-guided text conditioning, this enabled "text-to-image" generation that stunned the world.
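One reverse step of the equation above can be sketched as follows. A real model would supply $\epsilon_\theta(x_t, t)$ from a trained network; here the true noise is passed in as a stand-in, and the forward-noising formula $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ is the standard DDPM one, which the text does not state.

```python
import math
import random

def reverse_step(x_t, eps_pred, alpha_t, alpha_bar_t, sigma_t, rng):
    """x_{t-1} = (x_t - (1 - a_t) / sqrt(1 - abar_t) * eps) / sqrt(a_t) + sigma_t * z."""
    coef = (1.0 - alpha_t) / math.sqrt(1.0 - alpha_bar_t)
    return [
        (x - coef * e) / math.sqrt(alpha_t) + sigma_t * rng.gauss(0.0, 1.0)
        for x, e in zip(x_t, eps_pred)
    ]

# Noise a known x_0 for one step, then undo it with a perfect noise estimate.
alpha = 0.9                      # at t = 1, alpha_bar_1 equals alpha_1
x0 = [0.5, -1.0]
eps = [0.3, -0.7]
x1 = [math.sqrt(alpha) * a + math.sqrt(1.0 - alpha) * e for a, e in zip(x0, eps)]

recovered = reverse_step(x1, eps, alpha, alpha, sigma_t=0.0, rng=random.Random(0))
print(recovered)  # approximately [0.5, -1.0]
```

With an exact noise estimate and no added noise, a single step recovers the original sample. In practice the estimate is imperfect, which is why sampling takes many small steps.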

8.3 Multi-Modal Foundation Models

GPT-4V, Gemini, and Claude can process both text and images (and increasingly audio, video, and code) within a single model. The convergence is clear: the Transformer architecture is becoming a universal computation engine for all modalities of information.

9. Next-Token Prediction — The Surprising Power of Simplicity

The core training objective of GPT is almost absurdly simple: given a sequence of tokens, predict the next one. Minimise cross-entropy loss:

$\mathcal{L} = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})$

And yet, this objective — scaled to trillions of tokens and hundreds of billions of parameters — produces systems that can write code, solve math, explain concepts, translate languages, and reason about the world. The mystery of emergent capabilities from scale remains one of the deepest open questions in AI.
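The cross-entropy objective itself is a one-liner once the model's probabilities for the correct tokens are known. The probabilities below are made up, and `perplexity` is a derived quantity added for intuition.

```python
import math

def sequence_nll(correct_token_probs):
    """Negative log-likelihood: minus the sum of log P(x_t | earlier tokens)."""
    return -sum(math.log(p) for p in correct_token_probs)

# Probability the model assigned to each token that actually came next.
probs = [0.5, 0.25, 0.125]

nll = sequence_nll(probs)
perplexity = math.exp(nll / len(probs))
print(nll)          # 6 * ln(2), about 4.159 nats
print(perplexity)   # 4.0: as uncertain as a uniform choice among 4 tokens
```

Training lowers this number by nudging probability mass towards the tokens that actually occur.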

🎯 Interactive: Next-Token Predictor

Type a prompt and see a simulated next-token prediction with probability bars. This uses a simple n-gram model (not a real transformer) to illustrate the concept.
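A bigram model is about the simplest next-token predictor there is, and it can be sketched in a few lines. The tiny corpus and function names are assumptions; this is not the code behind the demo.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count how often each word follows each other word."""
    tokens = text.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word, k=3):
    """Return up to k (token, probability) pairs for the word that follows."""
    following = counts.get(word.lower())
    if not following:
        return []
    total = sum(following.values())
    return [(tok, n / total) for tok, n in following.most_common(k)]

model = train_bigram("the cat sat on the mat the cat ate the fish")
print(predict_next(model, "the"))   # "cat" is the most likely continuation at 0.5
```

A transformer does the same job of producing a probability distribution over the next token, but it conditions on the whole preceding context rather than one word.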

10. The State of AI Today & What's Next

10.1 Current Frontiers

🤖 AI Agents

Models that can plan, use tools, browse the web, write & execute code, and take multi-step actions autonomously.

🧪 Reasoning

Chain-of-thought prompting, tree-of-thought search, and inference-time compute scaling (o1, DeepSeek-R1) are teaching models to "think step by step."

🌍 Multi-Modal

Models that natively handle text, images, audio, video, and code in a single architecture. The line between "NLP" and "CV" has dissolved.

⚡ Efficiency

Quantisation (4-bit, 2-bit), distillation, mixture-of-experts, and speculative decoding are making frontier models faster and cheaper.

10.2 Open Questions

Understanding: Do large language models "understand" anything, or are they sophisticated pattern matchers? (Stochastic parrots vs. emergent world models)
Alignment: How do we ensure increasingly capable AI systems pursue human values?
Scaling limits: Will the power-law scaling continue? What happens when we run out of internet text?
Reasoning: Can next-token prediction give rise to genuine logical reasoning, or do we need fundamentally new approaches?
AGI: Are transformers the architecture that reaches artificial general intelligence, or just a stepping stone?

11. Timeline — The Deep Learning Era

Year	Milestone	Why It Matters
1997	LSTM	Learned long-range dependencies in sequences
1998	LeNet-5	CNN for handwritten digit recognition, deployed commercially
2006	Deep Belief Networks	Hinton's pre-training unlocked deep networks
2010	ReLU activation	Much faster training than saturating activations; became the default activation
2012	AlexNet wins ImageNet	Deep learning's Big Bang: 10.9pp lead over the runner-up
2013	Word2Vec	Words as meaningful vectors; king − man + woman ≈ queen
2014	GRU / Attention / GANs	Simplified recurrence; attention for translation; generative adversarial nets
2015	ResNet / Batch Norm	Skip connections enabled 152+ layer networks
2016	AlphaGo beats Lee Sedol	Deep RL conquers Go — 10^170 positions
2017	Transformer	"Attention Is All You Need" — the architecture that changed everything
2018	BERT / GPT-1	Pre-training revolution: encode knowledge, then fine-tune
2019	GPT-2 / EfficientNet	1.5B params; "too dangerous to release" (then released)
2020	GPT-3 / ViT / Scaling Laws	175B params; transformers for vision; predictable improvement
2021	DALL-E / Codex / AlphaFold 2	Text-to-image; code generation; highly accurate protein structure prediction
2022	ChatGPT / Stable Diffusion	AI goes mainstream — 100M users in 2 months
2023	GPT-4 / Claude / Gemini	Multi-modal frontier models; AI reasoning
2024	o1 / Agents / Open-source surge	Inference-time compute; Llama 3; DeepSeek; AI coding agents
2025–26	Agent era continues	Autonomous software engineering, scientific discovery, and beyond

12. The Full Arc — From Llull to LLMs

Across two posts we've covered an extraordinary intellectual journey:

Part 1: Logic machines → formal logic → search algorithms → expert systems → statistical methods → linear regression & gradient descent
Part 2: Multi-layer networks → CNNs → RNNs/LSTMs → word embeddings → attention → transformers → GPT → foundation models

The thread connecting it all is a single question: can machines think?

At each era, the answer shifts. Llull would say "they can combine concepts." Babbage would say "they can compute." Turing would say "they can simulate anything computable." The neats would say "they can reason with rules." The scruffies would say "they can learn from data." The transformers would say "give me enough text and I'll learn… everything?"

We still don't have a definitive answer. But the tools are extraordinary, the progress is accelerating, and the story — 750 years and counting — is far from over. Read Part 3: Agents, Reasoning & the Road to AGI →