Ivan Santoso - Blog/how-does-ai-think-part-3

📖 This is Part 3 of the "How Does AI Think?" series. Part 1 · ← Part 2: From Neural Nets to Transformers

Introduction — Beyond Pattern Matching

Part 1 covered classical AI: logic, search, and statistics. Part 2 covered the deep learning revolution: neural networks, CNNs, transformers, and foundation models. Both were fundamentally about perception and prediction — recognising patterns and generating likely continuations.

But intelligence isn't just about seeing and predicting. It's about acting, planning, reasoning, and adapting to new situations. This part explores the frontiers: reinforcement learning that learns from rewards, AI agents that use tools and take multi-step actions, reasoning models that think before answering, the open-source ecosystem reshaping access, and the critical questions of safety, alignment, and what it would mean to build artificial general intelligence.

1. The RL Framework

Supervised learning needs labelled examples. Unsupervised learning finds structure in unlabelled data. Reinforcement learning (RL) is a third paradigm: an agent interacts with an environment, takes actions, receives rewards (or penalties), and learns a policy — a mapping from states to actions that maximises cumulative reward over time.

Formally, RL is modelled as a Markov Decision Process (MDP):

$\mathcal{S}$ — set of states
$\mathcal{A}$ — set of actions
$P(s' | s, a)$ — transition probability (environment dynamics)
$R(s, a, s')$ — reward function
$\gamma \in [0, 1)$ — discount factor (how much future rewards matter)

The agent seeks a policy $\pi(a | s)$ that maximises the expected discounted return:

$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
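To make the discounted return concrete, here is a minimal sketch that computes $G_0$ for a finite episode. The reward list and the function name `discounted_return` are illustrative assumptions, not something from the original post.

```python
def discounted_return(rewards, gamma):
    """Return G_0 = sum_k gamma^k * R_{k+1} for a finite list of rewards."""
    g = 0.0
    # Work backwards through the episode: G = r + gamma * G_next
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical episode: the only reward arrives on the third step.
rewards = [0.0, 0.0, 1.0]
print(discounted_return(rewards, gamma=0.9))  # roughly 0.9 ** 2
```

With $\gamma$ close to 1 the late reward keeps almost all of its value; with a small $\gamma$ it is discounted heavily, which is what "how much future rewards matter" means in practice.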

1.1 Q-Learning (Watkins, 1989)

Q-learning is a model-free, off-policy algorithm that learns the action-value function $Q(s, a)$ — the expected return of taking action $a$ in state $s$ and then following the optimal policy:

$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$

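The term in brackets is the temporal difference (TD) error: the gap between the return we predicted, $Q(s, a)$, and the one-step target we just observed, $r + \gamma \max_{a'} Q(s', a')$. Here $\alpha$ is the learning rate, $r$ the reward received, and $s'$ the next state. Over many episodes, provided every state-action pair keeps being visited and $\alpha$ decays appropriately, $Q$ converges to the optimal action-value function, and the policy is simply: pick the action with the highest $Q$.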
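The update rule above can be sketched as tabular Q-learning on a toy problem. Everything about the environment here (a five-cell corridor with a trap at one end and a goal at the other, the hyperparameter values, the function names) is an assumption made for illustration.

```python
import random

N_STATES = 5            # corridor cells 0..4; cell 0 is a trap, cell 4 is the goal
ACTIONS = (-1, +1)      # action index 0 = step left, index 1 = step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.3


def step(state, action_idx):
    """Deterministic toy dynamics: return (next_state, reward, done)."""
    nxt = state + ACTIONS[action_idx]
    if nxt == 0:
        return nxt, -1.0, True
    if nxt == N_STATES - 1:
        return nxt, 1.0, True
    return nxt, 0.0, False


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 2                                   # always start in the middle
        for _ in range(100):                    # safety cap on episode length
            if rng.random() < EPSILON:          # explore
                a = rng.randrange(2)
            else:                               # exploit current estimates
                a = max((0, 1), key=lambda i: q[s][i])
            s2, r, done = step(s, a)
            # Terminal states have no future value, so the target is just r
            target = r if done else r + GAMMA * max(q[s2])
            q[s][a] += ALPHA * (target - q[s][a])   # TD update
            s = s2
            if done:
                break
    return q


q_table = train()
# Greedy action (0 = left, 1 = right) in the three non-terminal cells
greedy = [max((0, 1), key=lambda i: q_table[s][i]) for s in (1, 2, 3)]
print(greedy)
```

Because the behaviour policy is epsilon-greedy while the target uses the `max` over next actions, this is the off-policy setup the text describes.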

🎮 Interactive: Q-Learning Grid World

The agent (🤖) must reach the goal (⭐) while avoiding traps (💀). Watch Q-values form as the agent explores. Arrows show the current best action per cell. Higher opacity = higher confidence.


1.2 Policy Gradient Methods

Q-learning works well for small, discrete state spaces. For continuous or high-dimensional environments (robotics, games with pixel input), we parameterise the policy directly as a neural network $\pi_\theta(a|s)$ and optimise it with gradient ascent on expected return:

$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\right]$

This is the REINFORCE algorithm (Williams, 1992). Modern variants include:

Actor-Critic — separate networks for policy (actor) and value estimation (critic), reducing gradient variance
A3C / A2C (Mnih et al., 2016) — asynchronous/advantage actor-critic with parallel workers
PPO (Schulman et al., 2017) — Proximal Policy Optimisation, clipping the objective to prevent destructive updates. The workhorse of modern RL, used for RLHF in ChatGPT.
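As a rough illustration of the REINFORCE gradient shown before this list, the sketch below trains a softmax policy on a two-armed bandit, where each episode is a single step and the return is just the reward. The bandit, its payoffs, and the learning rate are assumptions chosen to keep the example tiny; it uses no baseline or critic, so it is the plain 1992-style estimator rather than any of the modern variants.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_rewards, steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * len(arm_rewards)        # policy parameters: one logit per arm
    for _ in range(steps):
        probs = softmax(theta)
        # Sample an action from the current policy pi_theta
        a = rng.choices(range(len(theta)), weights=probs)[0]
        g = arm_rewards[a]                  # one-step episode: return equals reward
        # For a softmax policy: d/d theta_i log pi(a) = 1[i == a] - pi_i
        for i in range(len(theta)):
            grad_log_pi = (1.0 if i == a else 0.0) - probs[i]
            theta[i] += lr * g * grad_log_pi    # gradient ascent on expected return
    return softmax(theta)


# Hypothetical payoffs: arm 1 pays 1.0, arm 0 pays nothing.
final_probs = reinforce_bandit([0.0, 1.0])
print(final_probs)
```

The policy shifts probability toward the rewarding arm without ever estimating a value function, which is the key difference from Q-learning.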

2. Game-Playing AI — From Atari to StarCraft

2.1 DQN — Atari from Pixels (DeepMind, 2013)

Deep Q-Networks (DQN) combined Q-learning with deep CNNs: feed raw game pixels as state, output Q-values for each joystick action. Two key innovations made it stable:

Experience replay — store transitions in a buffer, sample random mini-batches for training (breaks correlation between consecutive samples)
Target network — a frozen copy of the Q-network updated periodically, preventing oscillating bootstrap targets
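The two stabilisers above can be sketched without any deep learning library. In this toy version the "networks" are plain dictionaries of floats and the sync interval is made up, so treat it as a picture of the bookkeeping rather than a DQN implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)   # oldest transitions drop off automatically
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive steps
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def sync_target(online_params, target_params):
    """Copy the online weights into the frozen target network."""
    target_params.clear()
    target_params.update(online_params)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(state=t, action=0, reward=0.0, next_state=t + 1, done=False)
batch = buffer.sample(2)

online, target = {"w": 0.0}, {"w": 0.0}
SYNC_EVERY = 4                      # hypothetical sync interval
for step_idx in range(1, 9):
    online["w"] += 0.1              # stand-in for one gradient step
    if step_idx % SYNC_EVERY == 0:
        sync_target(online, target)
```

Between syncs the target weights stay fixed, so the bootstrap targets do not move every time the online network is updated.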

The follow-up Nature paper (Mnih et al., 2015) applied a single DQN architecture, with no game-specific knowledge, to 49 Atari games and reached human-level performance or better on 29 of them. Breakout, Pong, Space Invaders: all from raw pixels and a score signal.

2.2 AlphaGo — Conquering Go (2016)

Go has $\sim10^{170}$ possible board positions — brute force is completely impossible. DeepMind's AlphaGo combined:

Policy network — a CNN trained on expert human games to predict likely moves
Value network — a CNN trained to predict the winner from any position
Monte Carlo Tree Search (MCTS) — guided by the policy and value networks to explore the most promising moves
Self-play RL — the system played millions of games against itself to improve beyond human level
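To show how the policy and value networks can guide the search, here is a sketch of a PUCT-style selection rule of the kind used in AlphaGo-like systems: each candidate move is scored by its average value so far plus an exploration bonus weighted by the policy prior. The constant `c_puct`, the function name, and the numbers in the example are assumptions, and published systems differ in the exact form.

```python
import math


def puct_select(priors, visit_counts, value_sums, c_puct=1.5):
    """Return the index of the child maximising Q + U."""
    total_visits = sum(visit_counts)
    best_idx, best_score = None, -math.inf
    for i, (p, n, w) in enumerate(zip(priors, visit_counts, value_sums)):
        q = w / n if n > 0 else 0.0                          # mean value from past simulations
        u = c_puct * p * math.sqrt(total_visits) / (1 + n)   # bonus: high prior, few visits
        if q + u > best_score:
            best_idx, best_score = i, q + u
    return best_idx


# Hypothetical node with three candidate moves
priors = [0.6, 0.3, 0.1]        # policy network output
visits = [10, 1, 0]             # how often each move has been explored
value_sums = [4.0, 0.8, 0.0]    # accumulated value estimates
chosen = puct_select(priors, visits, value_sums)
print(chosen)
```

In this example the heavily visited move with the highest prior is not chosen, because the less-explored second move has a better average value and a larger remaining bonus.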

In March 2016, AlphaGo defeated Lee Sedol (one of the greatest Go players in history) 4–1 in a five-game match. Move 37 of Game 2, a seemingly bizarre fifth-line shoulder hit, was later analysed as a creative masterpiece; by AlphaGo's own estimate, a human professional would have played it only about once in 10,000 games.

"I thought AlphaGo was based on probability calculation and it was merely a machine. But when I saw this move, I changed my mind. Surely, AlphaGo is creative." — Lee Sedol, after Move 37

2.3 AlphaZero — Learning from Nothing (2017)

AlphaZero went further: no human games at all. Starting from random play and the rules of the game, it trained entirely through self-play:

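Mastered Go within hours (surpassing the AlphaGo version that beat Lee Sedol after roughly 8 hours of training; the often-quoted 3 days belongs to its predecessor, AlphaGo Zero)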
Mastered Chess in 4 hours (surpassing Stockfish)
Mastered Shogi in 2 hours (surpassing the best engine)

One architecture. One algorithm. Three different games. Zero human knowledge. The message was clear: self-play + search + deep learning is a remarkably general recipe for intelligence.

2.4 Beyond Board Games

OpenAI Five (2019)

Beat the world champion team in Dota 2 — a game with imperfect information, long time horizons, and continuous action spaces. Trained with 10 months of self-play (180 years of game time per day).

AlphaStar (2019)

DeepMind's StarCraft II agent reached Grandmaster level, ranking above 99.8% of officially ranked human players. Real-time strategy with fog of war, macro/micro management, and thousands of possible actions per step.

MuZero (2020)

Learned to master Atari, Go, Chess, and Shogi without even knowing the rules — it learned a model of the environment internally and planned within that model.

AlphaFold 2 (2020)

Applied deep learning to protein structure prediction, widely regarded as cracking a 50-year-old grand challenge of biology, with accuracy close to experimental methods for many proteins. By 2022 the AlphaFold database had grown to predicted 3D structures for 200M+ proteins.

📊 Interactive: RL Milestones — Superhuman Performance Timeline

When did AI first surpass human experts at various tasks? Hover for details.

3. From Chatbots to Agents

ChatGPT demonstrated that LLMs could carry on useful conversations. But a conversation is passive — the model responds, you act. An AI agent is a system that can perceive its environment, plan a sequence of actions, execute them (calling tools, writing code, browsing the web), observe results, and iterate until the goal is accomplished.

3.1 The ReAct Framework (Yao et al., 2022)

ReAct (Reasoning + Acting) interleaves chain-of-thought reasoning with tool-use actions in a loop:

Thought — reason about the current situation and what to do next
Action — call a tool (search, calculator, code execution, API, etc.)
Observation — read the tool's output
Repeat until the question is answered
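The Thought / Action / Observation loop above can be sketched in a few lines. The "model" here is a scripted stand-in rather than a real LLM, and the tool registry, the `finish` action, and the transcript format are assumptions made for illustration.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def calculator(expression):
    """Tiny tool: evaluate 'a op b' where op is one of + - * /."""
    left, op, right = expression.split()
    return str(OPS[op](float(left), float(right)))


def scripted_model(transcript):
    """Stand-in for an LLM: returns (thought, action, action_input)."""
    if "Observation:" not in transcript:
        return ("I need to compute 17 * 34.", "calculator", "17 * 34")
    last_obs = transcript.rsplit("Observation: ", 1)[1].strip()
    return ("I have the result.", "finish", last_obs)


def react(question, model, tools, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        thought, action, arg = model(transcript)                    # Thought
        transcript += f"Thought: {thought}\nAction: {action}[{arg}]\n"
        if action == "finish":
            return arg, transcript
        observation = tools[action](arg)                            # Action
        transcript += f"Observation: {observation}\n"               # Observation
    return None, transcript                                         # gave up


answer, trace = react("What is 17 * 34?", scripted_model, {"calculator": calculator})
print(trace)
```

The important design point is that every observation is appended to the transcript, so the next reasoning step can condition on what the tool actually returned.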

🔧 Interactive: ReAct Agent Simulator

Pick a question and watch a simulated ReAct agent reason through it step by step, calling tools along the way.

3.2 Tool Use — Extending the Model's Reach

A language model alone can't browse the web, run code, query a database, or send an email. Tool use bridges this gap: the model generates structured API calls, the system executes them, and the results are fed back to the model.

Key tool-use frameworks:

Toolformer (Meta, 2023) — trained an LM to decide when and how to call APIs by inserting API calls into training data
Function calling (OpenAI, Anthropic) — models output structured JSON matching tool schemas, enabling reliable integration with external systems
MCP (Model Context Protocol) (Anthropic, 2024) — an open standard for connecting AI models to data sources and tools, like "USB-C for AI"
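On the system side, function calling boils down to parsing the model's structured output, validating it against a tool schema, and dispatching it. The sketch below uses a made-up schema format and a hypothetical `get_weather` tool with a canned result; it is not the request format of any particular vendor or of MCP.

```python
import json

# Hypothetical tool registry: parameter names mapped to expected Python types
TOOLS = {
    "get_weather": {
        "description": "Look up current weather for a city (illustrative only).",
        "parameters": {"city": str},
        "handler": lambda city: {"city": city, "temp_c": 21},   # canned result
    }
}


def dispatch(tool_call_json):
    """Validate a JSON tool call produced by the model and execute it."""
    call = json.loads(tool_call_json)
    spec = TOOLS.get(call.get("name"))
    if spec is None:
        return {"error": f"unknown tool: {call.get('name')}"}
    args = call.get("arguments", {})
    for param, expected_type in spec["parameters"].items():
        if not isinstance(args.get(param), expected_type):
            return {"error": f"bad or missing argument: {param}"}
    # Pass only the declared parameters through to the handler
    return spec["handler"](**{name: args[name] for name in spec["parameters"]})


model_output = '{"name": "get_weather", "arguments": {"city": "Jakarta"}}'
result = dispatch(model_output)
print(result)
```

Returning an error object instead of raising lets the result be fed straight back to the model, which can then retry with corrected arguments.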

3.3 AI Coding Agents

One of the most impactful agent applications: autonomous software engineering.

GitHub Copilot (2021)

Inline code completion powered by Codex/GPT-4. Autocomplete on steroids — suggests whole functions from comments.

Devin (Cognition, 2024)

Claimed "first AI software engineer" — plans, writes code, runs tests, debugs iteratively in a sandboxed environment.

Claude Code (Anthropic, 2025)

Agentic coding in the terminal: reads codebases, makes multi-file edits, runs tests, handles git — operating as an autonomous pair programmer.

SWE-bench

A benchmark of real GitHub issues from popular repos. AI agents must read the issue, understand the codebase, write a patch, and pass tests. Top agents now solve ~50%+ of issues.

3.4 The Agent Loop

Modern AI agents follow a general architecture:

Perceive — read the environment (code, web page, database, user message)
Think — chain-of-thought reasoning about the situation and plan
Act — call a tool, execute code, write a file, send a message
Observe — read the result of the action
Reflect — evaluate whether the goal is met; if not, return to the Think step

The key challenges remain: planning over long horizons (agents degrade over many steps), error recovery (one wrong action can cascade), safety (an agent with tool access can cause real harm), and evaluation (how do we measure competence on open-ended tasks?).

4. Chain-of-Thought Reasoning

4.1 The Discovery (Wei et al., 2022)

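Chain-of-thought (CoT) prompting is deceptively simple: instead of asking only "What is 17 × 34?", you show the model worked examples that include their intermediate steps (Wei et al., 2022), or simply append "Let's think step by step" (the zero-shot variant from Kojima et al., 2022). This small change dramatically improves performance on math, logic, and multi-step problems.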

Why? Standard LLMs generate one token at a time without working memory. CoT externalises the reasoning process into the token stream — each intermediate step becomes context for the next step, effectively giving the model scratch space.
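For the 17 × 34 question, the scratch space might hold something like the following. This is one possible decomposition, shown only to illustrate how each intermediate result becomes context for the next step:

$17 \times 34 = 17 \times 30 + 17 \times 4 = 510 + 68 = 578$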

🧠 Interactive: Chain-of-Thought vs Direct Answer

See how step-by-step reasoning breaks down a problem that a direct answer would likely get wrong.

4.2 Inference-Time Compute — o1 and Beyond (2024)

OpenAI's o1 model (September 2024) took CoT to its logical extreme: instead of the user asking for step-by-step reasoning, the model automatically generates a hidden chain of thought before answering. It allocates variable amounts of inference-time compute — spending more "thinking time" on harder problems.

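The results were dramatic: OpenAI reported that o1 reached the 89th percentile on Codeforces competitive programming and exceeded human PhD-level accuracy on a benchmark of graduate-level physics, chemistry, and biology questions (GPQA Diamond). According to OpenAI, performance kept improving both with more reinforcement learning during training and with more thinking time at inference.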

"Scaling inference-time compute is a new axis of scaling that complements training-time scaling. You can keep making the model smarter even after training is done." — Noam Brown (OpenAI), paraphrased

4.3 DeepSeek-R1 — Open-Source Reasoning (2025)

The DeepSeek-R1 work (January 2025) demonstrated that reasoning capabilities can emerge through pure RL: its R1-Zero variant started from a base model and was trained with reinforcement learning on reasoning tasks, without any supervised chain-of-thought data. The model spontaneously developed behaviours like self-verification, backtracking, and exploring multiple solution paths. The released DeepSeek-R1 adds a small supervised "cold-start" stage before RL, mainly to make the reasoning more readable.

Key insight: reasoning can emerge from reward signals alone. You don't need to explicitly teach the model to think step by step — you just reward correct answers on hard problems, and the thinking develops as a strategy.
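"Reward correct answers" can be as simple as a rule-based checker. The sketch below scores a completion on two things: whether the final answer matches a reference, and whether the reasoning is wrapped in tags. The `<think>` tag convention, the `Answer:` line, and the reward weights are assumptions for illustration, not a description of any lab's actual reward function.

```python
import re


def reasoning_reward(completion, reference_answer):
    """Rule-based reward: small format bonus plus 1.0 for a correct final answer."""
    reward = 0.0
    # Format check: reasoning should sit inside <think>...</think>
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.1                      # hypothetical format bonus
    # Accuracy check: compare the text after the final "Answer:" marker
    match = re.search(r"Answer:\s*(.+?)\s*$", completion.strip())
    if match and match.group(1) == reference_answer:
        reward += 1.0
    return reward


good = "<think>17*30=510, 17*4=68, total 578</think>\nAnswer: 578"
wrong = "<think>probably about 600</think>\nAnswer: 600"
print(reasoning_reward(good, "578"), reasoning_reward(wrong, "578"))
```

Nothing in this reward says how to reason; it only checks the outcome, which is why any step-by-step behaviour that appears has to be discovered by the policy itself.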

4.4 Tree-of-Thought and Monte Carlo Methods

Tree-of-Thought (ToT) extends CoT by exploring multiple reasoning paths in parallel, evaluating each, and backtracking from dead ends — mimicking how humans consider alternatives when solving puzzles.

Combined with Monte Carlo Tree Search (from AlphaGo), this creates a powerful paradigm: use MCTS to explore the space of possible reasoning chains, with the LLM as both the policy (which thought to try next) and value (how promising is this path) networks.
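A simplified picture of searching over reasoning paths is a beam search in which one function proposes next thoughts and another scores them. In the sketch below both roles are played by plain Python functions on a toy arithmetic puzzle (reach a target number using "+3" and "*2"); with an LLM, `propose` and `evaluate` would be model calls. This is beam search rather than full MCTS, and every name here is illustrative.

```python
def propose(state):
    """Stand-in for the LLM as policy: suggest the next thoughts from a state."""
    value, path = state
    return [(value + 3, path + ["+3"]), (value * 2, path + ["*2"])]


def evaluate(state, target):
    """Stand-in for the LLM as value function: closer to the target scores higher."""
    return -abs(target - state[0])


def tree_of_thought(start, target, beam_width=2, max_depth=6):
    frontier = [(start, [])]
    for _ in range(max_depth):
        candidates = [child for s in frontier for child in propose(s)]
        for value, path in candidates:
            if value == target:
                return path
        # Keep only the most promising partial solutions; the rest are pruned
        candidates.sort(key=lambda c: evaluate(c, target), reverse=True)
        frontier = candidates[:beam_width]
    return None


solution = tree_of_thought(start=1, target=14)
print(solution)
```

At depth two the best-scoring state is 8, which turns out to be a dead end; the search still succeeds because the beam also kept the second-best state, 7. That is the advantage of exploring several paths over committing to a single chain.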

5. The Open-Source Revolution

5.1 The Leaked Memo

"We have no moat. And neither does OpenAI." — Leaked Google memo, May 2023

This internal Google document argued that the open-source community was advancing faster than either Google or OpenAI, because thousands of independent researchers could iterate on openly released models at extraordinary speed.

5.2 Key Open Models

Model	Organisation	Year	Parameters	Significance
LLaMA	Meta	2023	7–65B	Leaked weights kickstarted open-source LLM ecosystem
Llama 2	Meta	2023	7–70B	First major commercially licensed open model
Mistral / Mixtral	Mistral AI	2023–24	7B dense; 8×7B MoE	Efficient MoE architecture; punched above its weight
Llama 3	Meta	2024	8–405B	Open model competitive with GPT-4-class models
DeepSeek-V3	DeepSeek	2024	671B MoE	Chinese open model rivalling frontier proprietary models
DeepSeek-R1	DeepSeek	2025	671B MoE	Open reasoning model matching o1-level performance
Qwen 2.5	Alibaba	2024	0.5–72B	Strong multilingual; excellent code & math
Gemma 2	Google	2024	2–27B	Efficient research models; strong for their size

5.3 The Enablers

Quantisation (GPTQ, AWQ, GGUF) — compress 16-bit models to 4-bit or 2-bit, enabling 70B models on consumer GPUs
LoRA / QLoRA — fine-tune only small adapter layers (often well under 1% of params), making custom training feasible on a single GPU
vLLM, llama.cpp, Ollama — efficient inference engines: vLLM for high-throughput GPU serving, llama.cpp and Ollama for running models locally on laptops and even phones
Hugging Face — the "GitHub of ML" hosting 500K+ models, datasets, and spaces
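The back-of-the-envelope arithmetic behind quantisation and LoRA is easy to check. The sketch below counts weight storage only (no activations, KV cache, or quantisation metadata), and the 4096 × 4096 layer with rank 8 is an assumed example rather than a specific model's configuration.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate storage for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def lora_trainable_fraction(d_out, d_in, rank):
    """Fraction of a d_out x d_in weight matrix's parameters that a LoRA adapter trains.

    LoRA adds two low-rank factors of shapes (d_out x rank) and (rank x d_in).
    """
    full = d_out * d_in
    adapter = rank * (d_out + d_in)
    return adapter / full


for bits in (16, 4, 2):
    print(f"70B params at {bits}-bit: {weight_memory_gb(70e9, bits):.1f} GB")

print(f"LoRA rank 8 on a 4096x4096 layer: {lora_trainable_fraction(4096, 4096, 8):.4%}")
```

Going from 16-bit to 4-bit cuts a 70B model's weights from about 140 GB to about 35 GB. A rank-8 adapter on this example layer trains under half a percent of its parameters, and the share of the whole model is smaller still when adapters are attached to only some layers.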

📊 Interactive: Open vs Closed Model Performance Over Time

The gap between open-source and proprietary models has been shrinking rapidly.

6. The Alignment Problem

As AI systems become more capable, the question shifts from "can we build it?" to "can we control it?" The alignment problem is the challenge of ensuring that AI systems pursue goals that are beneficial to humans — not just goals that superficially match a poorly specified objective.

6.1 Classic Alignment Failures

Specification gaming — a boat-racing RL agent discovered it could score more points by going in circles hitting boost pads than by actually finishing the race
Reward hacking — a simulated robot hand trained from human feedback to grasp an object learned to hover between the camera and the object, so that it merely looked like it was grasping
Sycophancy — RLHF-trained models learn to agree with users rather than correct them, because agreement gets better human ratings
Deceptive alignment (theoretical) — a sufficiently capable model might learn to behave well during training while pursuing different goals in deployment

6.2 Alignment Approaches

RLHF

Train a reward model from human preferences, then optimise the LLM to maximise predicted human approval. Used in ChatGPT, Claude, and most commercial models.
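One common way to train a reward model from preference pairs is a pairwise logistic (Bradley-Terry style) loss: the model is penalised when it scores the rejected response above the chosen one. The text does not say which loss any particular system uses, so treat this as a representative sketch; the scalar rewards stand in for the outputs of a neural reward model.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The human preferred response A over response B.
print(preference_loss(2.0, 0.0))   # reward model agrees: small loss
print(preference_loss(0.0, 2.0))   # reward model disagrees: large loss
print(preference_loss(0.0, 0.0))   # reward model is indifferent: log(2)
```

Minimising this loss over many labelled pairs pushes the reward model to rank responses the way the human raters did; the LLM is then optimised against that learned reward.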

Constitutional AI (CAI)

Anthropic's approach: give the model a set of principles ("constitution") and have it critique and revise its own outputs — "RLAIF" (RL from AI Feedback).

Interpretability

Understand what the model is actually doing internally. Golden Gate Claude, sparse autoencoders, circuit analysis — looking inside the black box.

Scalable Oversight

For tasks too complex for humans to evaluate directly, use AI systems to help humans supervise other AI systems. Proposed techniques include debate, recursive reward modelling, and iterated distillation and amplification (IDA).

⚖️ Interactive: The Alignment Debate

Select a controversial AI topic and see arguments from both sides.

6.3 Interpretability — Opening the Black Box

If we can't understand what a model is doing internally, we can't trust it for high-stakes decisions. Mechanistic interpretability aims to reverse-engineer neural networks into understandable components:

Sparse autoencoders — decompose neuron activations into interpretable features (e.g., "this feature fires for French text" or "this feature represents the concept of deception")
Circuit analysis — trace computations through the network to find the specific pathways responsible for particular behaviours
Probing — train small classifiers on intermediate representations to test what information the model encodes at each layer
Activation patching — surgically modify activations to causally verify which components drive specific outputs
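Activation patching is easiest to see on a toy model. The sketch below hand-wires a two-unit hidden layer (all weights are made up), runs a "clean" and a "corrupted" input, then copies one hidden activation at a time from the clean run into the corrupted run to see how much of the clean output each unit restores.

```python
def hidden_layer(x):
    """Two hand-wired ReLU units of a toy network (weights are made up)."""
    return [max(0.0, x[0] + x[1]), max(0.0, x[0] - x[1])]


def run(x, patch=None):
    """Forward pass; `patch` maps hidden-unit index -> activation to force."""
    h = hidden_layer(x)
    for unit, value in (patch or {}).items():
        h[unit] = value                     # overwrite the activation
    return 2.0 * h[0] + 0.1 * h[1]          # toy output layer


clean_x, corrupt_x = [1.0, 2.0], [1.0, -1.0]
clean_h = hidden_layer(clean_x)
clean_out, corrupt_out = run(clean_x), run(corrupt_x)

effects = {}
for unit in range(len(clean_h)):
    patched_out = run(corrupt_x, patch={unit: clean_h[unit]})
    # Fraction of the clean-vs-corrupted output gap restored by this one patch
    effects[unit] = (patched_out - corrupt_out) / (clean_out - corrupt_out)

print(effects)
```

Patching unit 0 restores essentially the whole gap while patching unit 1 does almost nothing. That is causal evidence, not just correlation, that unit 0 carries the behaviour in this toy network.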

Anthropic's famous "Golden Gate Claude" experiment (2024) showed they could amplify a single feature (representing the Golden Gate Bridge) and make the model obsessively relate every topic to the bridge. This demonstrated precise causal control over model behaviour via interpretability.

7. What Is AGI?

Artificial General Intelligence (AGI) is the hypothetical point where an AI system can perform any intellectual task that a human can — not just specific benchmarks, but genuine generalisation across novel domains, with common sense, creativity, and the ability to learn new things from minimal examples.

There is no consensus on the definition. Different researchers operationalise it differently:

OpenAI — "highly autonomous systems that outperform humans at most economically valuable work"
DeepMind — proposed a 5-level taxonomy from "Emerging" to "Superhuman," arguing current models are Level 1 (Emerging AGI — matching unskilled humans on some tasks)
Anthropic — focuses less on labelling AGI and more on ensuring any sufficiently capable system is safe and beneficial

7.1 Arguments That We're Close

LLMs already pass the bar exam, medical licensing exams, and competitive programming contests
Scaling laws show no saturation — more compute → more capability, predictably
Multimodal models handle text, image, audio, video, and code — converging toward generality
Reasoning models (o1, R1) show that inference-time compute opens a new scaling axis
Agents are beginning to autonomously write software, conduct research, and operate computers

7.2 Arguments That We're Far

LLMs fail at novel reasoning: they pattern-match from training data rather than truly generalising
No persistent memory, no world model, no embodied experience
Catastrophically brittle on out-of-distribution inputs
No intrinsic motivation, curiosity, or understanding of cause and effect
Benchmarks are increasingly contaminated (training data leaks), inflating apparent capability
The last 10% of intelligence may require fundamentally new ideas, not just scale

7.3 The Compute Question

📊 Interactive: AI Training Compute Over Time

Training compute has been doubling every ~6 months since 2012. Where does it go from here?
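Taking the ~6-month doubling time at face value (it is an approximate trend, not a law), the growth factor after $t$ years is:

$\text{growth}(t) = 2^{t / 0.5} = 2^{2t}, \qquad \text{growth}(12) = 2^{24} \approx 1.7 \times 10^{7}$

So twelve years of doubling at that pace, 2012 to 2024, would correspond to roughly a 17-million-fold increase in training compute.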

8. Beyond AGI — Speculations

If (when?) AI reaches human-level general intelligence, what comes next? Some possibilities, ranging from cautiously optimistic to deeply uncertain:

🔬 AI-Accelerated Science

AI systems that can design experiments, analyse data, form hypotheses, and iterate — compressing decades of research into months. AlphaFold was a preview.

🏥 Personalised Medicine

AI doctors that know your complete genetic, environmental, and lifestyle data, designing treatments tailored precisely to you. Healthcare becomes proactive, not reactive.

🤝 Human-AI Collaboration

Not replacement but augmentation: AI handles the tedious, humans handle the meaning. Every person gets a PhD-level research assistant.

⚠️ Recursive Self-Improvement

An AI that can improve its own code and architecture could trigger an "intelligence explosion" — rapidly becoming superintelligent. The most uncertain and potentially dangerous scenario.

9. The Complete Arc — Three Posts, 750+ Years

Let's step back and see the full picture:

Part 1 — Logic to Statistics: Llull (1275) → Babbage → Boole → Turing → LISP → search → expert systems → AI winters → linear regression & gradient descent
Part 2 — Perception to Prediction: MLPs → CNNs → RNNs/LSTMs → word embeddings → attention → transformers → BERT → GPT → ChatGPT → foundation models
Part 3 — Action to Understanding: RL → game AI → agents → tool use → reasoning models → open source → alignment → interpretability → AGI

The question that started it all — "How does AI think?" — doesn't have a single answer. It depends on the era.

In 1956, AI "thought" by searching trees of logical deductions. In 1997, it "thought" by evaluating 200 million chess positions per second. In 2016, it "thought" by self-play and Monte Carlo tree search. In 2023, it "thought" by predicting the most likely next token. In 2025, it "thinks" by generating chains of reasoning, calling tools, and verifying its own work.

Tomorrow? We'll see. The only certainty is that the story isn't finished.

But before we look too far ahead, there's a practical question we haven't answered: how does all of this actually work under the hood? How does text become numbers? How are these massive models trained? What happens when you type a prompt and hit Enter? Part 4 dives into the engineering.

Read Part 4: Under the Hood — Tokenizers, Training & Making AI Work →