Ivan Santoso - Blog/how-does-ai-think

Introduction — What Does "Thinking" Mean for a Machine?

When we say "AI thinks," most people picture a neural network crunching millions of numbers inside an NVIDIA GPU. But the story of artificial intelligence is much older than deep learning, older than computers themselves. It begins with philosophers asking whether reasoning could be mechanised — turned into a set of formal rules that any system, human or machine, could execute.

This post is a long, interactive walk through that history. We'll start with 13th-century combinatorial logic, pass through mechanical calculators, formal logic, Turing machines, the first AI programs, search trees, expert systems, and arrive at linear regression — the statistical method that bridged classical AI and the machine learning revolution.

Along the way you'll find interactive demos: play with Boolean logic gates, train a perceptron, talk to a tiny ELIZA chatbot, watch BFS vs DFS explore a tree, fit your own regression line with gradient descent, and drag a 3D loss surface in Three.js.

1. The Dream of Mechanical Thought

1.1 Ramon Llull's Ars Magna (1275)

The Catalan philosopher Ramon Llull designed the Ars Magna — a system of concentric rotating paper discs inscribed with fundamental concepts (goodness, greatness, truth, glory, etc.). By rotating the discs you could mechanically generate all possible combinations of concepts, producing statements like "Goodness is great" or "Truth is glorious."

It was naive by modern standards, but it was the first known attempt to produce logical conclusions through a mechanical device. Llull believed that if you could enumerate all true combinations, you could resolve any theological or philosophical dispute. His work influenced Leibniz four centuries later.

1.2 Leibniz & the Calculus Ratiocinator (1685)

Gottfried Wilhelm Leibniz — co-inventor of calculus — admired Llull's combinatorial approach and dreamed of taking it much further. He proposed two connected ideas:

Characteristica universalis — a universal formal language in which all human knowledge could be expressed precisely, like mathematical notation for everything.
Calculus ratiocinator — a machine (or algorithm) that could manipulate symbols in this language to derive new truths mechanically.

"When there are disputes among persons, we can simply say: Let us calculate, without further ado, and see who is right." — Leibniz

Leibniz also built the Stepped Reckoner (1694), one of the first mechanical calculators capable of multiplication and division via a drum gear mechanism. It was unreliable in practice — the carry mechanism jammed — but the vision of reducing thought to calculation was prophetic.

1.3 Charles Babbage & the Analytical Engine (1837)

The English mathematician Charles Babbage spent decades designing two engines:

Difference Engine (1822 design) — a specialised mechanical calculator for polynomial tables.
Analytical Engine (1837 design) — a general-purpose programmable computer with an arithmetic logic unit (the "Mill"), memory (the "Store"), control flow (conditional branching and loops), and input via punched cards (borrowed from Jacquard looms).

The Analytical Engine was never completed, but its design anticipated the architecture of modern computers by over a century. It is the earliest design that is Turing-complete (though that term wouldn't exist for another hundred years).

1.4 Ada Lovelace — The First Programmer (1843)

Augusta Ada King, Countess of Lovelace, translated a French-language article about the Analytical Engine by the Italian engineer Luigi Menabrea and added extensive notes of her own. Note G contained a step-by-step algorithm to compute Bernoulli numbers, widely regarded as the first published computer program.

More remarkably, she reflected on the Engine's potential and limits:

"The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform." — Ada Lovelace, Note G (1843)

This became known as Lady Lovelace's Objection — the claim that machines can never be truly creative or intelligent because they merely follow instructions. Alan Turing would directly address this objection a century later, arguing that apparently creative behaviour could emerge from rules complex enough that even their designer couldn't predict the outcome.

1.5 George Boole — The Laws of Thought (1854)

In An Investigation of the Laws of Thought, George Boole demonstrated that logical reasoning could be expressed as algebra:

TRUE = 1, FALSE = 0
AND = multiplication (1 × 1 = 1, anything else = 0)
OR = clipped addition (0 + 0 = 0, otherwise 1)
NOT = subtraction from 1

This was revolutionary: it meant that reasoning about truth and falsehood was a form of computation. Boolean algebra became the mathematical bedrock of digital circuits (Claude Shannon's 1937 master's thesis showed that switching circuits implement Boolean logic) and of every AI system built on formal logic.
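Boole's arithmetic reading of logic is easy to check by hand. The following is a minimal Python sketch of the four rules listed above; the function names (including the composed `XOR`) are illustrative choices, not Boole's notation.

```python
from itertools import product


def AND(a: int, b: int) -> int:
    """Conjunction as multiplication: 1 only when both inputs are 1."""
    return a * b


def OR(a: int, b: int) -> int:
    """Disjunction as addition clipped to 1."""
    return min(a + b, 1)


def NOT(a: int) -> int:
    """Negation as subtraction from 1."""
    return 1 - a


def XOR(a: int, b: int) -> int:
    """Exclusive-or built purely from the three primitives above."""
    return AND(OR(a, b), NOT(AND(a, b)))


if __name__ == "__main__":
    # Print the full truth table for two inputs.
    print("A B | AND OR XOR")
    for a, b in product((0, 1), repeat=2):
        print(f"{a} {b} |  {AND(a, b)}   {OR(a, b)}   {XOR(a, b)}")
```

Composing `XOR` from the primitives hints at why this matters for circuits: any truth table can be assembled from a handful of gates.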

🔬 Interactive: Boolean Logic Gate Simulator

Pick two inputs and a gate — see the output live. This is the fundamental building block of digital computation.

1.6 Gottlob Frege & Formal Predicate Logic (1879)

While Boole handled propositional logic (statements that are simply true or false), Gottlob Frege's Begriffsschrift (1879) created predicate logic — a formal system with variables, quantifiers (∀, ∃), and functions. This allowed statements like "for all x, if x is human then x is mortal" to be expressed and manipulated formally.

Frege's logic became the standard foundation for mathematics (via Russell & Whitehead's Principia Mathematica, 1910–1913) and, later, for the symbolic AI programs that would attempt to mechanise mathematical proof.

1.7 Gödel's Incompleteness Theorems (1931)

Kurt Gödel proved that in any consistent formal system powerful enough to express arithmetic, there exist true statements that cannot be proved within the system. This shattered the dream (pursued by David Hilbert) that all of mathematics could be captured by one complete, mechanically checkable set of axioms.

For AI, the implication is profound: there are fundamental limits to what any formal reasoning system can achieve. But in practice, most real-world problems don't hit these limits — they're hard because of combinatorial explosion, not incompleteness.

2. Alan Turing — Computability & Machine Intelligence

2.1 The Turing Machine (1936)

In "On Computable Numbers, with an Application to the Entscheidungsproblem", Alan Turing introduced an abstract model of computation: an infinite tape divided into cells, a read/write head that moves left or right, and a finite table of rules (the "program").

Though purely theoretical, the Turing Machine defined the boundary of what any algorithmic process can compute. The Church–Turing thesis (formulated independently by Alonzo Church using lambda calculus) states that any effectively calculable function can be computed by a Turing Machine. This is still accepted today as capturing the notion of "algorithm."
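To make the tape, head, and rule table concrete, here is a small simulator sketch. The rule encoding, the `run` function, and the `INCREMENT` program are assumptions made for illustration; Turing's paper uses a different notation.

```python
from typing import Dict, Tuple

BLANK = "_"
# (state, symbol read) -> (symbol to write, head move of -1 or +1, next state)
Rules = Dict[Tuple[str, str], Tuple[str, int, str]]

# Hypothetical example program: add 1 to a binary number, starting with the
# head on the rightmost digit.
INCREMENT: Rules = {
    ("carry", "1"): ("0", -1, "carry"),   # 1 + 1 = 0, carry moves left
    ("carry", "0"): ("1", -1, "halt"),    # absorb the carry and stop
    ("carry", BLANK): ("1", -1, "halt"),  # ran off the left edge: new digit
}


def run(rules: Rules, tape_input: str, state: str, head: int,
        halt_state: str = "halt", max_steps: int = 10_000) -> str:
    """Run the machine until it halts and return the non-blank tape contents."""
    # A dict stands in for the infinite tape: unwritten cells read as BLANK.
    tape = dict(enumerate(tape_input))
    steps = 0
    while state != halt_state:
        if steps >= max_steps:
            raise RuntimeError("step limit reached without halting")
        symbol = tape.get(head, BLANK)
        write, move, state = rules[(state, symbol)]
        tape[head] = write
        head += move
        steps += 1
    if not tape:
        return ""
    cells = range(min(tape), max(tape) + 1)
    return "".join(tape.get(i, BLANK) for i in cells).strip(BLANK)


if __name__ == "__main__":
    print(run(INCREMENT, "1011", "carry", head=3))  # 11 + 1 = 12 -> 1100
```

The `max_steps` guard is a practical concession: in general there is no way to decide in advance whether a given machine will ever halt.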

2.2 "Can Machines Think?" (1950)

In 1950, Turing published "Computing Machinery and Intelligence" in the journal Mind. He proposed the Imitation Game (now called the Turing Test): a human judge communicates via text with two respondents — one human, one machine. If the judge cannot reliably tell which is which, the machine has demonstrated intelligence.

Turing systematically addressed nine objections to machine intelligence, including:

The Theological Objection — God gave souls only to humans (Turing: why would God be so limited?)
The Mathematical Objection — Gödel's theorems set limits (Turing: humans also can't answer all questions)
Lady Lovelace's Objection — machines can only do what they're told (Turing: complex programs can surprise their creators)
The Argument from Consciousness — machines don't feel (Turing: we can't prove other humans feel either; the test sidesteps this)

Turing also speculated about building a "child machine" that would learn, rather than be explicitly programmed — anticipating machine learning by decades.

3. McCulloch & Pitts — The Artificial Neuron (1943)

Warren McCulloch and Walter Pitts published "A Logical Calculus of the Ideas Immanent in Nervous Activity". They showed that networks of simplified neuron models (binary threshold units) could compute any Boolean function, effectively implementing propositional logic.

The model: take binary inputs $x_1, x_2, \ldots, x_n$, multiply each by a weight $w_i$, sum them, and fire (output 1) if the sum meets a threshold $\theta$:

$y = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i x_i \geq \theta \\ 0 & \text{otherwise} \end{cases}$

The McCulloch-Pitts neuron: inputs × weights → sum → threshold → output
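The threshold formula above translates almost directly into code. This sketch hand-picks weights and thresholds for three gates; the specific values and function names are illustrative assumptions, and other choices work equally well.

```python
from typing import Sequence


def mp_neuron(inputs: Sequence[int], weights: Sequence[float], threshold: float) -> int:
    """Fire (return 1) when the weighted sum of inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0


def and_gate(a: int, b: int) -> int:
    return mp_neuron((a, b), (1, 1), threshold=2)   # both inputs must be on


def or_gate(a: int, b: int) -> int:
    return mp_neuron((a, b), (1, 1), threshold=1)   # one input is enough


def not_gate(a: int) -> int:
    return mp_neuron((a,), (-1,), threshold=0)      # an active input inhibits firing


def xor_network(a: int, b: int) -> int:
    """XOR needs a small network of units rather than a single one."""
    return and_gate(or_gate(a, b), not_gate(and_gate(a, b)))


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, and_gate(a, b), or_gate(a, b), xor_network(a, b))
```

Note that the weights here are set by hand. Nothing in this model learns.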

4. Claude Shannon — Information Theory (1948)

Claude Shannon's "A Mathematical Theory of Communication" (1948) founded information theory, introducing the concept of entropy as a measure of uncertainty:

$H(X) = -\sum_{i} p(x_i) \log_2 p(x_i)$
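A few lines of Python are enough to get a feel for the formula. This is a sketch with an illustrative function name; terms with zero probability are skipped, following the usual convention that $0 \log 0 = 0$.

```python
import math
from typing import Iterable


def entropy(probabilities: Iterable[float]) -> float:
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


if __name__ == "__main__":
    print(entropy([0.5, 0.5]))    # fair coin: 1 bit
    print(entropy([0.9, 0.1]))    # biased coin: about 0.469 bits
    print(entropy([0.25] * 4))    # four equally likely outcomes: 2 bits
```

The biased coin carries less than half a bit per flip because its outcome is mostly predictable, which is exactly the sense in which entropy measures uncertainty.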

Information theory became essential to AI via:

Decision trees — which split on the feature that maximally reduces entropy (information gain)
Language models — Shannon himself analysed English text as a stochastic process, estimating redundancy and entropy per character
Communication channels — the noisy-channel model underlies speech recognition, machine translation, and error-correcting codes

Shannon was also a co-organiser of the Dartmouth Conference and wrote one of the first papers on computer chess (1950), proposing both the "Type A" (brute-force) and "Type B" (selective search) strategies that would define computer chess for 50 years.

5. The Dartmouth Conference (1956)

In the summer of 1956, a small group gathered at Dartmouth College for a two-month workshop. The proposal, written by John McCarthy, Marvin Minsky, Nathaniel Rochester, and Claude Shannon, stated:

"We propose that a 2 month, 10 man study of artificial intelligence be carried out… The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it."

The conference didn't produce immediate breakthroughs, but it named the field and connected the researchers who would dominate it for decades. Two philosophical camps crystallised:

🧮 Symbolic / "Neat" AI

Intelligence = symbol manipulation. Programs reason by applying logical rules to structured representations of the world. Champions: McCarthy, Newell, Simon, Feigenbaum.

🧠 Connectionist / "Scruffy" AI

Intelligence emerges from networks of simple units connected in complex ways (like neurons). No explicit rules — behaviour is learned from data. Champions: Rosenblatt, later Hinton, Rumelhart.

This tension between engineered rules and learned representations is the central narrative of AI history.

6. The Logic Theorist — The First AI Program (1956)

Allen Newell, Herbert Simon, and programmer J.C. Shaw built the Logic Theorist at RAND Corporation on a JOHNNIAC computer. It proved theorems from Whitehead and Russell's Principia Mathematica using heuristic search.

Its three key techniques:

Symbolic representation — logical propositions encoded as list structures in memory
Heuristic search — exploring a tree of possible proof steps, trying the most promising paths first rather than exhaustive brute force
Means-ends analysis — comparing the current state to the goal and picking the operator that reduces the difference most

The Logic Theorist proved 38 of the first 52 theorems in Chapter 2 of Principia Mathematica. For Theorem 2.85, it found a proof more elegant than the one in the book. When Simon attempted to publish the result in the Journal of Symbolic Logic, the paper was rejected: the co-author was a computer program, and the editors didn't know what to make of it.

7. The General Problem Solver — GPS (1959)

Newell and Simon followed with GPS, intended to solve any problem expressible as initial state + goal state + operators. It used means-ends analysis in a loop:

Compare the current state to the goal — what's different?
Find an operator that reduces the biggest difference.
If the operator's preconditions aren't met, recursively solve that sub-problem.
Apply the operator and repeat.

GPS could solve toy puzzles (Tower of Hanoi, missionaries-and-cannibals) but struggled with real-world complexity. The lesson: general reasoning without domain knowledge hits a wall quickly. This insight would drive the expert systems movement a decade later.

8. LISP — The Language of AI (1958)

John McCarthy at MIT created LISP (LISt Processing) — the second-oldest high-level programming language still in use. LISP introduced concepts that became standard across all programming:

Recursive functions as first-class citizens
Dynamic typing and garbage collection
Homoiconicity — programs are lists; LISP can manipulate its own code as data
The if-then-else conditional expression (invented by McCarthy for LISP)
REPL — Read-Eval-Print Loop for interactive development

LISP became the language of AI for decades. Entire hardware platforms (LISP Machines, 1970s–80s) were designed to execute LISP efficiently. Modern descendants include Clojure, Racket, and Common Lisp.

9. The Perceptron — Learning from Data (1958)

While the symbolists built theorem provers, Frank Rosenblatt at Cornell took the connectionist path. His Perceptron was a hardware device (the Mark I Perceptron used 400 photocells randomly wired to a layer of artificial neurons) that could learn to classify visual patterns.

The key innovation over McCulloch-Pitts: automatic learning. Instead of a human setting the weights, the Perceptron adjusted them from labelled examples using a simple update rule:

$w_i \leftarrow w_i + \eta \cdot (y_{\text{true}} - y_{\text{pred}}) \cdot x_i$

Rosenblatt proved the Perceptron Convergence Theorem: if the training data is linearly separable, this rule is guaranteed to converge to a set of weights that correctly classifies all examples.
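The update rule and the convergence claim can both be tried out in a few lines. This is a minimal sketch, not Rosenblatt's hardware: it uses 0/1 labels, folds the threshold into a bias term (an assumption that is equivalent to $\theta = -\text{bias}$), and defaults to $\eta = 1$ so the arithmetic stays exact.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[int], int]  # (inputs, true label in {0, 1})


def predict(weights: Sequence[float], bias: float, x: Sequence[int]) -> int:
    """Threshold unit: fire when the weighted sum plus bias is non-negative."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias >= 0 else 0


def train_perceptron(samples: List[Sample], eta: float = 1.0,
                     max_epochs: int = 100) -> Tuple[List[float], float, bool]:
    """Apply w_i <- w_i + eta * (y_true - y_pred) * x_i until an epoch has no mistakes."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y_true in samples:
            error = y_true - predict(weights, bias, x)
            if error != 0:
                mistakes += 1
                weights = [w + eta * error * xi for w, xi in zip(weights, x)]
                bias += eta * error  # the bias behaves like a weight on a constant input of 1
        if mistakes == 0:
            return weights, bias, True   # converged: every sample classified correctly
    return weights, bias, False          # gave up: data may not be linearly separable


AND_DATA: List[Sample] = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_DATA: List[Sample] = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

if __name__ == "__main__":
    print(train_perceptron(AND_DATA))      # converges after a few epochs
    print(train_perceptron(XOR_DATA)[2])   # False: no separating line exists
```

AND is linearly separable, so training stops after a handful of passes. XOR is not, so the loop runs out of epochs without ever classifying all four points correctly.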

The New York Times reported (1958): "The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence." The hype cycle had begun.

🧪 Interactive: Perceptron Learning

Click the canvas to place blue (+1) or red (−1) points. The perceptron learns a separating line in real time. Toggle the class with the button, then train manually or automatically.

9.1 The XOR Problem & "Perceptrons" (1969)

In 1969, Marvin Minsky and Seymour Papert published Perceptrons, rigorously proving that a single-layer perceptron cannot learn XOR — or any function that isn't linearly separable.

While multi-layer networks could theoretically solve XOR, no efficient training algorithm was widely known at the time (backpropagation would not be popularised until 1986). The book's influence was enormous: funding for neural network research evaporated, and the field entered its first winter. Connectionism wouldn't recover for nearly two decades.

10. ELIZA — The First Chatbot (1966)

Joseph Weizenbaum at MIT created ELIZA, simulating a Rogerian psychotherapist. The program used pattern matching and keyword substitution: it scanned user input for trigger words (like "mother," "feel," "I am"), applied transformation rules, and reflected statements back as questions.
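The whole trick fits in a short script. The sketch below is a toy in the spirit of ELIZA, not Weizenbaum's original script: the three rules, the pronoun table, and the fallback reply are all invented here for illustration.

```python
import re

# Pronoun swaps applied to the captured fragment so it can be echoed back.
REFLECTIONS = {"i": "you", "me": "you", "my": "your", "am": "are",
               "you": "I", "your": "my"}

# (trigger pattern, response template); the first matching rule wins.
RULES = [
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bi feel (.+)", re.IGNORECASE), "How long have you felt {0}?"),
    (re.compile(r"\bmy (mother|father)\b", re.IGNORECASE), "Tell me more about your {0}."),
]
DEFAULT_REPLY = "Please go on."


def reflect(fragment: str) -> str:
    """Swap first- and second-person words in the user's fragment."""
    return " ".join(REFLECTIONS.get(word, word) for word in fragment.lower().split())


def respond(text: str) -> str:
    """Return the template of the first rule whose keyword pattern matches."""
    cleaned = text.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(cleaned)
        if match:
            return template.format(reflect(match.group(1)))
    return DEFAULT_REPLY


if __name__ == "__main__":
    print(respond("I am sad about my job."))  # Why do you say you are sad about your job?
    print(respond("My mother ignores me"))    # Tell me more about your mother.
    print(respond("Nice weather today"))      # Please go on.
```

There is no model of the conversation here at all: each reply depends only on the surface text of the last message.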

ELIZA had zero understanding of anything. Yet people interacting with it became emotionally attached. Weizenbaum's own secretary asked him to leave the room so she could have a private conversation with the program. This disturbed Weizenbaum deeply.

"What I had not realized is that extremely short exposures to a relatively simple computer program could induce powerful delusional thinking in quite normal people." — Joseph Weizenbaum, 1976

ELIZA raises a question that's still relevant in the age of ChatGPT: how much of "intelligence" is in the system, and how much is projection from the user?

💬 Interactive: Talk to Mini-ELIZA

A tiny recreation of ELIZA's pattern-matching approach. No AI, no neural networks — just regex and reflection.

Hello. I am ELIZA. How are you feeling today?

11. Search Algorithms — How AI Explores

At its core, classical AI reduces intelligence to representation (encoding the world as states, actions, and goals) and search (exploring the space of possible solutions efficiently).

11.1 Uninformed Search: BFS & DFS

Breadth-First Search (BFS) expands all neighbours at the current depth before going deeper, like ripples from a stone in a pond. It guarantees the shortest path (in unweighted graphs) but can consume enormous memory, because the frontier grows exponentially with depth.

Depth-First Search (DFS) dives deep along one branch before backtracking. It uses only $O(bd)$ memory (where $b$ is the branching factor and $d$ is the maximum depth of the search) but may find a longer path or, without a depth limit, get stuck in infinite branches.
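In code, the two strategies differ only in which end of the frontier the next node is taken from. A minimal sketch on a hypothetical seven-node binary tree (the tree and function names are made up for illustration):

```python
from collections import deque
from typing import Dict, List

# Hypothetical binary tree: A is the root, D..G are leaves.
TREE: Dict[str, List[str]] = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F", "G"],
    "D": [], "E": [], "F": [], "G": [],
}


def bfs_order(graph: Dict[str, List[str]], start: str) -> List[str]:
    """Visit nodes level by level using a FIFO queue."""
    visited, order, frontier = {start}, [], deque([start])
    while frontier:
        node = frontier.popleft()          # oldest entry first
        order.append(node)
        for child in graph[node]:
            if child not in visited:
                visited.add(child)
                frontier.append(child)
    return order


def dfs_order(graph: Dict[str, List[str]], start: str) -> List[str]:
    """Visit nodes branch by branch using a LIFO stack."""
    visited, order, frontier = set(), [], [start]
    while frontier:
        node = frontier.pop()              # newest entry first
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        # Push children in reverse so the leftmost child is explored first.
        frontier.extend(reversed(graph[node]))
    return order


if __name__ == "__main__":
    print(bfs_order(TREE, "A"))  # ['A', 'B', 'C', 'D', 'E', 'F', 'G']
    print(dfs_order(TREE, "A"))  # ['A', 'B', 'D', 'E', 'C', 'F', 'G']
```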

🌳 Interactive: BFS vs DFS Tree Traversal

Watch how BFS (queue) and DFS (stack) explore a binary tree differently. Toggle between them and step through.

11.2 A* Search (1968)

Published by Peter Hart, Nils Nilsson, and Bertram Raphael at Stanford Research Institute. A* evaluates each node with:

$f(n) = g(n) + h(n)$

where $g(n)$ = cost from start to $n$, and $h(n)$ = heuristic estimate of remaining cost. If $h$ never overestimates (admissible), A* is guaranteed optimal. If $h$ also satisfies the triangle inequality (consistent), each node is expanded at most once.
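Here is a compact sketch of A* on a small grid, assuming four-way movement with unit step cost and the Manhattan distance as $h$ (which is admissible and consistent under those assumptions). The grid layout and all names are illustrative.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]  # (row, column)


def manhattan(a: Cell, b: Cell) -> int:
    """Heuristic h(n): grid distance ignoring walls, so it never overestimates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def a_star(grid: List[str], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """Return a shortest path from start to goal, or None. '#' marks a wall."""
    rows, cols = len(grid), len(grid[0])
    open_heap = [(manhattan(start, goal), 0, start)]  # entries are (f, g, cell)
    best_g: Dict[Cell, int] = {start: 0}
    parent: Dict[Cell, Cell] = {}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if g > best_g[cell]:
            continue  # stale heap entry: a cheaper route to this cell was found later
        if cell == goal:
            path = [cell]
            while cell in parent:  # walk the parent links back to the start
                cell = parent[cell]
                path.append(cell)
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                new_g = g + 1
                if new_g < best_g.get(nxt, float("inf")):
                    best_g[nxt] = new_g
                    parent[nxt] = cell
                    heapq.heappush(open_heap, (new_g + manhattan(nxt, goal), new_g, nxt))
    return None


GRID = [
    "..#.",
    "..#.",
    "....",
]

if __name__ == "__main__":
    print(a_star(GRID, (0, 0), (0, 3)))  # must detour through the bottom row
```

On this grid the straight-line distance between the corners is 3, but the wall forces a 7-step route. A* finds it while the heuristic keeps the search pointed towards the goal.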

A* is still used in GPS navigation, video game pathfinding (the navmesh industry standard), robotics motion planning, and even protein folding pipelines.

11.3 Game Trees — Minimax & Alpha-Beta Pruning

In adversarial settings (chess, Go, etc.), you can't just search for the best path — your opponent is actively trying to make you lose. The solution:

Minimax: two players alternate. The "maximiser" picks the move with the highest value, the "minimiser" picks the lowest. The tree is explored to some depth, where leaf nodes are scored by a heuristic (evaluation function).
Alpha-Beta Pruning (first described 1958–1963): maintains two bounds (α and β) and prunes branches that can't possibly influence the root decision. With good move ordering this reduces the effective branching factor from $b$ to roughly $\sqrt{b}$, roughly doubling the feasible search depth for the same effort.
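Both ideas fit in one short recursive function. In this sketch a game tree is a nested list whose leaves are already-evaluated scores; the example tree and the `visited` bookkeeping are illustrative assumptions, added only so the pruning can be observed.

```python
import math
from typing import List, Optional, Union

Tree = Union[int, List["Tree"]]  # a leaf score, or a list of child subtrees


def alphabeta(node: Tree, maximising: bool,
              alpha: float = -math.inf, beta: float = math.inf,
              visited: Optional[List[int]] = None) -> float:
    """Minimax value of node, skipping branches that cannot change the result."""
    if isinstance(node, int):
        if visited is not None:
            visited.append(node)  # record which leaves were actually evaluated
        return node
    if maximising:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta, visited))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # the minimiser above already has a better option
        return value
    value = math.inf
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta, visited))
        beta = min(beta, value)
        if alpha >= beta:
            break  # the maximiser above already has a better option
    return value


# Maximiser to move at the root, minimiser replies, then leaf scores.
GAME_TREE: Tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]

if __name__ == "__main__":
    seen: List[int] = []
    print(alphabeta(GAME_TREE, True, visited=seen))  # 3
    print(seen)  # [3, 12, 8, 2, 14, 5, 2]: leaves 4 and 6 were never examined
```

Once the second subtree reveals a 2, the minimiser can hold the maximiser to at most 2 there, which is already worse than the 3 available from the first subtree. The remaining leaves in that subtree are irrelevant and are skipped.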

Deep Blue (1997) used alpha-beta search with hardware-accelerated evaluation, examining 200 million positions per second. It evaluated board positions with hand-tuned features: material balance, king safety, pawn structure, piece mobility. No learning. No neural nets. Just search and knowledge.

12. Expert Systems — The Knowledge Revolution (1970s–1980s)

Edward Feigenbaum at Stanford championed the idea that for real-world problems, domain knowledge matters more than general reasoning. The result: expert systems that encoded specialist knowledge as IF-THEN rules.

DENDRAL (1965)

Identified molecular structures from mass spectrometry. The first expert system, built at Stanford.

MYCIN (1976)

Diagnosed bacterial infections with ~600 rules. Performed comparably to human infectious disease experts.

R1/XCON (1980)

Configured VAX computer orders for DEC. Saved $40M/year. One of the first commercial AI successes.

Cyc (1984–present)

Doug Lenat's attempt to encode all common-sense knowledge. Millions of rules. Still ongoing 40+ years later.

Expert systems were transparent (every conclusion had a traceable rule chain) but brittle: novel situations outside the rules caused silent failures. Building and maintaining knowledge bases was labour-intensive. This problem became known as the knowledge acquisition bottleneck.
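Both properties, the traceable rule chain and the silent failure on unfamiliar input, show up even in a toy rule engine. The sketch below uses simple forward chaining over a made-up animal-classification rule base; it is not how any of the systems above were actually implemented (MYCIN, for instance, also attached certainty factors to its rules).

```python
from typing import List, Set, Tuple

Rule = Tuple[str, Set[str], str]  # (rule name, IF all of these facts, THEN this fact)

# Hypothetical toy knowledge base.
RULES: List[Rule] = [
    ("R1", {"has_feathers"}, "is_bird"),
    ("R2", {"is_bird", "swims", "cannot_fly"}, "is_penguin"),
    ("R3", {"has_fur", "says_woof"}, "is_dog"),
]


def forward_chain(facts: Set[str], rules: List[Rule]) -> Tuple[Set[str], List[str]]:
    """Fire rules until nothing new can be derived; return facts and an explanation trace."""
    known = set(facts)
    trace: List[str] = []
    changed = True
    while changed:
        changed = False
        for name, conditions, conclusion in rules:
            # A rule fires when all its conditions are known and it adds something new.
            if conclusion not in known and conditions <= known:
                known.add(conclusion)
                trace.append(f"{name}: {sorted(conditions)} -> {conclusion}")
                changed = True
    return known, trace


if __name__ == "__main__":
    _, explanation = forward_chain({"has_feathers", "swims", "cannot_fly"}, RULES)
    print("\n".join(explanation))
    # An input outside the rule base yields no conclusions and no warning.
    print(forward_chain({"has_scales"}, RULES))
```

The first query produces a two-step justification (R1, then R2). The second simply returns the input unchanged, which is the brittleness in miniature: the system has no way to say "I don't know this domain".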

13. The AI Winters

AI history has two major "winters" — periods where hype collapsed, funding evaporated, and researchers either left the field or rebranded their work:

First AI Winter (1974–1980) — Triggered by the Lighthill Report (UK, 1973), which concluded that "combinatorial explosion" made most AI intractable, and criticised the field's failure to deliver on grand promises. UK cut almost all AI funding. The US also significantly reduced DARPA support.
Second AI Winter (1987–1993) — Expert system companies collapsed when customers realised the systems were expensive to build and brittle to maintain. The LISP Machine market died as cheaper conventional hardware caught up. Japan's ambitious Fifth Generation Computer Project (1982–1992) was deemed a failure. The word "AI" became toxic in grant proposals; researchers rebranded as "machine learning," "data mining," or "intelligent systems."

These winters shaped the culture of modern ML: the field became deeply empirical, benchmark-driven, and sceptical of grand claims. "Show me the numbers on the test set" replaced "imagine what this could do."

14. The Quiet Revolution

While the AI label was toxic, important work continued under different names. The common thread: let the data decide. Instead of encoding expert knowledge as rules, provide labelled examples and let the algorithm discover patterns.

14.1 Bayesian Methods

Bayes' theorem (Thomas Bayes, 1763) provides a principled way to update beliefs given new evidence:

$P(H \mid E) = \frac{P(E \mid H) \cdot P(H)}{P(E)}$

Bayesian methods power spam filters (Naive Bayes, 1998), medical diagnosis systems, and modern probabilistic programming languages.
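A single-word spam check shows the update in action. The probabilities below are invented purely for illustration, and the function name is an assumption. A real Naive Bayes filter combines many words and estimates these numbers from data.

```python
def posterior(prior: float, likelihood: float, likelihood_if_not: float) -> float:
    """P(H | E) from P(H), P(E | H) and P(E | not H), via Bayes' theorem."""
    # P(E) expanded with the law of total probability.
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / evidence


if __name__ == "__main__":
    # Made-up numbers: 20% of mail is spam; the word "prize" appears in
    # 60% of spam messages and 5% of legitimate ones.
    p_spam_given_word = posterior(prior=0.2, likelihood=0.6, likelihood_if_not=0.05)
    print(round(p_spam_given_word, 3))  # 0.75
```

Seeing one suspicious word moves the belief from 20% to 75%: $0.6 \times 0.2 = 0.12$ in the numerator, and $0.12 + 0.05 \times 0.8 = 0.16$ in the denominator.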

14.2 Decision Trees (1984, 1993)

CART (Breiman et al., 1984) and C4.5 (Quinlan, 1993) automatically learned IF-THEN rules from data by recursively splitting on the feature that maximally reduced impurity (Gini index or information gain). These were essentially automatically generated expert systems.
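The core step, choosing which feature to split on, can be sketched with information gain. This is only the split criterion on a made-up four-row dataset, not a full CART or C4.5 implementation (C4.5, for example, refines the criterion into a gain ratio and adds pruning).

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Entropy in bits of the class labels in one group."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows: List[Dict[str, Hashable]], labels: List[str], feature: str) -> float:
    """Entropy before the split minus the size-weighted entropy after it."""
    groups: Dict[Hashable, List[str]] = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_split(rows: List[Dict[str, Hashable]], labels: List[str]) -> str:
    """Pick the feature whose split reduces entropy the most."""
    return max(rows[0].keys(), key=lambda f: information_gain(rows, labels, f))


# Hypothetical toy data: outlook decides the label, wind is irrelevant.
ROWS = [
    {"outlook": "sunny", "windy": True},
    {"outlook": "sunny", "windy": False},
    {"outlook": "overcast", "windy": True},
    {"outlook": "overcast", "windy": False},
]
LABELS = ["stay", "stay", "play", "play"]

if __name__ == "__main__":
    for feature in ("outlook", "windy"):
        print(feature, information_gain(ROWS, LABELS, feature))
    print(best_split(ROWS, LABELS))  # outlook
```

Splitting on `outlook` removes all uncertainty (gain of 1 bit) while `windy` removes none, so the learned rule is effectively "IF outlook is overcast THEN play".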

14.3 Backpropagation — Neural Nets Return (1986)

Rumelhart, Hinton & Williams published the modern formulation of backpropagation, showing how to train multi-layer neural networks by propagating error gradients backwards through the network using the chain rule. This solved the XOR problem and revived connectionism.

However, compute power was still limited. Deep networks were slow to train, and SVMs (Vapnik, 1995) delivered comparable results with stronger theoretical guarantees. Neural networks wouldn't dominate until GPUs and massive datasets arrived circa 2012.

14.4 Hidden Markov Models & Speech

Hidden Markov Models (HMMs) powered speech recognition at IBM and Bell Labs through the 1970s–2000s. The Viterbi algorithm (1967) efficiently found the most likely sequence of hidden states, enabling systems like IBM ViaVoice and Dragon NaturallySpeaking.
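The Viterbi algorithm is a short dynamic programme. The sketch below runs it on a tiny two-state model with made-up probabilities. Speech systems apply the same idea to far larger models and typically work with log-probabilities to avoid numerical underflow.

```python
from typing import Dict, List, Optional, Tuple


def viterbi(observations: List[str], states: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]]) -> Tuple[List[str], float]:
    """Return the most likely hidden-state sequence and its probability."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best: List[Dict[str, Tuple[float, Optional[str]]]] = [
        {s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}
    ]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            # Best way to arrive in state s from any previous state p.
            prob, source = max((prev[p][0] * trans_p[p][s], p) for p in states)
            column[s] = (prob * emit_p[s][obs], source)
        best.append(column)
    # Pick the best final state, then follow the back-pointers.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return path[::-1], best[-1][last][0]


# Toy model with illustrative numbers only.
STATES = ["Healthy", "Fever"]
START = {"Healthy": 0.6, "Fever": 0.4}
TRANS = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
         "Fever": {"Healthy": 0.4, "Fever": 0.6}}
EMIT = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
        "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

if __name__ == "__main__":
    print(viterbi(["normal", "cold", "dizzy"], STATES, START, TRANS, EMIT))
    # (['Healthy', 'Healthy', 'Fever'], approximately 0.01512)
```

The work grows linearly with the length of the observation sequence, rather than exponentially as it would if every possible state sequence were scored separately.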

14.5 Support Vector Machines (1995)

Vladimir Vapnik and colleagues introduced SVMs, which find the maximum-margin hyperplane separating two classes. Combined with the "kernel trick" (implicitly mapping data into a higher-dimensional space where a linear separator may exist), SVMs were the state-of-the-art classifier for a decade. They dominated competitions like handwritten digit recognition (MNIST) until deep learning surpassed them.

15. Linear Regression — Where Statistics Meets AI

Linear regression is one of the oldest and most important algorithms in all of data science. Invented by Adrien-Marie Legendre (1805, method of least squares) and independently by Carl Friedrich Gauss (1809), it predates AI by 150 years but is the conceptual bridge between classical statistics and modern machine learning.

Why does linear regression matter for AI? Because:

It's the simplest example of a model that learns from data
It can be trained by gradient descent, the same basic optimisation procedure that (in more elaborate variants) trains neural networks, from a single perceptron to GPT-4
It introduces the loss function concept — the quantitative measure of "how wrong is the model?" that drives all of modern ML
It demonstrates overfitting vs generalisation, the central trade-off of machine learning

15.1 The Model

Given $n$ data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, find the line $\hat{y} = wx + b$ that best fits the data. "Best" means minimising the Mean Squared Error (MSE):

$\text{MSE}(w, b) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (wx_i + b)\right)^2$

This has a beautiful closed-form solution (the normal equation):

$w = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}, \quad b = \bar{y} - w\bar{x}$
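The closed form is only a few lines of code. A minimal sketch, with an illustrative function name and made-up data points:

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope w and intercept b for y = w*x + b (one feature)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    w = s_xy / s_xx
    return w, y_bar - w * x_bar


if __name__ == "__main__":
    print(fit_line([1, 2, 3, 4], [3, 5, 7, 9]))        # points on y = 2x + 1 -> (2.0, 1.0)
    print(fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]))  # noisy points -> roughly (0.6, 2.2)
```

For the noisy example, $\bar{x} = 3$ and $\bar{y} = 4$, the numerator sums to 6 and the denominator to 10, giving $w = 0.6$ and $b = 4 - 0.6 \times 3 = 2.2$.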

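But the multi-feature generalisation of this formula requires solving a $d \times d$ linear system for $d$ features, which doesn't scale well to millions of features. In practice, we use gradient descent.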

15.2 Gradient Descent — The Engine of Modern ML

Gradient descent is an iterative optimisation algorithm. At each step, it computes the gradient (direction of steepest ascent) of the loss function and takes a step in the opposite direction:

$w \leftarrow w - \eta \frac{\partial \text{MSE}}{\partial w}, \quad b \leftarrow b - \eta \frac{\partial \text{MSE}}{\partial b}$

For MSE with a linear model, the partial derivatives are:

$\frac{\partial \text{MSE}}{\partial w} = \frac{-2}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i) \cdot x_i, \quad \frac{\partial \text{MSE}}{\partial b} = \frac{-2}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)$

The learning rate $\eta$ controls step size: too large → oscillation or divergence; too small → painfully slow convergence. Choosing it well is crucial.
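The update rule and the two partial derivatives above map onto code one-to-one. This is a plain batch gradient descent sketch on made-up data; the function names, the starting point of $w = b = 0$, and the fixed step count are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def mse(w: float, b: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Mean squared error of the line y = w*x + b on the data."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_step(w: float, b: float, xs: Sequence[float], ys: Sequence[float],
                  eta: float) -> Tuple[float, float]:
    """One update of w and b against the MSE gradient."""
    n = len(xs)
    residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
    dw = (-2 / n) * sum(r * x for r, x in zip(residuals, xs))  # dMSE/dw
    db = (-2 / n) * sum(residuals)                             # dMSE/db
    return w - eta * dw, b - eta * db


def fit(xs: Sequence[float], ys: Sequence[float], eta: float = 0.05,
        steps: int = 2000) -> Tuple[float, float]:
    """Run a fixed number of gradient steps starting from w = b = 0."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        w, b = gradient_step(w, b, xs, ys, eta)
    return w, b


if __name__ == "__main__":
    xs, ys = [1, 2, 3, 4], [3, 5, 7, 9]  # points on y = 2x + 1
    w, b = fit(xs, ys)
    print(round(w, 4), round(b, 4), mse(w, b, xs, ys))  # close to 2, 1 and 0
```

On this dataset a learning rate of 0.05 converges comfortably, while anything above roughly 0.12 makes the updates overshoot and blow up. Trying `fit(xs, ys, eta=0.2, steps=50)` is a quick way to see the divergence described above.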

📈 Interactive: Fit a Linear Regression with Gradient Descent

Click the chart to add data points (or click "Add Random"). Then watch gradient descent fit the best line. The loss chart on the right shows MSE decreasing over training steps.

15.3 The MSE Loss Surface in 3D

For linear regression with one feature, the loss function $\text{MSE}(w, b)$ is a surface over the two-dimensional parameter space. Because MSE is a quadratic function of $w$ and $b$, this surface is a convex paraboloid (bowl shape) whenever the $x$ values are not all identical, so gradient descent with a sufficiently small learning rate converges to the global minimum.

The interactive visualisation below shows this loss surface in 3D. The red sphere starts at a random point and rolls downhill via gradient descent. Drag to rotate the view.

🏔️ Interactive: 3D Loss Surface (Three.js)

Drag to orbit the camera. Click "Descend" to restart gradient descent from a random starting point. Watch how the learning rate affects the path.

16. AI Funding & Progress — A Visual History

The chart below traces the approximate arc of AI research funding and major breakthroughs. Hover over the red markers for milestone labels. Notice the two winter dips and the exponential climb after the deep learning revolution (~2012).

📊 Interactive: AI Progress Timeline (Chart.js)

17. Expanded Timeline — 750 Years of Machine Intelligence

Year	Milestone	Why It Matters
1275	Ramon Llull's Ars Magna	First mechanical reasoning device
1694	Leibniz's Stepped Reckoner	Mechanical ×/÷ calculator
1805	Legendre's least squares	Linear regression — foundation of ML
1837	Babbage's Analytical Engine	First general-purpose computer design
1843	Ada Lovelace's Note G	First algorithm + machine creativity debate
1854	Boole's Laws of Thought	Logic as algebra
1879	Frege's Begriffsschrift	Predicate logic — formal reasoning with variables
1931	Gödel's Incompleteness	Fundamental limits of formal systems
1936	Turing Machine	Defines computability
1943	McCulloch-Pitts neuron	Mathematical model of biological neurons
1948	Shannon's Information Theory	Entropy, information, communication limits
1950	Turing's "Can Machines Think?"	The Turing Test
1956	Logic Theorist + Dartmouth	First AI program; field is born & named
1958	Perceptron & LISP	Learning from data + the language of AI
1959	General Problem Solver	Means-ends analysis for any domain
1965	DENDRAL	First expert system
1966	ELIZA	First chatbot
1967	Viterbi algorithm	Efficient HMM decoding — powers speech recognition
1968	A* search	Optimal heuristic search
1969	Minsky & Papert's Perceptrons	XOR problem → first neural net winter
1974–80	First AI Winter	Lighthill Report; funding collapses
1976	MYCIN	Expert medical diagnosis
1980	R1/XCON	First commercially successful expert system
1984	CART decision trees	Learning rules from data automatically
1986	Backpropagation	Training multi-layer networks
1987–93	Second AI Winter	Expert systems collapse
1995	SVMs (Vapnik)	Maximum-margin classifiers dominate
1997	Deep Blue beats Kasparov	Search + evaluation triumphs

18. The Big Picture — From Logic to Learning

This post has covered over 750 years of intellectual history:

Mechanical reasoning (Llull → Leibniz → Babbage) — Can thought be encoded in gears?
Formal logic (Boole → Frege → Gödel → Turing) — Reasoning as symbol manipulation; defining the limits of computation.
Symbolic AI (Logic Theorist → GPS → Expert Systems) — Hand-crafted rules + heuristic search.
Connectionism (McCulloch-Pitts → Perceptron → XOR crisis → Backprop) — Simplified neurons that learn from examples.
Statistical methods (Legendre → Bayes → Decision Trees → SVMs) — Let the data decide; the machine finds its own rules.

Linear regression might seem humble next to GPT-4, but the conceptual leap it represents is enormous: instead of a human encoding rules, the machine discovers its own parameters from data. And the method it uses, gradient descent, is in essence the same procedure that trains neural networks today, from a 2-weight linear model to a trillion-parameter LLM.

The next chapter — deep learning, convolutional nets, transformers, and the age of foundation models — is built on everything we've covered here. Read Part 2 →
