Ivan Santoso - Blog/how-does-ai-think-part-5

📖 This is Part 5 of the "How Does AI Think?" series. Part 1 · Part 2 · Part 3 · ← Part 4: Under the Hood

Introduction — From Theory to Practice

Parts 1–4 traced the ideas, architectures, and engineering behind AI. But understanding how transformers work is different from knowing how to build something useful with them.

This final part bridges that gap. We'll explore where AI is making a real impact today, learn the building blocks every developer needs (RAG, vector databases, embeddings, agents), tour the open-source ecosystem, and build an intuition for choosing the right model and deployment strategy for your project.

Less theory, more practice. Let's go.

1. Real-World AI Applications (2024–2026)

Beyond the hype, AI is transforming specific domains in measurable ways. Here are the areas where the impact is already undeniable:

🏗️ Interactive: AI Application Gallery

Click a category to see how AI is being deployed in that domain.

1.1 The Impact Numbers

📊 Interactive: AI Market Size & Adoption

2. The RAG Pattern

LLMs have a fundamental limit: they only know what was in their training data (up to a cutoff date). They can't access your company's documents, today's news, or your personal notes — unless you give them that information at inference time.

RAG (Retrieval-Augmented Generation) solves this by combining a retriever (finds relevant documents) with a generator (the LLM that uses those documents to answer):

User asks a question

"What was our Q3 revenue in Asia?"

Embed the query

Convert the question into a vector using an embedding model

Search the vector database

Find the most similar document chunks by cosine similarity

Augment the prompt

Insert the retrieved chunks into the LLM's context: "Based on these documents: [chunks], answer: [question]"

Generate the answer

The LLM synthesises a grounded answer from the retrieved context

2.1 Why RAG Matters

No hallucination — the model cites actual documents, not fabricated facts
Up-to-date — add new documents anytime, no retraining needed
Domain-specific — works with your internal data (legal, medical, financial)
Auditable — you can see exactly which documents the answer came from
Cost-effective — cheaper than fine-tuning and much cheaper than pre-training

🔎 Interactive: RAG Pipeline Simulator

Type a query and see how RAG retrieves relevant chunks from a document store, then feeds them to the LLM to produce a grounded answer.

2.2 RAG Best Practices

Chunk Size Matters

Too small → missing context. Too large → noise dilutes relevance. Typical: 256–1024 tokens per chunk with 50-token overlap.

Hybrid Search

Combine vector search (semantic similarity) with keyword search (BM25) for best results. Reranking with a cross-encoder further improves quality.

Contextual Retrieval

Anthropic's technique: prepend a summary of the document context to each chunk before embedding. Reduces retrieval failures by 49%.

Evaluation

Measure retrieval quality (recall@k) and answer quality (faithfulness, relevance). Tools: RAGAS, TruLens, LangSmith for automated RAG evaluation.

3. Vector Similarity Search

At the heart of RAG is vector search: finding the most similar vectors in a database. The query and all documents are converted to vectors by an embedding model, then similarity is measured with:

$\text{cosine\_sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}$

Two texts about similar topics will have cosine similarity close to 1.0, even if they use completely different words.

🎯 Interactive: Vector Similarity Search

Type a query and see which documents are most similar in embedding space. The 2D projection shows how documents cluster by topic.

3.1 Popular Vector Databases

Database	Type	Max Vectors	Best For
Pinecone	Managed cloud	Billions	Production SaaS, zero-ops
Weaviate	Open source / cloud	Billions	Hybrid search, multi-modal
Qdrant	Open source / cloud	Billions	Filtering, payload-rich data
ChromaDB	Open source (local)	Millions	Prototyping, local dev
pgvector	PostgreSQL extension	Millions	Existing Postgres stacks
FAISS	Library (Meta)	Billions	Research, custom pipelines

3.2 Embedding Models

The quality of your vector search depends on the embedding model. Modern embedding models are small transformers optimised for semantic similarity:

OpenAI text-embedding-3-large — 3072 dimensions, excellent general performance
Cohere embed-v3 — multilingual, supports semantic search and classification
BGE / GTE (open source) — competitive with commercial models, run locally
Nomic Embed — open source, 8K context, strong performance on MTEB benchmark
Jina Embeddings v3 — multilingual, 8K context, multiple task adapters

4. Building AI Applications

The ecosystem of tools for building AI-powered applications has exploded. Here's the landscape:

4.1 LLM Frameworks

LangChain / LangGraph

The Swiss Army knife: chains, agents, RAG pipelines, memory, callbacks. LangGraph adds stateful multi-agent workflows with cycles.

LlamaIndex

Specialises in data ingestion and RAG. Excellent document loaders, index types, and query engines. Best for knowledge-grounded apps.

Vercel AI SDK

Streaming UI components for React/Next.js. Server-side streaming, tool calling, generative UI. Perfect for web apps.

Semantic Kernel (Microsoft)

Enterprise-grade: C# and Python. Plugins, planners, and memory. Integrated with Azure AI services. Good for .NET shops.

4.2 The AI Application Stack

🖥️

Frontend

React / Next.js / Vue — streaming UI, Vercel AI SDK, markdown rendering

⚡

API Layer

FastAPI / Express — routes, auth, rate limiting, streaming SSE/WebSocket

🧠

Orchestration

LangChain / LlamaIndex — chains, agents, RAG pipelines, tool calling

🔍

Retrieval

Vector DB (Pinecone / Qdrant) + BM25 + reranker for hybrid search

🤖

LLM Provider

OpenAI / Anthropic / Google API — or self-hosted via Ollama / vLLM

📊

Observability

LangSmith / Helicone / Langfuse — trace calls, latency, costs, evals

5. The Open-Source Ecosystem

You don't need an API key to use powerful AI. Open-source models have reached remarkable quality, and tools like Ollama make running them locally as easy as ollama run llama3.

5.1 Key Open-Source Models (2025-2026)

📊 Interactive: Model Size Calculator

See how much VRAM/RAM you need to run popular open-source models at different quantization levels.

Quantization:

5.2 Running Models Locally

Ollama

Dead simple: ollama run llama3. Manages downloads, quantisation, and a local API (OpenAI-compatible). Best starting point.

llama.cpp / GGUF

The engine behind Ollama. Optimised C++ inference with Apple Metal, CUDA, and CPU support. GGUF quantised formats (Q4_K_M, Q5_K_S).

vLLM

Production-grade inference server. PagedAttention, continuous batching, tensor parallelism. OpenAI-compatible API for self-hosting.

LM Studio

GUI app for running models locally. Browse, download, and chat with GGUF models. Great for non-developers.

HuggingFace TGI

Text Generation Inference by HF. Flash Attention, quantisation, watermarking. Easy Docker deployment.

MLX (Apple)

Apple's ML framework optimised for Apple Silicon. Runs models efficiently on Mac unified memory. Growing ecosystem.

5.3 When to Use Local vs Cloud

Factor	Local / Self-hosted	Cloud API
Privacy	✅ Data stays on your hardware	⚠️ Data sent to provider
Quality	Good — 8B-70B models	✅ Frontier (GPT-4o, Claude 3.5)
Cost at scale	✅ Fixed hardware cost	Per-token (can get expensive)
Latency	✅ No network round-trip	50-200ms overhead
Setup effort	Medium (Ollama is easy)	✅ One API key
Scaling	Manual GPU provisioning	✅ Automatic

6. The Agent Architecture

An AI agent is an LLM that can take actions: call APIs, query databases, write files, browse the web, run code. The LLM serves as the "brain" that decides what to do next based on the current state and goal.

6.1 Agent Patterns

ReAct (Reason + Act)

The model alternates between reasoning ("I need to search for X") and acting (calling a search tool). Most common pattern.

Plan & Execute

First create a plan of all steps, then execute them. Better for complex multi-step tasks. Can revise the plan mid-execution.

Tool Calling

Define typed tools (functions) the model can invoke. OpenAI function calling, Anthropic tool use. The model outputs structured tool invocations.

Multi-Agent

Multiple specialised agents that collaborate: researcher, coder, reviewer. Frameworks: AutoGen, CrewAI, LangGraph.

🤖 Interactive: Agent Flowchart Builder

See how a typical AI agent processes a task through the ReAct loop.

6.2 MCP — Model Context Protocol

MCP (Anthropic, 2024) is an open protocol that standardises how AI models connect to external tools and data sources. Instead of each app implementing its own tool integrations, MCP defines a universal interface:

Resources — data the model can read (files, database records, API responses)
Tools — actions the model can take (run query, write file, send email)
Prompts — reusable prompt templates for common tasks

MCP servers are now supported in VS Code (Copilot), Cursor, Claude Desktop, and dozens of other tools. A single MCP server for Postgres, for example, lets any AI assistant query your database.

7. Production Considerations

7.1 The Prototype → Production Gap

Getting a demo working is easy. Making it reliable, fast, safe, and cost-effective at scale is where the real engineering happens:

Latency

Users expect sub-2-second responses. Use streaming, smaller models for simple tasks, caching frequent queries, and edge deployment for global low-latency.

Cost Control

Route simple queries to cheap/small models, complex ones to frontier models. Cache embeddings. Batch requests. Monitor per-query cost.

Reliability

LLMs are non-deterministic. Use structured output (JSON mode), validation, retries, fallback models, and comprehensive evals.

Safety

Guard against prompt injection, jailbreaks, and data leakage. Use input/output guardrails (NeMo Guardrails, Llama Guard), content filtering, PII detection.

7.2 Model Routing

Not every query needs GPT-4. Smart routing sends each request to the right model:

📊 Interactive: Model Routing Cost Comparison

See how routing queries to appropriate model tiers can reduce costs by 60-80%.

Monthly queries: 100K

8. The New Developer Experience

AI hasn't replaced developers — it's changed what it means to develop software. Here's what matters now:

8.1 Skills That Matter More

System design — understanding how components fit together is more important than ever when AI can generate individual components
Problem decomposition — breaking complex problems into AI-solvable chunks
Evaluation & debugging — knowing when AI output is wrong, and why
Prompt engineering — communicating clearly with AI tools to get reliable results
Domain expertise — AI amplifies what you already know. Deep knowledge in healthcare, finance, or security becomes more valuable, not less

8.2 The New Workflow

Before: Write Code

Open blank file → think → type → debug → refactor → test → commit. Most time spent on syntax and implementation details.

After: Direct Code

Describe intent → review AI output → refine → verify → test → commit. Most time spent on architecture decisions and verification.

The role is shifting from author to architect + editor. You spend more time understanding what to build and verifying it works correctly, less time on the mechanical act of writing each line.

"The best developers in 2026 aren't the fastest typists — they're the clearest thinkers."

8.3 Getting Started: Your First AI Project

If you've read all five parts and want to build something, here's a practical starting path:

Week 1: Install Ollama, run ollama run llama3, experiment with different prompts via the CLI
Week 2: Build a simple chatbot with the Ollama API (Python/FastAPI or Node/Express + a simple frontend)
Week 3: Add RAG — load a few documents into ChromaDB, query them before generating answers
Week 4: Add tool calling — let your chatbot search the web or query a database using function calling
Week 5: Deploy it — host with Docker, add authentication, try streaming responses to the frontend

Each step builds on the last, and at the end you'll have a genuinely useful AI application with RAG, tools, and a real UI.

9. The Complete Arc — Five Parts, One Story

This series has covered the full sweep of AI — from medieval logic machines to autonomous coding agents:

Part 1 — Logic to Statistics: The intellectual foundations, from Llull and Leibniz through Turing, LISP, AI winters, and the statistical revolution.
Part 2 — Perception to Prediction: Neural nets, deep learning, CNNs, RNNs, attention, transformers, BERT, GPT, and the foundation model paradigm.
Part 3 — Action to Understanding: Reinforcement learning, agents, chain-of-thought, open source vs. closed, alignment, and the AGI question.
Part 4 — Under the Hood: Tokenizers, embeddings, training at scale, LoRA, RLHF, inference optimization, prompt engineering, and multimodal models.
Part 5 — AI in the Wild: Real applications, RAG, vector databases, the developer toolkit, local models, agents, deployment, and building your first AI app.

Understanding AI isn't just about knowing what a transformer is. It's about understanding the full stack — from the math behind attention to the engineering of inference servers to the design of useful applications.

The field will keep moving. New architectures will emerge. New capabilities will surprise us. But the patterns we've covered — tokenize, embed, attend, generate, retrieve, act — will remain the foundation for a long time.

Now go build something.