How Does AI Think? Part 5 — AI in the Wild: Applications, Building Blocks & the Developer's Toolkit
📖 This is Part 5 of the "How Does AI Think?" series. Part 1 · Part 2 · Part 3 · ← Part 4: Under the Hood
Introduction — From Theory to Practice
Parts 1–4 traced the ideas, architectures, and engineering behind AI. But understanding how transformers work is different from knowing how to build something useful with them.
This final part bridges that gap. We'll explore where AI is making a real impact today, learn the building blocks every developer needs (RAG, vector databases, embeddings, agents), tour the open-source ecosystem, and build an intuition for choosing the right model and deployment strategy for your project.
Less theory, more practice. Let's go.
1. Real-World AI Applications (2024–2026)
Beyond the hype, AI is transforming specific domains in measurable ways. Here are the areas where the impact is already undeniable:
🏗️ Interactive: AI Application Gallery, showing how AI is being deployed in each domain.
1.1 The Impact Numbers
📊 Interactive: AI Market Size & Adoption
2. The RAG Pattern
LLMs have a fundamental limit: they only know what was in their training data (up to a cutoff date). They can't access your company's documents, today's news, or your personal notes — unless you give them that information at inference time.
RAG (Retrieval-Augmented Generation) solves this by combining a retriever (finds relevant documents) with a generator (the LLM that uses those documents to answer):
- Step 1: User asks a question, e.g. "What was our Q3 revenue in Asia?"
- Step 2: Embed the query. Convert the question into a vector using an embedding model.
- Step 3: Search the vector database. Find the most similar document chunks by cosine similarity.
- Step 4: Augment the prompt. Insert the retrieved chunks into the LLM's context: "Based on these documents: [chunks], answer: [question]"
- Step 5: Generate the answer. The LLM synthesises a grounded answer from the retrieved context.
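The retrieval and prompt-assembly steps can be sketched in a few lines of Python. This is a toy under stated assumptions, not a production pipeline: `embed` is a bag-of-words stand-in for a real embedding model, the three chunks are made-up sample data, and the final generation step is left out because it depends on whichever LLM you call.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, retrieved: list[str]) -> str:
    """Augment the prompt with the retrieved chunks."""
    context = "\n".join(f"- {c}" for c in retrieved)
    return f"Based on these documents:\n{context}\n\nAnswer: {query}"

# Made-up sample data, for illustration only.
CHUNKS = [
    "Q3 revenue in Asia was 4.2 million dollars, up 8 percent.",
    "The office in Berlin moved to a new building in May.",
    "Q3 revenue in Europe was 3.1 million dollars.",
]

question = "What was our Q3 revenue in Asia?"
prompt = build_prompt(question, retrieve(question, CHUNKS))
print(prompt)  # this string is what would be sent to the LLM
```

Swapping `embed` for a real embedding model and `CHUNKS` for a vector database changes the quality of retrieval, but not the shape of the pipeline.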
2.1 Why RAG Matters
- Fewer hallucinations — the model can ground its answer in actual documents and cite them, although it can still misread or ignore what was retrieved
- Up-to-date — add new documents anytime, no retraining needed
- Domain-specific — works with your internal data (legal, medical, financial)
- Auditable — you can see exactly which documents the answer came from
- Cost-effective — cheaper than fine-tuning and much cheaper than pre-training
🔎 Interactive: RAG Pipeline Simulator, showing how RAG retrieves relevant chunks from a document store for a query, then feeds them to the LLM to produce a grounded answer.
2.2 RAG Best Practices
Chunk Size Matters
Too small → missing context. Too large → noise dilutes relevance. Typical: 256–1024 tokens per chunk with 50-token overlap.
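A minimal sketch of fixed-size chunking with overlap follows. It assumes the text has already been tokenized; splitting on whitespace is used here only as a stand-in for a real tokenizer, and the function name `chunk_tokens` is illustrative.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

# Whitespace splitting is a rough approximation of model tokens.
tokens = ("word " * 1200).split()
chunks = chunk_tokens(tokens, chunk_size=512, overlap=50)
print([len(c) for c in chunks])
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of storing some tokens twice.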
Hybrid Search
Combine vector search (semantic similarity) with keyword search (BM25) for best results. Reranking with a cross-encoder further improves quality.
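One common way to merge a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). The text above does not prescribe a fusion method, so treat this as one reasonable option rather than the only one; the document IDs are made up, and `k = 60` is a conventional default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each document scores sum(1 / (k + rank))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

vector_hits = ["d2", "d1", "d3"]   # ranked by embedding similarity
keyword_hits = ["d1", "d4", "d2"]  # ranked by BM25
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# -> ['d1', 'd2', 'd4', 'd3']
```

RRF only needs ranks, so it sidesteps the problem that cosine scores and BM25 scores live on different scales. A cross-encoder reranker would then reorder the top of this fused list.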
Contextual Retrieval
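Anthropic's technique: use an LLM to write a short, chunk-specific explanation of where each chunk sits in its source document, and prepend it to the chunk before embedding and BM25 indexing. Anthropic reports a 49% reduction in retrieval failures when contextual embeddings are combined with contextual BM25.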
Evaluation
Measure retrieval quality (recall@k) and answer quality (faithfulness, relevance). Tools: RAGAS, TruLens, LangSmith for automated RAG evaluation.
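Recall@k is simple enough to compute by hand before reaching for an evaluation framework. The sketch below assumes you have, for each test query, a set of chunk IDs labelled as relevant; the IDs shown are placeholders.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k retrieved items."""
    if not relevant:
        raise ValueError("relevant must not be empty")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

retrieved = ["a", "x", "b", "y"]  # retriever output, best first
relevant = {"a", "b", "c"}        # hand-labelled ground truth
print(recall_at_k(retrieved, relevant, k=3))  # 2 of 3 relevant chunks found
```

Averaging this over a few dozen labelled queries gives a baseline to compare against when you change chunk size, embedding model, or search strategy.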
3. Vector Similarity Search
At the heart of RAG is vector search: finding the most similar vectors in a database. The query and all documents are converted to vectors by an embedding model, then similarity is measured with:
$\text{cosine\_sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}$
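Two texts about similar topics will tend to score closer to 1.0 than unrelated texts, even if they use completely different words. The absolute values depend on the embedding model, so treat the scores as a ranking rather than comparing them against a fixed threshold.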
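The formula translates directly into code. The sketch below uses plain Python lists and hand-made three-dimensional vectors for readability; real embeddings have hundreds or thousands of dimensions, and the document names are invented.

```python
import math

def cosine_sim(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two vector norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

def top_k(query: list[float], docs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Brute-force search: score every document and keep the k best."""
    ranked = sorted(docs, key=lambda name: cosine_sim(query, docs[name]), reverse=True)
    return ranked[:k]

# Hand-made toy vectors standing in for real embeddings.
docs = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.3, 0.1],
    "tax": [0.0, 0.2, 0.9],
}
print(top_k([1.0, 0.2, 0.0], docs))  # -> ['cats', 'dogs']
```

A vector database does the same job as `top_k`, but uses approximate nearest-neighbour indexes so it does not have to score every stored vector.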
🎯 Interactive: Vector Similarity Search, showing which documents are most similar to a query in embedding space, with a 2D projection of how documents cluster by topic.
3.1 Popular Vector Databases
| Database | Type | Max Vectors | Best For |
|---|---|---|---|
| Pinecone | Managed cloud | Billions | Production SaaS, zero-ops |
| Weaviate | Open source / cloud | Billions | Hybrid search, multi-modal |
| Qdrant | Open source / cloud | Billions | Filtering, payload-rich data |
| ChromaDB | Open source (local) | Millions | Prototyping, local dev |
| pgvector | PostgreSQL extension | Millions | Existing Postgres stacks |
| FAISS | Library (Meta) | Billions | Research, custom pipelines |
3.2 Embedding Models
The quality of your vector search depends on the embedding model. Modern embedding models are transformers, often relatively small ones, optimised for semantic similarity:
- OpenAI text-embedding-3-large — 3072 dimensions, excellent general performance
- Cohere embed-v3 — multilingual, supports semantic search and classification
- BGE / GTE (open source) — competitive with commercial models, run locally
- Nomic Embed — open source, 8K context, strong performance on MTEB benchmark
- Jina Embeddings v3 — multilingual, 8K context, multiple task adapters
4. Building AI Applications
The ecosystem of tools for building AI-powered applications has exploded. Here's the landscape:
4.1 LLM Frameworks
LangChain / LangGraph
The Swiss Army knife: chains, agents, RAG pipelines, memory, callbacks. LangGraph adds stateful multi-agent workflows with cycles.
LlamaIndex
Specialises in data ingestion and RAG. Excellent document loaders, index types, and query engines. Best for knowledge-grounded apps.
Vercel AI SDK
Streaming UI components for React/Next.js. Server-side streaming, tool calling, generative UI. Perfect for web apps.
Semantic Kernel (Microsoft)
Enterprise-grade: C# and Python. Plugins, planners, and memory. Integrated with Azure AI services. Good for .NET shops.
4.2 The AI Application Stack
Frontend
React / Next.js / Vue — streaming UI, Vercel AI SDK, markdown rendering
API Layer
FastAPI / Express — routes, auth, rate limiting, streaming SSE/WebSocket
Orchestration
LangChain / LlamaIndex — chains, agents, RAG pipelines, tool calling
Retrieval
Vector DB (Pinecone / Qdrant) + BM25 + reranker for hybrid search
LLM Provider
OpenAI / Anthropic / Google API — or self-hosted via Ollama / vLLM
Observability
LangSmith / Helicone / Langfuse — trace calls, latency, costs, evals
5. The Open-Source Ecosystem
You don't need an API key to use powerful AI. Open-source models have reached remarkable quality, and tools like Ollama make running them locally as easy as `ollama run llama3`.
5.1 Key Open-Source Models (2025-2026)
📊 Interactive: Model Size Calculator
See how much VRAM/RAM you need to run popular open-source models at different quantization levels.
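The arithmetic behind such a calculator is short: weight memory is roughly parameters × bits per weight ÷ 8. The sketch below adds a flat 20% overhead for the KV cache and activations; that figure is a rough assumption, not a measured value, and real usage grows with context length and batch size.

```python
def estimate_memory_gb(params_billion: float, bits_per_weight: float, overhead: float = 0.2) -> float:
    """Rough memory estimate in GB (1 GB = 1e9 bytes) for running a model."""
    weights_gb = params_billion * bits_per_weight / 8  # billions of params * bytes per param
    return weights_gb * (1 + overhead)                 # assumed headroom for KV cache etc.

for name, size in [("8B", 8), ("70B", 70)]:
    for bits in (16, 8, 4):
        print(f"{name} at {bits}-bit: ~{estimate_memory_gb(size, bits):.1f} GB")
```

Quantised formats usually store slightly more than their nominal bit width per weight, so treat the 4-bit figures as a lower bound.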
5.2 Running Models Locally
Ollama
Dead simple: `ollama run llama3`. Manages downloads, quantisation, and a local API (OpenAI-compatible). Best starting point.
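Calling the local API needs nothing beyond the standard library. The sketch below uses Ollama's native `/api/generate` endpoint rather than the OpenAI-compatible one, and assumes Ollama is running on its default port 11434 with the `llama3` model already pulled. The helper names `build_request` and `generate` are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address

def build_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt: str, model: str = "llama3") -> str:
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `print(generate("Why is the sky blue?"))` prints the model's reply. Setting `"stream": True` instead returns one JSON object per line as tokens arrive, which is what you would forward to a streaming frontend.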
llama.cpp / GGUF
The engine behind Ollama. Optimised C++ inference with Apple Metal, CUDA, and CPU support. GGUF quantised formats (Q4_K_M, Q5_K_S).
vLLM
Production-grade inference server. PagedAttention, continuous batching, tensor parallelism. OpenAI-compatible API for self-hosting.
LM Studio
GUI app for running models locally. Browse, download, and chat with GGUF models. Great for non-developers.
HuggingFace TGI
Text Generation Inference by HF. Flash Attention, quantisation, watermarking. Easy Docker deployment.
MLX (Apple)
Apple's ML framework optimised for Apple Silicon. Runs models efficiently on Mac unified memory. Growing ecosystem.
5.3 When to Use Local vs Cloud
| Factor | Local / Self-hosted | Cloud API |
|---|---|---|
| Privacy | ✅ Data stays on your hardware | ⚠️ Data sent to provider |
| Quality | Good — 8B-70B models | ✅ Frontier (GPT-4o, Claude 3.5) |
| Cost at scale | ✅ Fixed hardware cost | Per-token (can get expensive) |
| Latency | ✅ No network round-trip | 50-200ms overhead |
| Setup effort | Medium (Ollama is easy) | ✅ One API key |
| Scaling | Manual GPU provisioning | ✅ Automatic |
6. The Agent Architecture
An AI agent is an LLM that can take actions: call APIs, query databases, write files, browse the web, run code. The LLM serves as the "brain" that decides what to do next based on the current state and goal.
6.1 Agent Patterns
ReAct (Reason + Act)
The model alternates between reasoning ("I need to search for X") and acting (calling a search tool). Most common pattern.
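The loop itself is small. In this sketch the LLM is replaced by `scripted_model`, a hard-coded stand-in, so the control flow can be run without any API; the `Action: tool[input]` text format and the `calculator` tool are illustrative assumptions, not a standard.

```python
import re

def calculator(expression: str) -> str:
    """Hypothetical tool: adds integers written like '19 + 23'."""
    return str(sum(int(part) for part in expression.split("+")))

TOOLS = {"calculator": calculator}
ACTION_RE = re.compile(r"Action: (\w+)\[(.*)\]")

def scripted_model(transcript: str) -> str:
    """Stand-in for an LLM: reasons, then acts, then answers once it has an observation."""
    if "Observation:" not in transcript:
        return "Thought: I need to add the numbers.\nAction: calculator[19 + 23]"
    observation = transcript.rsplit("Observation:", 1)[1].strip()
    return f"Thought: I have the result.\nFinal Answer: {observation}"

def react(question: str, model, max_steps: int = 5) -> str:
    """Alternate model reasoning and tool calls until a final answer appears."""
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        reply = model(transcript)
        transcript += "\n" + reply
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        match = ACTION_RE.search(reply)
        if match is None:
            raise ValueError(f"model reply had no action: {reply!r}")
        tool_name, tool_input = match.groups()
        observation = TOOLS[tool_name](tool_input)       # act
        transcript += f"\nObservation: {observation}"    # feed the result back
    raise RuntimeError("agent did not finish within max_steps")

print(react("What is 19 + 23?", scripted_model))  # -> 42
```

The `max_steps` cap matters in practice: without it, a model that never emits a final answer would loop forever and keep spending tokens.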
Plan & Execute
First create a plan of all steps, then execute them. Better for complex multi-step tasks. Can revise the plan mid-execution.
Tool Calling
Define typed tools (functions) the model can invoke. OpenAI function calling, Anthropic tool use. The model outputs structured tool invocations.
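On the application side, tool calling comes down to a registry of functions with JSON Schema descriptions, plus a dispatcher that validates what the model asked for. The shape below is provider-neutral and illustrative: the exact field names differ between vendors, and `get_weather` returns canned data rather than calling a real service.

```python
import json

def get_weather(city: str, unit: str = "celsius") -> dict:
    """Hypothetical tool with canned data instead of a real weather API."""
    return {"city": city, "temperature": 21, "unit": unit}

TOOLS = {
    "get_weather": {
        "function": get_weather,
        "description": "Get the current weather for a city.",
        "parameters": {  # JSON Schema describing the arguments
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    }
}

def dispatch(tool_call_json: str) -> dict:
    """Validate a model-produced tool call and run the matching function."""
    call = json.loads(tool_call_json)
    spec = TOOLS.get(call.get("name"))
    if spec is None:
        raise ValueError(f"unknown tool: {call.get('name')!r}")
    args = call.get("arguments", {})
    schema = spec["parameters"]
    missing = [p for p in schema["required"] if p not in args]
    unknown = [a for a in args if a not in schema["properties"]]
    if missing or unknown:
        raise ValueError(f"bad arguments: missing={missing}, unknown={unknown}")
    return spec["function"](**args)

# What a model's structured tool invocation might look like:
print(dispatch('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))
```

Validating before executing is the important part: the model's output is untrusted input, so unknown tools and unexpected arguments should be rejected rather than passed through.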
Multi-Agent
Multiple specialised agents that collaborate: researcher, coder, reviewer. Frameworks: AutoGen, CrewAI, LangGraph.
🤖 Interactive: Agent Flowchart Builder
See how a typical AI agent processes a task through the ReAct loop.
6.2 MCP — Model Context Protocol
MCP (Anthropic, 2024) is an open protocol that standardises how AI models connect to external tools and data sources. Instead of each app implementing its own tool integrations, MCP defines a universal interface:
- Resources — data the model can read (files, database records, API responses)
- Tools — actions the model can take (run query, write file, send email)
- Prompts — reusable prompt templates for common tasks
MCP servers are now supported in VS Code (Copilot), Cursor, Claude Desktop, and dozens of other tools. A single MCP server for Postgres, for example, lets any AI assistant query your database.
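On the wire, MCP messages are JSON-RPC 2.0 requests and responses. The sketch below only builds the request payloads for listing tools, calling a tool, and reading a resource; it omits the initialization handshake and the transport, and the tool name `query`, its `sql` argument, and the file URI are hypothetical examples rather than part of any particular server.

```python
import json
from itertools import count
from typing import Optional

_ids = count(1)  # JSON-RPC requests carry an id so responses can be matched up

def jsonrpc_request(method: str, params: Optional[dict] = None) -> str:
    """Serialize a JSON-RPC 2.0 request."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)

# Discover which tools a server offers.
list_tools = jsonrpc_request("tools/list")

# Invoke a tool (hypothetical name and arguments).
call_tool = jsonrpc_request(
    "tools/call",
    {"name": "query", "arguments": {"sql": "SELECT 1"}},
)

# Read a resource by URI (hypothetical URI).
read_resource = jsonrpc_request("resources/read", {"uri": "file:///notes/todo.txt"})

print(call_tool)
```

In practice you would use an MCP SDK rather than hand-building messages, but seeing the payloads makes clear why one server can serve many different AI clients.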
7. Production Considerations
7.1 The Prototype → Production Gap
Getting a demo working is easy. Making it reliable, fast, safe, and cost-effective at scale is where the real engineering happens:
Latency
Users expect sub-2-second responses. Stream output as it is generated, use smaller models for simple tasks, cache frequent queries, and deploy at the edge to keep latency low for users worldwide.
Cost Control
Route simple queries to cheap/small models, complex ones to frontier models. Cache embeddings. Batch requests. Monitor per-query cost.
Reliability
LLMs are non-deterministic. Use structured output (JSON mode), validation, retries, fallback models, and comprehensive evals.
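A minimal sketch of validation, retries, and a fallback model follows. Models are represented as plain Python callables that take a prompt and return text, which is an assumption made so the example runs offline; `flaky_model` and `fallback_model` are fakes.

```python
import json

def parse_and_validate(raw: str, required_keys: set[str]) -> dict:
    """Parse model output as JSON and check that the expected keys are present."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = required_keys - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data

def call_with_retries(models, prompt: str, required_keys: set[str], attempts_per_model: int = 2) -> dict:
    """Try each model in order, retrying on invalid output before falling back."""
    errors = []
    for model in models:
        for _ in range(attempts_per_model):
            try:
                return parse_and_validate(model(prompt), required_keys)
            except ValueError as exc:
                errors.append(str(exc))
    raise RuntimeError(f"all models failed: {errors}")

def flaky_model(prompt: str) -> str:
    """Fake primary model that ignores the requested format."""
    return "Sure! Here is the JSON you asked for."

def fallback_model(prompt: str) -> str:
    """Fake fallback model that returns valid JSON."""
    return '{"sentiment": "positive", "score": 0.9}'

result = call_with_retries(
    [flaky_model, fallback_model],
    "Classify the sentiment of: great product",
    {"sentiment", "score"},
)
print(result)
```

Logging the collected `errors` is worth doing even when a later attempt succeeds, since a rising retry rate is often the first sign that a prompt or model version has drifted.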
Safety
Guard against prompt injection, jailbreaks, and data leakage. Use input/output guardrails (NeMo Guardrails, Llama Guard), content filtering, PII detection.
7.2 Model Routing
Not every query needs GPT-4. Smart routing sends each request to the right model:
📊 Interactive: Model Routing Cost Comparison
See how routing queries to appropriate model tiers can reduce costs by 60-80%.
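The cost effect of routing can be estimated with a few lines of arithmetic. Everything numeric here is made up for illustration: the two price points, the keyword heuristic in `choose_tier`, and the 80/20 query mix. Real routers often use a small classifier model instead of keywords.

```python
# Hypothetical prices in dollars per 1,000 tokens.
PRICE_PER_1K_TOKENS = {"small": 0.0005, "frontier": 0.01}

HARD_MARKERS = ("explain why", "prove", "refactor", "step by step")

def choose_tier(query: str) -> str:
    """Naive heuristic router: long or reasoning-heavy queries go to the frontier tier."""
    is_long = len(query.split()) > 50
    if is_long or any(marker in query.lower() for marker in HARD_MARKERS):
        return "frontier"
    return "small"

def total_cost(queries: list[str], router=choose_tier, tokens_per_query: int = 1000) -> float:
    """Total cost if every query uses tokens_per_query tokens on its routed tier."""
    return sum(PRICE_PER_1K_TOKENS[router(q)] * tokens_per_query / 1000 for q in queries)

queries = (
    ["What are your opening hours?"] * 8
    + ["Explain why the migration failed and refactor the job step by step."] * 2
)

routed = total_cost(queries)
all_frontier = total_cost(queries, router=lambda q: "frontier")
print(f"routed: ${routed:.3f}, all frontier: ${all_frontier:.3f}")
print(f"savings: {100 * (1 - routed / all_frontier):.0f}%")
```

The savings depend entirely on the price gap and on how many queries are genuinely simple, so measure your own traffic mix before relying on any headline percentage.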
8. The New Developer Experience
AI hasn't replaced developers — it's changed what it means to develop software. Here's what matters now:
8.1 Skills That Matter More
- System design — understanding how components fit together is more important than ever when AI can generate individual components
- Problem decomposition — breaking complex problems into AI-solvable chunks
- Evaluation & debugging — knowing when AI output is wrong, and why
- Prompt engineering — communicating clearly with AI tools to get reliable results
- Domain expertise — AI amplifies what you already know. Deep knowledge in healthcare, finance, or security becomes more valuable, not less
8.2 The New Workflow
Before: Write Code
Open blank file → think → type → debug → refactor → test → commit. Most time spent on syntax and implementation details.
After: Direct Code
Describe intent → review AI output → refine → verify → test → commit. Most time spent on architecture decisions and verification.
The role is shifting from author to architect + editor. You spend more time understanding what to build and verifying it works correctly, less time on the mechanical act of writing each line.
8.3 Getting Started: Your First AI Project
If you've read all five parts and want to build something, here's a practical starting path:
- Week 1: Install Ollama, run `ollama run llama3`, experiment with different prompts via the CLI
- Week 2: Build a simple chatbot with the Ollama API (Python/FastAPI or Node/Express + a simple frontend)
- Week 3: Add RAG — load a few documents into ChromaDB, query them before generating answers
- Week 4: Add tool calling — let your chatbot search the web or query a database using function calling
- Week 5: Deploy it — host with Docker, add authentication, try streaming responses to the frontend
Each step builds on the last, and at the end you'll have a genuinely useful AI application with RAG, tools, and a real UI.
9. The Complete Arc — Five Parts, One Story
This series has covered the full sweep of AI — from medieval logic machines to autonomous coding agents:
- Part 1 — Logic to Statistics: The intellectual foundations, from Llull and Leibniz through Turing, LISP, AI winters, and the statistical revolution.
- Part 2 — Perception to Prediction: Neural nets, deep learning, CNNs, RNNs, attention, transformers, BERT, GPT, and the foundation model paradigm.
- Part 3 — Action to Understanding: Reinforcement learning, agents, chain-of-thought, open source vs. closed, alignment, and the AGI question.
- Part 4 — Under the Hood: Tokenizers, embeddings, training at scale, LoRA, RLHF, inference optimization, prompt engineering, and multimodal models.
- Part 5 — AI in the Wild: Real applications, RAG, vector databases, the developer toolkit, local models, agents, deployment, and building your first AI app.
Understanding AI isn't just about knowing what a transformer is. It's about understanding the full stack — from the math behind attention to the engineering of inference servers to the design of useful applications.
The field will keep moving. New architectures will emerge. New capabilities will surprise us. But the patterns we've covered — tokenize, embed, attend, generate, retrieve, act — will remain the foundation for a long time.
Now go build something.
Further Reading & Resources
- Lewis et al. — "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (RAG paper, 2020)
- Anthropic — "Contextual Retrieval" blog post (2024)
- LangChain docs — docs.langchain.com
- LlamaIndex docs — docs.llamaindex.ai
- Ollama — ollama.ai
- HuggingFace Open LLM Leaderboard — model benchmarks and comparisons
- Anthropic MCP — modelcontextprotocol.io
- Simon Willison's blog — essential reading for practical AI development
- The Batch (Andrew Ng) — weekly AI newsletter with accessible explanations