What are Foundation Models?

The massive, general-purpose AI models trained on everything — and how they became the platform layer of modern AI.

6 min read

In 2018, if you wanted AI to do something, you built a model for that specific thing. One model for translation. Another for summarization. Another for sentiment analysis. Each trained from scratch.

Now? You take one giant model and adapt it to do all of those things.

Foundation models are massive AI models trained on broad data that serve as the base for many different tasks.

GPT-4, Claude, Gemini, Llama — these are all foundation models.

Why "foundation"?

The term comes from a 2021 Stanford paper. The metaphor is architectural: these models are the foundation upon which you build everything else.

┌─────────────────────────────────────────────────────────────┐ │ │ │ THE OLD WAY THE FOUNDATION WAY │ │ ━━━━━━━━━━━ ━━━━━━━━━━━━━━━━━━ │ │ │ │ Translation Model ┌─────────────────┐ │ │ Summarization Model │ Foundation Model │ │ │ Sentiment Model └───────┬─────────┘ │ │ Q&A Model ┌───────┼─────────┐ │ │ Code Model ▼ ▼ ▼ │ │ Translate Summarize Code │ │ 5 models, trained Sentiment Q&A Anything │ │ separately, from scratch │ │ 1 model, adapted to many │ │ │ └─────────────────────────────────────────────────────────────┘

Instead of building from scratch every time, you start with a model that already understands language (or images, or code) at a deep level. Then you steer it toward your specific task.

What makes them "foundation"

Three things distinguish foundation models from regular AI models:

1. Scale of training

Foundation models are trained on massive, diverse datasets. Not "a million customer reviews" — more like "a significant fraction of the internet."

GPT-3's training data included:

Books (hundreds of thousands)
Wikipedia (all of it)
Web pages (billions)
Code repositories
Academic papers

Total: ~300 billion words

That's roughly 570GB of text. If you read 24/7, it would take you about 26,000 years to read it all.

2. General capability

A spam filter is great at detecting spam. It's useless at writing poetry. Foundation models can do both — and thousands of other tasks — because they learned from such diverse data.

This is called emergent capability. Individual abilities that weren't explicitly trained for just... appear. Train a model on enough text and it can suddenly do math, write code, translate between languages, and explain quantum physics. Nobody programmed these abilities in.

3. Adaptability

Foundation models are designed to be adapted. You can:

Prompt them — just ask in natural language
Fine-tune themFine-tuningCustomizing a pre-trained AI model on specific data to improve its performance for a particular task.Click to learn more → — train them further on your specific data
Build on top of them — use them as an API in your applications

This adaptability is why they're called "foundations." They're not the final product. They're the starting point.

The economics changed everything

Before foundation models, training a state-of-the-art AI was expensive but doable for a well-funded research lab. Now?

Training GPT-4 reportedly cost over $100 million. Training the next generation will likely cost billions.

This created a new dynamic in AI:

┌─────────────────────────────────────────────────────────────┐ │ │ │ WHO BUILDS WHAT │ │ ━━━━━━━━━━━━━━━ │ │ │ │ 🏢 A few companies train foundation models │ │ (OpenAI, Google, Anthropic, Meta) │ │ Cost: $100M - $1B+ │ │ │ │ 🏗️ Thousands of companies fine-tune them │ │ (for legal, medical, finance, etc.) │ │ Cost: $1K - $100K │ │ │ │ 🚀 Millions of developers build apps on top │ │ (chatbots, search, writing tools, agents) │ │ Cost: Pay per API call │ │ │ └─────────────────────────────────────────────────────────────┘

Foundation models became a platform layer — like operating systems for AI.

Types of foundation models

Large Language Models (LLMs)

Trained primarily on text. GPT-4, Claude, Llama, Gemini. They understand and generate human language.

Vision-Language Models

Handle both text and images. GPT-4V, Gemini, Claude 3. They can describe images, answer questions about photos, and reason across modalities.

Image Generation Models

Trained on text-image pairs. Stable Diffusion, DALL-E, Midjourney. They generate images from text descriptions.

Code Models

Trained heavily on code repositories. GitHub Copilot (based on Codex), Code Llama, DeepSeek Coder. They write, explain, and debug code.

Audio Models

Handle speech and music. Whisper (speech-to-text), MusicGen (music generation). They process and create audio.

The controversy

Foundation models aren't universally loved. Here's what critics worry about:

Concentration of power. Only a handful of companies can afford to train them. This gives those companies enormous influence over the direction of AI.

Training data issues. These models learn from the internet — including copyrighted books, artwork, personal data, and misinformation. Legal battles are ongoing.

Environmental cost. Training a single large model can emit as much carbon as five cars over their entire lifetimes. And that's just the training — inference (using the model) costs energy too.

Homogenization. If everyone builds on the same few foundation models, their biases and limitations propagate everywhere. One model's blind spots become the industry's blind spots.

Opacity. Nobody fully understands what these models learned or why they behave the way they do. That's concerning when they're being used for medical diagnosis, legal advice, and hiring decisions.

Open vs. closed

A major debate in the foundation model world:

Closed models (GPT-4, Claude, Gemini): Only available through APIs. The companies control access, pricing, and updates. You can't see the model weights or modify the model directly.

Open modelsOpen Source AIAI models and systems whose code, weights, or architecture are publicly available.Click to learn more → (Llama, Mistral, Falcon): Released publicly. Anyone can download, modify, fine-tune, and deploy them. You control your own infrastructure.

Both approaches have tradeoffs. Closed models are often more capable but create vendor lock-in. Open models give you control but require more technical expertise to deploy.

What makes a good foundation model?

Capability: Can it perform well across a wide range of tasks?

Reliability: Does it give consistent, accurate outputs?

Efficiency: How much compute does it need for inference?

Safety: Does it avoid harmful outputs? Can it be steered?

Adaptability: How well does it respond to fine-tuning and prompting?

The best foundation models balance all of these. Raw capability without safety is dangerous. Safety without capability is useless.

The future

Foundation models are evolving fast:

Multimodal by default: Future models will natively handle text, images, audio, video, and code in a single model
Longer memory: Current context windows keep growing — from 4K tokens to 200K and beyond
Agent capabilities: Models that can use tools, browse the web, write code, and take actions in the real world
Smaller and smarter: Distillation and efficiency techniques are making smaller models surprisingly capable
Specialized foundations: Models trained specifically for medicine, law, science, and other domains

The bottom line: Foundation models are the base layer of modern AI. A few companies build them, thousands fine-tune them, and millions build applications on top. They're not perfect — they're expensive, opaque, and controversial — but they've fundamentally changed what's possible with AI.

Foundation models are powerful but generic. To make them work for specific tasks, you need to adapt them. Learn how: What is Fine-Tuning?

Get new explanations in your inbox

Every Tuesday and Friday. No spam, just AI clarity.