How does AI training actually work?
Everyone says AI 'learns from data.' But what does that actually mean? Here's what happens during training, no PhD required.
"The model was trained on the internet."
You've heard this. But what does it mean? Did the AI read websites like you read books? Does it remember them?
Not exactly. Training is weirder and simpler than that.
The setup: a very dumb student
Before training, a neural network is just a bunch of random numbers. Millions or billions of them, called parameters or weights.
These numbers determine how the network transforms input into output. Random numbers = random garbage output.
Training is the process of adjusting these numbers until the output is useful.
The core loop: guess, check, adjust
Training works like this:
1. Show it an example
Give the model some input. For a language model: "The cat sat on the ___"
2. It makes a guess
The model processes the input through its layers and outputs a prediction: "elephant" (remember, it starts dumb).
3. Check how wrong it was
The correct answer was "mat." The model said "elephant." That's very wrong.
We calculate a "loss": a number representing how bad the prediction was. Higher loss = more wrong.
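The loss from the "mat" example can be sketched in a few lines. The probabilities below are invented for illustration; real language models score tens of thousands of candidate words, but the idea is the same: the loss is high when the model puts little probability on the right answer.

```python
import math

# Made-up probabilities the model might assign to the next word
# after "The cat sat on the ___":
predicted = {"elephant": 0.40, "mat": 0.05, "floor": 0.30, "couch": 0.25}

# A common loss (cross-entropy): -log of the probability given to
# the correct answer. Low probability on "mat" -> high loss.
loss = -math.log(predicted["mat"])
print(round(loss, 2))  # → 3.0 — very wrong

# If the model had put 90% on "mat", the loss would be tiny:
print(round(-math.log(0.9), 2))  # → 0.11
```

Training pushes the loss down, which is the same as pushing probability toward the correct answers.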
4. Figure out who's to blame
Here's the clever part: backpropagation.
We trace backward through the network, asking at each step: "Which weights contributed to this wrong answer? How should each one change to make the answer less wrong?"
This gives us a direction to adjust each weight. Make this one a bit bigger, that one a bit smaller.
5. Nudge the weights
We adjust all the weights slightly in the direction that reduces the error. This is gradient descent: following the slope of the error toward better answers.
6. Repeat. A lot.
Do this billions of times with billions of examples. The weights gradually shift from random garbage to useful patterns.
What the model "learns"
After training, what's in the model?
Not the training data. The model doesn't "remember" specific examples like a database.
Instead, the weights encode patterns, statistical regularities in the data:
- "After 'the cat sat on the', words like 'mat', 'floor', 'couch' are common"
- "Sentences usually have a subject, then a verb"
- "The word 'Paris' often appears near 'France', 'Eiffel', 'city'"
The model is a compressed, approximate representation of patterns in language.
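One crude way to see "statistical regularities" is to just count continuations in a corpus. The tiny corpus below is invented, and real models store patterns in weights rather than in count tables, but the flavor is the same: frequent continuations become high-probability predictions.

```python
from collections import Counter

# A made-up four-sentence "corpus":
corpus = [
    "the cat sat on the mat",
    "the cat sat on the floor",
    "the cat sat on the mat",
    "the dog sat on the couch",
]

# Which words follow the context "sat on the", and how often?
context = "sat on the"
next_words = Counter(
    sentence.split()[-1] for sentence in corpus if context in sentence
)
print(next_words.most_common())  # → [('mat', 2), ('floor', 1), ('couch', 1)]
```

A trained model is doing something far richer than this table, across billions of contexts at once, but it is still a compressed summary of "what tends to follow what."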
Scale changes everything
A small model might learn: "dog" and "cat" are both animals.
A large model learns: the subtle difference in how people talk about dogs vs cats, that "dog person" and "cat person" are personality types, that dogs appear in loyalty contexts and cats in independence contexts.
More parameters = more nuance. More training data = more patterns.
GPT-4 reportedly has over a trillion parameters, trained on trillions of words. It picked up patterns humans didn't even know existed.
The training data question
"Trained on the internet" means:
- Web pages (filtered for quality)
- Books (lots of them)
- Wikipedia
- Code repositories
- Academic papers
- Social media posts
The model saw a lot of human text. Good and bad. True and false.
This is why AI can write poetry AND generate misinformation. It learned patterns from all of it.
What training costs
Training a large model requires:
- Thousands of GPUs (chips specialized for the parallel math AI needs) running for months
- Massive datasets (terabytes of text)
- Electricity (enough to power a small town)
- Expertise (the teams are small but elite)
GPT-4 reportedly cost $100+ million to train. Newer models cost more.
This is why only a few organizations can make frontier models: OpenAI, Google, Meta, Anthropic, and a few others.
Training vs fine-tuning vs prompting
Pre-training: The expensive part. Learning general patterns from massive data.
Fine-tuning: Taking a pre-trained model and training it more on specific data. Cheaper. Makes it better at particular tasks.
Prompting: No training at all. Just asking the model in clever ways. Free, instant, limited.
Most people only do prompting. Some companies fine-tune. Very few pre-train from scratch.
The weird truth
Here's what surprises people: we don't fully understand why training works.
We know the math. We know the procedure. But why do billions of random numbers, adjusted by simple rules, produce something that can write essays and code?
There are theories. But nobody has a complete answer.
The models are trained, not designed. They emerge from the process.
What training can't do
Training can only teach patterns that exist in the data. The model can't:
- Learn facts that aren't in training data
- Reason about things it never saw examples of
- Update its knowledge after training ends
This is why models hallucinate. They generate pattern-plausible text, not verified truth.
Training is how AI gains capabilities. But what happens when you actually use the trained model? That's inference, and it has its own interesting challenges.