Most software engineers now work alongside LLMs. Understanding the mechanics, not just the API, makes you a more effective builder. This article cuts through the hype and explains what's actually happening inside a large language model.

Step 1: Tokenization

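LLMs don't see characters or words; they see tokens. On average, a token is roughly 4 characters of English text. GPT-4's tokenizer uses Byte-Pair Encoding (BPE): starting from raw bytes, the most frequent adjacent sequences are merged repeatedly until the vocabulary reaches about 100,000 entries. Common words become single tokens, and anything unseen still breaks down into smaller pieces, so any text can be encoded.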
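To make the merge idea concrete, here is a minimal toy sketch of BPE training. It is an illustration under simplifying assumptions, not GPT-4's tokenizer: it works on characters rather than bytes, splits on whitespace only, and the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) are made up for this example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Ties are resolved in favour of the pair that was seen first.
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules from a whitespace-separated corpus."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

merges, segmented = train_bpe("low low low lower lowest", num_merges=2)
print(merges)     # learned merge rules, in order
print(segmented)  # how each word is now split into symbols
```

After two merges the frequent word "low" has become a single symbol, while "lower" and "lowest" are still split into "low" plus leftover characters. Production tokenizers apply the same idea with tens of thousands of merges.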

Why tokens matter for cost

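The GPT-4 API is billed per token, for both input and output. "The quick brown fox" is about 4 tokens with GPT-4's tokenizer. A dense 10-page PDF can run to roughly 8,000 tokens. Code tends to tokenize less efficiently than prose, so budget for more tokens per character when sending source files.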
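A back-of-the-envelope estimate can be sketched in a few lines. This assumes the rough 4-characters-per-token heuristic mentioned earlier, which can be off by a token or more on short strings and is less reliable for code. The price constant is a placeholder, not a real GPT-4 price; check the provider's current pricing page before relying on any number.

```python
import math

CHARS_PER_TOKEN = 4  # rough heuristic for English prose, not an exact rule

def estimate_tokens(text: str) -> int:
    """Approximate token count from character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def estimate_cost(num_tokens: int, price_per_1k_tokens: float) -> float:
    """Cost in the same currency unit as `price_per_1k_tokens`."""
    return num_tokens / 1000 * price_per_1k_tokens

# Hypothetical price, chosen only to make the arithmetic easy to follow.
HYPOTHETICAL_PRICE_PER_1K = 0.01

prompt = "x" * 400                      # stand-in for a 400-character prompt
prompt_tokens = estimate_tokens(prompt)  # 400 / 4 = 100 tokens
pdf_cost = estimate_cost(8_000, HYPOTHETICAL_PRICE_PER_1K)

print(prompt_tokens)
print(pdf_cost)
```

For exact counts, run the text through the actual tokenizer for the model you are calling rather than relying on the character heuristic.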

Step 2: Embeddings

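Each token ID maps to a high-dimensional vector (e.g., 12,288 dimensions in the largest GPT-3 model). These embeddings encode semantic meaning: "king" and "queen" typically end up closer in this space than "king" and "table". The embedding table is a learned weight matrix, so the model picks up these relationships during pre-training rather than having them hand-coded.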
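The lookup and the notion of "closer" can be illustrated with a tiny sketch. The vocabulary, the 4-dimensional vectors, and the helper names below are all invented for illustration; real embeddings are learned and have thousands of dimensions. Cosine similarity is used here as one common way to measure closeness.

```python
import math

# Hypothetical vocabulary and hand-picked embedding matrix (one row per token ID).
vocab = {"king": 0, "queen": 1, "table": 2}
embedding_matrix = [
    [0.9, 0.8, 0.1, 0.0],  # row 0: "king"
    [0.8, 0.9, 0.2, 0.0],  # row 1: "queen"
    [0.0, 0.1, 0.9, 0.8],  # row 2: "table"
]

def embed(token: str):
    """Token -> token ID -> row of the embedding matrix."""
    return embedding_matrix[vocab[token]]

def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_king_queen = cosine_similarity(embed("king"), embed("queen"))
sim_king_table = cosine_similarity(embed("king"), embed("table"))

print(f"king vs queen: {sim_king_queen:.3f}")
print(f"king vs table: {sim_king_table:.3f}")
```

With these made-up vectors, "king" and "queen" score much higher than "king" and "table", which mirrors the geometry a trained model tends to learn.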

Step 3: Self-Attention (The Core Mechanism)

Self-attention lets every token "look at" every other token and decide how much to borrow from each. For each token, we compute three vectors:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What will I contribute if selected?"

Python (simplified)

import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / d_k**0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights
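
To see what the `mask` argument does, the same scaled dot-product computation can be traced on plain Python lists. This is a dependency-free sketch for a 3-token sequence, not a replacement for the PyTorch version. It assumes a causal mask (each token may only look at itself and earlier tokens) and, for brevity, reuses the same toy vectors as Q, K, and V, whereas a real model derives them from learned projections. The names `softmax` and `attention_lists` are invented for this example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_lists(Q, K, V, causal=False):
    """Scaled dot-product attention on lists of row vectors."""
    d_k = len(Q[0])
    outputs, all_weights = [], []
    for i, q in enumerate(Q):
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Positions after i get a very negative score, so softmax gives them ~0.
            scores = [s if j <= i else -1e9 for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Output is the weighted sum of the value vectors.
        out = [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
outputs, weights = attention_lists(x, x, x, causal=True)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and with the causal mask the first token puts all of its weight on itself because it has nothing earlier to attend to.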

Pre-training: Next Token Prediction

LLMs are trained on trillions of tokens to predict the next token given all previous tokens. This simple objective, at massive scale, forces the model to learn grammar, facts, reasoning patterns, and code — all implicitly encoded in the weights.
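
The objective itself is small enough to sketch. The snippet below shows how one token sequence yields several (context, target) training pairs, and how cross-entropy scores a prediction. The token IDs, the 8-entry vocabulary, and the logits are invented for illustration; in a real model the logits come from the network.

```python
import math

def next_token_pairs(token_ids):
    """Every prefix of the sequence is a context; the token after it is the target."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

def cross_entropy(logits, target_id):
    """Negative log-probability of the target under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_id]

tokens = [5, 2, 7, 1]  # hypothetical token IDs
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)

# A model that has learned nothing spreads probability evenly over 8 tokens.
uniform_logits = [0.0] * 8
loss_uniform = cross_entropy(uniform_logits, target_id=2)

# A model that strongly favours the correct token gets a much lower loss.
confident_logits = [0.0, 0.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0]
loss_confident = cross_entropy(confident_logits, target_id=2)

print(round(loss_uniform, 4), round(loss_confident, 4))
```

Training lowers this loss averaged over every position in the corpus, which is why one pass over a document produces a learning signal at each token.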

RLHF: Making Models Helpful

Raw pre-trained models are great at completion but bad at following instructions. Reinforcement Learning from Human Feedback (RLHF) fixes this in three steps:

  1. Supervised Fine-Tuning (SFT): Train on expert-written (prompt, response) pairs.
  2. Reward Model Training: Human raters rank multiple responses; a reward model learns to predict human preference.
  3. PPO Fine-Tuning: Use the reward model's score as the reward signal and further tune the LLM with Proximal Policy Optimization (PPO), a reinforcement learning algorithm.
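
Steps 2 and 3 can be sketched numerically. The article does not specify the loss functions, so the snippet below assumes two formulations that are commonly used in RLHF write-ups: a pairwise (Bradley-Terry style) loss for the reward model, and a reward shaped with a KL-style penalty that keeps the tuned policy close to the SFT model. All function names and numbers are illustrative.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Step 2 (assumed form): low when the preferred response scores higher."""
    return -log_sigmoid(reward_chosen - reward_rejected)

def kl_shaped_reward(reward: float, logp_policy: float,
                     logp_reference: float, beta: float = 0.1) -> float:
    """Step 3 (assumed form): penalise drifting away from the reference model."""
    return reward - beta * (logp_policy - logp_reference)

loss_tied = pairwise_reward_loss(0.0, 0.0)   # reward model has no opinion
loss_right = pairwise_reward_loss(2.0, 0.0)  # agrees with the human ranking
loss_wrong = pairwise_reward_loss(0.0, 2.0)  # disagrees with the human ranking

shaped = kl_shaped_reward(1.0, logp_policy=-5.0, logp_reference=-7.0, beta=0.1)

print(round(loss_tied, 4), round(loss_right, 4), round(loss_wrong, 4))
print(round(shaped, 4))
```

The reward model is pushed toward agreeing with human rankings, and the shaped reward is what a PPO-style update would then try to maximise.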

DPO: Simpler than RLHF

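Direct Preference Optimization (DPO) aims for similar alignment results without training a separate reward model or running a reinforcement learning loop. Instead, it optimizes the LLM directly on (chosen, rejected) response pairs, using a frozen copy of the starting model as a reference. Because it is simpler to implement and tune, it has become a popular choice in many fine-tuning pipelines.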
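
As a rough sketch, the DPO objective for a single preference pair can be written in a few lines. The inputs are sequence log-probabilities of the chosen and rejected responses under the model being tuned (the policy) and under a frozen reference model. The numbers, the `beta` value, and the function names are illustrative assumptions, not settings from any particular pipeline.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair."""
    # How much more (or less) likely each response is under the policy
    # than under the reference model, in log space.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))

# Policy identical to the reference: no preference learned yet.
loss_start = dpo_loss(-5.0, -5.0, -5.0, -5.0)

# Policy has shifted probability toward the chosen response.
loss_improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)

print(round(loss_start, 4), round(loss_improved, 4))
```

The loss falls as the policy raises the chosen response's probability relative to the reference and lowers the rejected one's, so the preference data shapes the model directly without an intermediate reward model.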