Why Transformers Matter
The transformer architecture, introduced in the 2017 paper "Attention Is All You Need", fundamentally changed how we process sequential data. Before transformers, recurrent neural networks (RNNs), including LSTMs, were the standard for text, speech, and time series. They had a critical limitation: they process data sequentially, one token at a time, so the work cannot be parallelized across the sequence.
Transformers process all tokens in parallel using a mechanism called self-attention. This makes them much faster to train on parallel hardware such as GPUs and better at capturing long-range dependencies.
The Core Idea: Self-Attention
Self-attention lets each word in a sentence look at every other word and decide how much to "attend" to it. When processing the sentence "The cat sat on the mat because it was tired," self-attention helps the model understand that "it" refers to "cat," not "mat."
How It Works
For each token, three vectors are computed:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information do I provide?"
The attention score between two tokens is the dot product of the query of one token with the key of the other. The scores are scaled and passed through softmax, and the resulting weights are used to take a weighted sum of the value vectors:
Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V
Here d_k is the dimension of the key vectors. Dividing by √d_k prevents the dot products from growing too large as d_k increases, which would push softmax into regions with tiny gradients.
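The formula above can be sketched in a few lines of plain Python. This is a minimal illustration rather than an optimized implementation: the function names `softmax` and `attention` and the toy Q, K, V values are assumptions made up for the example, and matrices are stored as lists of row vectors.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q and K have shape (n, d_k); V has shape (n, d_v).
    Returns (output, weights), with one row per query token.
    """
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Dot product of this query with every key, divided by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # Each output row is a weighted sum of the value vectors.
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return output, weights

# Toy inputs (made-up numbers): 3 tokens, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, w = attention(Q, K, V)
for row in w:
    print([round(x, 3) for x in row])
```

Each row of `w` sums to 1, so every output vector is a weighted average of the rows of `V`. In this toy case the first query matches the first and third keys equally well, so those two weights are equal and larger than the second.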
Multi-Head Attention
Instead of computing attention once, transformers use multiple attention "heads" in parallel. Each head learns to attend to different types of relationships:
- One head might learn syntactic relationships (subject-verb)
- Another might learn semantic relationships (synonyms)
- Another might learn positional patterns (adjacent words)
The outputs of all heads are concatenated and projected through a linear layer.
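A rough sketch of that concatenate-and-project step, assuming each head has its own query, key, and value projection matrices. The names `W_q`, `W_k`, `W_v`, `W_o`, the helper functions, and the random weights are illustrative assumptions; a real model would learn the weights during training.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) tuples, one tuple per head."""
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the per-head outputs along the feature axis, token by token.
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(X))]
    # Final linear projection back to d_model features.
    return matmul(concat, W_o)

def random_matrix(rows, cols, rng):
    """Stand-in for learned weights: small random values."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_tokens, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads  # each head works in a smaller subspace

X = random_matrix(n_tokens, d_model, rng)
heads = [
    tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
    for _ in range(n_heads)
]
W_o = random_matrix(d_model, d_model, rng)

Y = multi_head_attention(X, heads, W_o)
print(len(Y), len(Y[0]))
```

Splitting d_model across the heads (here 8 features into 2 heads of 4) keeps the total cost close to that of a single full-width head, which is the choice made in the original paper.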
Positional Encoding
Since transformers process all tokens in parallel, they have no inherent notion of word order. Positional encodings are added to the input embeddings to inject position information. The original paper uses sinusoidal functions:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
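These two formulas translate almost directly into code. The sketch below assumes d_model is even; the function name `positional_encoding` and the small sizes in the usage lines are chosen only for illustration.

```python
import math

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings as a max_len x d_model list of rows.

    Assumes d_model is even, so every sine column has a matching cosine column.
    """
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            pe[pos][2 * i] = math.sin(angle)      # even dimensions
            pe[pos][2 * i + 1] = math.cos(angle)  # odd dimensions
    return pe

pe = positional_encoding(max_len=5, d_model=8)
print(pe[0])  # position 0: sin(0) = 0 and cos(0) = 1 in alternation
```

Each row of `pe` is added element-wise to the embedding of the token at that position. Low dimension indices oscillate quickly with position and high indices slowly, so every position gets a distinct pattern.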
Later models often replace these: GPT models such as GPT-2 and GPT-3 use learned positional embeddings, while LLaMA uses rotary position embeddings (RoPE).
The Full Architecture
A transformer block consists of:
- Multi-head self-attention
- Add & normalize (residual connection + layer norm)
- Feed-forward network (two linear layers with activation)
- Add & normalize (another residual connection)
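The four steps above can be sketched as a single function. To keep the example short, this version makes several simplifying assumptions that are not part of the description above: it uses a single attention head in place of multi-head attention, ReLU as the activation, no bias terms, and a layer norm without learned gain and offset. All names and the random weights are illustrative.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_q, W_k, W_v):
    """Single-head stand-in for the multi-head self-attention sublayer."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    weights = [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        for q in Q
    ]
    return matmul(weights, V)

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(X, sublayer_out):
    """Residual connection followed by layer norm, applied per token."""
    return [layer_norm([a + b for a, b in zip(x, s)]) for x, s in zip(X, sublayer_out)]

def feed_forward(X, W1, W2):
    """Two linear layers with a ReLU in between."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(X, W1)]
    return matmul(hidden, W2)

def transformer_block(X, p):
    X = add_and_norm(X, self_attention(X, p["W_q"], p["W_k"], p["W_v"]))
    X = add_and_norm(X, feed_forward(X, p["W1"], p["W2"]))
    return X

rng = random.Random(0)

def rand(rows, cols):
    """Stand-in for learned weights."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def make_params(d_model, d_ff):
    return {
        "W_q": rand(d_model, d_model), "W_k": rand(d_model, d_model),
        "W_v": rand(d_model, d_model),
        "W1": rand(d_model, d_ff), "W2": rand(d_ff, d_model),
    }

n_tokens, d_model, d_ff = 3, 4, 8
X = rand(n_tokens, d_model)

# Stacking blocks: the output of one block is the input of the next.
layers = [make_params(d_model, d_ff) for _ in range(2)]
Y = X
for params in layers:
    Y = transformer_block(Y, params)
print(len(Y), len(Y[0]))
```

Because every block maps an (n_tokens, d_model) input to an output of the same shape, blocks can be stacked to any depth without extra glue code.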
Stack anywhere from 6 of these blocks (the original paper) to 96 (GPT-3), and you get models like BERT (encoder-only), GPT (decoder-only), or T5 (encoder-decoder).
- Encoder (BERT): sees all tokens at once, great for understanding tasks (classification, NER).
- Decoder (GPT): generates tokens left-to-right, great for generation.
- Encoder-Decoder (T5): input goes through the encoder, output through the decoder. Good for translation and summarization.
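In practice, the difference between "sees all tokens at once" and "left-to-right" usually comes down to a mask applied to the attention scores before softmax. The sketch below illustrates that idea with uniform made-up scores; the function name `attention_weights` and its `causal` flag are assumptions for this example, not the API of any particular library.

```python
import math

def attention_weights(scores, causal=False):
    """Row-wise softmax over an n x n score matrix.

    With causal=True, token i may only attend to tokens j <= i, as in a
    decoder. Masked positions are set to -inf, so their weight becomes 0.
    """
    weights = []
    for i, row in enumerate(scores):
        masked = [
            s if (not causal or j <= i) else float("-inf")
            for j, s in enumerate(row)
        ]
        m = max(masked)  # position j == i is never masked, so m is finite
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Uniform scores make the effect of the mask easy to see.
scores = [[0.0] * 4 for _ in range(4)]

encoder_w = attention_weights(scores)               # encoder-style: no mask
decoder_w = attention_weights(scores, causal=True)  # decoder-style: causal mask

print(encoder_w[0])  # [0.25, 0.25, 0.25, 0.25]
print(decoder_w[0])  # [1.0, 0.0, 0.0, 0.0]
print(decoder_w[1])  # [0.5, 0.5, 0.0, 0.0]
```

With the mask, the first token can only attend to itself, the second to the first two tokens, and so on. This is what lets a decoder be trained on whole sequences in parallel while still predicting each token from earlier ones only.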
Why Transformers Won
- Parallelism: RNNs must process sequentially; transformers process all positions simultaneously on GPUs.
- Long-range dependencies: Attention connects any two tokens directly, regardless of distance.
- Scalability: Transformer performance improves predictably with more data and parameters, which is the basis of the scaling laws that drive modern LLMs.
The Scaling Revolution
The transformer's ability to scale is what enabled the LLM revolution. GPT-3 (175B parameters), PaLM (540B), and later models showed that making transformers bigger and training them on more data produces emergent capabilities, meaning abilities that smaller models do not show. This insight, that architecture plus scale yields increasingly capable models, is the foundation of modern AI.
"Attention is all you need" turned out to be one of the most prophetic paper titles in computer science history.