Why Transformers Matter

The transformer architecture, introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al., fundamentally changed how we process sequential data. Before transformers, recurrent neural networks (RNNs), including LSTMs, were the standard for text, speech, and time series. They had a critical limitation: they process data sequentially, one token at a time, so the computation within a sequence cannot be parallelized.

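Transformers process all tokens in parallel using a mechanism called self-attention. This makes them much faster to train on parallel hardware such as GPUs, and better at capturing long-range dependencies, because any two tokens are connected by a single attention step regardless of how far apart they are.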

The Core Idea: Self-Attention

Self-attention lets each word in a sentence look at every other word and decide how much to "attend" to it. When processing the sentence "The cat sat on the mat because it was tired," self-attention helps the model understand that "it" refers to "cat," not "mat."

How It Works

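For each token, three vectors are computed by multiplying its embedding with learned weight matrices:

  1. Query (Q): what this token is looking for
  2. Key (K): what this token offers for matching against queries
  3. Value (V): the content passed along when the token is attended to

The attention score between two tokens is the dot product of the query of one token with the key of the other. The scores are scaled, passed through softmax to obtain weights, and used to take a weighted sum of the value vectors: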

Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V

The scaling factor √d_k keeps the dot products from growing too large as d_k increases. Without it, large scores would push softmax into saturated regions where gradients are tiny.
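
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the toy sizes (4 tokens, 8 dimensions) and the random projection matrices W_q, W_k, and W_v are assumptions made for the example, whereas a real model learns these weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Toy input: 4 tokens with 8-dimensional embeddings (sizes are arbitrary).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
# Hypothetical projection matrices; in a real model these are learned.
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
output, weights = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
```

Row t of weights shows how strongly token t attends to each token in the sequence, and row t of output is the corresponding weighted average of the value vectors.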

Multi-Head Attention

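Instead of computing attention once, transformers use multiple attention "heads" in parallel. Each head has its own query, key, and value projections, so different heads can learn to attend to different types of relationships. For example, one head may track syntactic relations while another tracks which noun a pronoun refers to.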

The outputs of all heads are concatenated and projected through a linear layer.
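
A rough NumPy sketch of the split, attend, concatenate, and project steps follows. It assumes d_model is divisible by the number of heads and that each head works on a d_model / num_heads slice of the projected vectors, which is a common arrangement. The weight names (W_q, W_k, W_v, W_o) and the random initialization are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """x: (seq_len, d_model); all weight matrices: (d_model, d_model)."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
    # Attention is computed independently for every head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (num_heads, seq_len, d_head)
    # Concatenate the heads back into (seq_len, d_model), then project.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 16, 4, 5
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads)
```

The output has the same shape as the input, which is what allows the layer to sit inside a residual connection.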

Positional Encoding

Since transformers process all tokens in parallel, they have no inherent notion of word order. Positional encodings are added to the input embeddings to inject position information. The original paper uses sinusoidal functions:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
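
The two formulas above can be computed for every position at once. The sketch below assumes an even d_model; the function name and the sizes in the usage line are chosen for illustration only.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) array; d_model is assumed to be even."""
    positions = np.arange(max_len)[:, None]       # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]     # the even indices 2i
    angles = positions / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
```

The resulting array is added element-wise to the token embeddings, so it must have the same width as the embeddings.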

Later models often replace the sinusoidal scheme: GPT-2 and GPT-3 use learned positional embeddings, while LLaMA uses rotary position embeddings (RoPE).

The Full Architecture

A transformer block consists of:

  1. Multi-head self-attention
  2. Add & normalize (residual connection + layer norm)
  3. Feed-forward network (two linear layers with activation)
  4. Add & normalize (another residual connection + layer norm)

Stack these blocks, from 6 in the original transformer to 96 in GPT-3, and you get models like BERT (encoder-only), GPT (decoder-only), or T5 (encoder-decoder).
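
The four steps and the stacking can be sketched as follows. To keep the example short, it makes several simplifying assumptions: single-head attention stands in for multi-head attention, layer norm omits its learned scale and shift, the activation is ReLU, and the parameter names and random initialization are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    # The learned scale and shift parameters are omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, p):
    # Single-head attention stands in for step 1 to keep the sketch short.
    Q, K, V = x @ p["W_q"], x @ p["W_k"], x @ p["W_v"]
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return (weights @ V) @ p["W_o"]

def feed_forward(x, p):
    # Two linear layers with a ReLU activation in between.
    hidden = np.maximum(0.0, x @ p["W_1"] + p["b_1"])
    return hidden @ p["W_2"] + p["b_2"]

def transformer_block(x, p):
    x = layer_norm(x + self_attention(x, p))   # steps 1 and 2
    x = layer_norm(x + feed_forward(x, p))     # steps 3 and 4
    return x

def init_params(rng, d_model, d_ff):
    # Hypothetical random initialization; real models learn these weights.
    shapes = {"W_q": (d_model, d_model), "W_k": (d_model, d_model),
              "W_v": (d_model, d_model), "W_o": (d_model, d_model),
              "W_1": (d_model, d_ff), "W_2": (d_ff, d_model)}
    p = {name: rng.normal(size=shape) * 0.1 for name, shape in shapes.items()}
    p["b_1"], p["b_2"] = np.zeros(d_ff), np.zeros(d_model)
    return p

rng = np.random.default_rng(0)
d_model, d_ff, num_layers = 16, 64, 6
layers = [init_params(rng, d_model, d_ff) for _ in range(num_layers)]
h = rng.normal(size=(5, d_model))   # 5 tokens entering the stack
for p in layers:
    h = transformer_block(h, p)     # each block keeps the shape (5, d_model)
```

Because every block maps a (seq_len, d_model) array to another array of the same shape, depth is just a matter of repeating the loop.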

Encoder vs Decoder

  1. Encoder (BERT): sees all tokens at once, great for understanding tasks (classification, NER).
  2. Decoder (GPT): generates tokens left-to-right, great for generation.
  3. Encoder-Decoder (T5): input goes through the encoder, output through the decoder; good for translation and summarization.
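
In terms of attention, the difference between "sees all tokens at once" and "left-to-right" is commonly realized with a causal mask that hides future positions from each token. The sketch below illustrates that idea under assumed toy sizes; the function name and the causal flag are hypothetical, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def attention_weights(Q, K, causal=False):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if causal:
        seq_len = scores.shape[0]
        # True above the diagonal marks "future" positions to hide.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        # A score of -inf becomes a weight of exactly 0 after softmax.
        scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
encoder_weights = attention_weights(Q, K)               # every token sees every token
decoder_weights = attention_weights(Q, K, causal=True)  # token t sees only tokens 0..t
```

With the mask applied, the weight matrix is lower-triangular, so the first token can only attend to itself.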

Why Transformers Won

The Scaling Revolution

The transformer's ability to scale is what enabled the LLM revolution. GPT-3 (175B parameters), PaLM (540B), and later models showed that making transformers bigger and training them on more data yields capabilities that smaller models lack, often described as emergent. This insight, often summarized as architecture + scale = intelligence, is the foundation of modern AI.

"Attention is all you need" turned out to be one of the most prophetic paper titles in computer science history.