Transformer Architecture with Examples
Let’s dive into the Transformer architecture and break down the dimensions of the input, the input embeddings, and how the data is transformed as it moves through the model. I’ll provide a clear, detailed explanation of the full architecture, focusing on how the input evolves step by step. Since you’re asking about dimensions and transformations, I’ll assume you’re familiar with the basics but want a precise, end-to-end walkthrough.
Transformer Architecture Overview
The Transformer, introduced in "Attention Is All You Need" (Vaswani et al., 2017), consists of an encoder and a decoder, both built from stacked layers. It’s designed for sequence-to-sequence tasks (e.g., translation), but I’ll describe the general architecture, noting dimensions at each step. For concreteness, I’ll use typical values like a model dimension ( d_{\text{model}} = 512 ) and a vocabulary size ( V = 30,000 ), though these can vary (e.g., BERT-base uses ( d_{\text{model}} = 768 ); GPT models vary by size).
Step 1: Input

- Raw Input: A sequence of tokens (words, subwords, etc.) from a vocabulary. For example, a sentence like "The cat sleeps" might be tokenized into (["The", "cat", "sleeps"]).
- Dimensions: If the input sequence has length ( T ) (e.g., ( T = 3 ) for "The cat sleeps"), the input is a 1D tensor of token IDs:
- Shape: ( [T] ); for our example, the tensor would be ( [784, 231, 1509] ) (token IDs from the vocabulary).
- Batch Consideration: In practice, we process batches. For batch size ( B ), the input becomes:
- Shape: ( [B, T] ).
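As a quick sanity check, a batch of token IDs is just an integer tensor of shape ( [B, T] ). The sketch below (like the others in this walkthrough) uses PyTorch as an illustrative choice, and the IDs are made up rather than produced by a real tokenizer:

```python
import torch

# Hypothetical token IDs for B = 2 sequences of length T = 3
# (illustrative values, not from a real tokenizer).
token_ids = torch.tensor([
    [784, 231, 1509],   # "The cat sleeps"
    [12,  57,  9021],   # another sentence in the batch
])
print(token_ids.shape)  # torch.Size([2, 3]) -> [B, T]
```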
Step 2: Input Embeddings
- Transformation: Each token ID is mapped to a dense vector using an embedding layer (a lookup table).
- Embedding Matrix: A learnable matrix of shape ( [V, d_{\text{model}}] ), where ( V ) is the vocabulary size (e.g., 30,000) and ( d_{\text{model}} ) is the embedding dimension (e.g., 512).
- Output: Each token ID is replaced by its corresponding ( d_{\text{model}} )-dimensional vector.
- Dimensions: For a single sequence, the output is:
- Shape: ( [T, d_{\text{model}}] ), e.g., ( [3, 512] ).
- For a batch: ( [B, T, d_{\text{model}}] ), e.g., ( [B, 3, 512] ).
- Example: "The" (ID 784) → ( [0.1, -0.3, ..., 0.5] ) (a 512D vector).
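A minimal sketch of this lookup in PyTorch; `nn.Embedding` stores exactly the ( [V, d_{\text{model}}] ) table described above, and the token IDs are the made-up ones from Step 1:

```python
import torch
import torch.nn as nn

V, d_model = 30_000, 512                 # vocabulary size and model dimension
embedding = nn.Embedding(V, d_model)     # learnable lookup table of shape [V, d_model]

token_ids = torch.tensor([[784, 231, 1509]])  # [B, T] = [1, 3]
x = embedding(token_ids)                      # each ID is replaced by its 512-d row
print(x.shape)  # torch.Size([1, 3, 512]) -> [B, T, d_model]
```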
Step 3: Positional Encodings
- Why: Transformers lack recurrence, so they need positional information to understand token order.
- Transformation: Add fixed or learned positional encodings to the input embeddings. These are vectors of the same size as the embeddings (( d_{\text{model}} )).
- Formula (fixed, sinusoidal):
- ( PE(pos, 2i) = \sin(pos / 10000^{2i / d_{\text{model}}}) )
- ( PE(pos, 2i+1) = \cos(pos / 10000^{2i / d_{\text{model}}}) )
- Where ( pos ) is the position (0 to ( T-1 )), and ( i ) is the dimension index (0 to ( d_{\text{model}}/2 - 1 )).
- Output: Input embeddings + positional encodings.
- Dimensions: Unchanged, still ( [B, T, d_{\text{model}}] ).
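Here is a sketch of the fixed sinusoidal variant, assuming ( d_{\text{model}} ) is even; the function name is mine:

```python
import torch

def sinusoidal_positional_encoding(T: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoidal encodings of shape [T, d_model], per the formulas above."""
    pos = torch.arange(T, dtype=torch.float32).unsqueeze(1)       # positions, shape [T, 1]
    i = torch.arange(0, d_model, 2, dtype=torch.float32)          # even dimension indices 2i
    angle = pos / (10000 ** (i / d_model))                        # [T, d_model/2]
    pe = torch.zeros(T, d_model)
    pe[:, 0::2] = torch.sin(angle)   # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(angle)   # PE(pos, 2i+1)
    return pe

x = torch.randn(1, 3, 512)                       # [B, T, d_model] token embeddings
x = x + sinusoidal_positional_encoding(3, 512)   # broadcasts over the batch dimension
print(x.shape)  # torch.Size([1, 3, 512])
```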
Step 4: Encoder
The encoder has ( N ) identical layers (e.g., ( N = 6 )). Each layer has two main sub-layers:
a) Multi-Head Self-Attention
- Inputs: ( [B, T, d_{\text{model}}] ) (e.g., ( [B, 3, 512] )).
- Mechanism: Compute queries (( Q )), keys (( K )), and values (( V )) using linear projections:
- ( Q = X W_Q ), ( K = X W_K ), ( V = X W_V ), where ( X ) is the input, and each ( W ) is ( [d_{\text{model}}, d_k] ) or ( [d_{\text{model}}, d_v] ).
- Typically, ( d_k = d_v = d_{\text{model}} / h ), where ( h ) is the number of heads (e.g., ( h = 8 ), so ( d_k = d_v = 64 )).
- Attention: ( \text{Attention}(Q, K, V) = \text{softmax}(QK^T / \sqrt{d_k}) V ).
- ( QK^T ): ( [B, T, T] ) per head.
- Output per head: ( [B, T, d_v] ).
- Multi-Head: Concatenate ( h ) heads, then project with ( W_O ) (( [h \cdot d_v, d_{\text{model}}] )):
- Output: ( [B, T, d_{\text{model}}] ).
- Residual + Norm: Add input to output, then layer normalize.
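Below is a compact sketch of this sub-layer in PyTorch. The class name is mine, the four `nn.Linear` layers pack the per-head projections into single ( [d_{\text{model}}, d_{\text{model}}] ) matrices, and details like dropout are omitted:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Illustrative multi-head self-attention; shapes match the text above."""
    def __init__(self, d_model: int = 512, h: int = 8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.W_Q = nn.Linear(d_model, d_model)   # packs h projections of shape [d_model, d_k]
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.W_O = nn.Linear(d_model, d_model)   # the [h * d_v, d_model] output projection

    def forward(self, x, mask=None):             # x: [B, T, d_model]
        B, T, _ = x.shape
        def split(t):                            # [B, T, d_model] -> [B, h, T, d_k]
            return t.view(B, T, self.h, self.d_k).transpose(1, 2)
        Q, K, V = split(self.W_Q(x)), split(self.W_K(x)), split(self.W_V(x))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)      # [B, h, T, T]
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))   # used by the decoder later
        out = F.softmax(scores, dim=-1) @ V                         # [B, h, T, d_k]
        out = out.transpose(1, 2).reshape(B, T, self.h * self.d_k)  # concatenate the heads
        return self.W_O(out)                                        # back to [B, T, d_model]

attn = MultiHeadSelfAttention()
y = attn(torch.randn(1, 3, 512))
print(y.shape)  # torch.Size([1, 3, 512])
```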
b) Feed-Forward Network (FFN)
- Inputs: ( [B, T, d_{\text{model}}] ).
- Transformation: Per token, apply two linear layers with a ReLU in between (sketched in code below):
- ( FFN(x) = \max(0, x W_1 + b_1) W_2 + b_2 ).
- ( W_1 ): ( [d_{\text{model}}, d_{\text{ff}}] ) (e.g., ( d_{\text{ff}} = 2048 )).
- ( W_2 ): ( [d_{\text{ff}}, d_{\text{model}}] ).
- Output: ( [B, T, d_{\text{model}}] ).
- Residual + Norm: Add input, then normalize.
- Encoder Output: After ( N ) layers, still ( [B, T, d_{\text{model}}] ).
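The position-wise FFN from sub-layer (b), plus the residual-and-normalize step, can be sketched like this (class name mine, dropout again omitted):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: the same two linear layers applied to every token."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # W_1: [d_model, d_ff]
            nn.ReLU(),                  # max(0, .)
            nn.Linear(d_ff, d_model),   # W_2: [d_ff, d_model]
        )

    def forward(self, x):               # x: [B, T, d_model]
        return self.net(x)

x = torch.randn(1, 3, 512)
ffn, norm = FeedForward(), nn.LayerNorm(512)
out = norm(x + ffn(x))                  # residual connection, then LayerNorm
print(out.shape)  # torch.Size([1, 3, 512])
```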
Step 5: Decoder
The decoder also has ( N ) layers, with three sub-layers per layer. It generates the output sequence (length ( T' )) autoregressively.
a) Masked Multi-Head Self-Attention
- Inputs: Output embeddings (shifted right) + positional encodings, shape ( [B, T', d_{\text{model}}] ).
- Masking: Prevent attending to future tokens using a causal mask.
- Output: ( [B, T', d_{\text{model}}] ).
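The causal mask itself is just a lower-triangular boolean matrix; fed into the attention sketch from Step 4, it blocks each position’s access to later positions:

```python
import torch

T_prime = 3
# Position i may attend to positions 0..i only.
causal_mask = torch.tril(torch.ones(T_prime, T_prime, dtype=torch.bool))
print(causal_mask)
# tensor([[ True, False, False],
#         [ True,  True, False],
#         [ True,  True,  True]])
# In the attention sketch above, entries where the mask is False have their
# scores set to -inf before the softmax, so they receive zero attention weight.
```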
b) Multi-Head Attention (Encoder-Decoder)
- Inputs: Decoder’s ( Q ) (( [B, T', d_{\text{model}}] )), encoder’s ( K ) and ( V ) (( [B, T, d_{\text{model}}] )).
- Output: ( [B, T', d_{\text{model}}] ).
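This is the same attention mechanism as before; only the sources of ( Q ), ( K ), and ( V ) differ. A single-head, projection-free sketch just to show where ( T ) and ( T' ) end up (shapes only, not a faithful implementation):

```python
import torch
import torch.nn.functional as F

d_model = 512
dec = torch.randn(1, 3, d_model)   # decoder states:  [B, T', d_model], T' = 3
enc = torch.randn(1, 5, d_model)   # encoder output:  [B, T,  d_model], T  = 5 here

scores = dec @ enc.transpose(-2, -1) / d_model ** 0.5   # [B, T', T]
out = F.softmax(scores, dim=-1) @ enc                   # [B, T', d_model]
print(out.shape)  # torch.Size([1, 3, 512])
```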
c) Feed-Forward Network
- Same as encoder: ( [B, T', d_{\text{model}}] ).
- Decoder Output: After ( N ) layers, ( [B, T', d_{\text{model}}] ).
Step 6: Output Projection
- Transformation: Linear layer maps ( d_{\text{model}} ) to vocabulary size ( V ).
- Weight matrix: ( [d_{\text{model}}, V] ).
- Output: ( [B, T', V] ) (logits over vocabulary).
- Final Step: Apply a softmax over the vocabulary dimension to get probabilities, still ( [B, T', V] ).
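A sketch of the projection and softmax; note that PyTorch’s `nn.Linear(d_model, V)` stores its weight transposed, as ( [V, d_{\text{model}}] ), but implements the same ( d_{\text{model}} \to V ) map:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, d_model = 30_000, 512
to_vocab = nn.Linear(d_model, V)         # the d_model -> V projection

dec_out = torch.randn(1, 3, d_model)     # decoder output: [B, T', d_model]
logits = to_vocab(dec_out)               # [B, T', V]
probs = F.softmax(logits, dim=-1)        # probabilities over the vocabulary
print(logits.shape)                      # torch.Size([1, 3, 30000])
print(probs[0, -1].argmax().item())      # ID of the most likely token at the last position
```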
Full Transformation Summary
- Input: ( [B, T] ) (token IDs).
- Embeddings + Positional: ( [B, T, d_{\text{model}}] ).
- Encoder (N layers): ( [B, T, d_{\text{model}}] ).
- Decoder Input: ( [B, T', d_{\text{model}}] ) (shifted output embeddings).
- Decoder (N layers): ( [B, T', d_{\text{model}}] ).
- Output: ( [B, T', V] ) (probabilities).
Example with Numbers
- Input: "The cat sleeps" (( B = 1, T = 3 )).
- Embeddings: ( [1, 3, 512] ).
- Encoder: ( [1, 3, 512] ).
- Decoder (target "Le chat dort", ( T' = 3 )): ( [1, 3, 512] ) after the decoder layers → ( [1, 3, 30,000] ) after the output projection.
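If you want to confirm these shapes end to end without writing the layers yourself, PyTorch’s built-in `nn.Transformer` reproduces them. It expects already-embedded, position-encoded inputs, so random tensors stand in for the embedded sentences here:

```python
import torch
import torch.nn as nn

B, T, T_prime, V, d_model = 1, 3, 3, 30_000, 512

model = nn.Transformer(
    d_model=d_model, nhead=8,
    num_encoder_layers=6, num_decoder_layers=6,
    dim_feedforward=2048, batch_first=True,
)
to_vocab = nn.Linear(d_model, V)

src = torch.randn(B, T, d_model)         # embedded + position-encoded "The cat sleeps"
tgt = torch.randn(B, T_prime, d_model)   # embedded + position-encoded, shifted-right "Le chat dort"
causal = nn.Transformer.generate_square_subsequent_mask(T_prime)

dec_out = model(src, tgt, tgt_mask=causal)   # [1, 3, 512]
logits = to_vocab(dec_out)                   # [1, 3, 30000]
print(dec_out.shape, logits.shape)
```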