Self Attention & Transformer
As we know, Attention Is All You Need [4] introduced this concept, which is the building block of the Transformer.
So let's look at it in detail.
TOC
- Transformer Architecture
- RNN for self attention
- Self attention architecture
- QKV
- Positional representation
- Future masking
- Transformer Block
- Multihead self attention
- Feedforward layer
- Layer normalization
- Residual connection
- Logit scaling with Softmax
- Transformer Encoder
- Transformer Decoder
- Cross attention
Transformer Architecture

➤ RNN for self attention

A Recurrent Neural Network is able to condition on all the previous words in the corpus. At time step \(t\), the hidden layer receives two inputs: \(x_t\) and the previous hidden state \(h_{t-1}\), which are multiplied by the weight matrices \(W^{(hx)}\) and \(W^{(hh)}\) (shown as U in the figure) to produce the hidden state \(h_t\). This is then multiplied by a weight matrix \(W^{(S)}\) and passed through a Softmax to predict the output \(\hat{y}\) (the next word).
To put it all together
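In equation form (a minimal sketch using the notation above, with \(\sigma\) as the hidden-layer nonlinearity):

\[
h_t = \sigma\left(W^{(hh)} h_{t-1} + W^{(hx)} x_t\right), \qquad \hat{y}_t = \mathrm{Softmax}\left(W^{(S)} h_t\right)
\]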
But the issue with RNNs is the difficulty with which distant tokens in a sequence can interact with each other.
➤ Self attention architecture
Attention is a method for taking a query (what we are looking for) and looking up information in a key-value store by picking the values whose keys (what we already have in our corpus) most likely match the query. It does this by averaging over all the values, putting more weight on those whose keys are more similar to the query.
A self-attention layer maps input sequences (x1,...,xn) to output sequences of the same length (a1,...,an). When processing each item in the input, the model has access to all of the inputs up to and including the one under consideration, but no access to information about inputs beyond the current one (No future tokens).
- QKV

For a token \(x_i\) from the sequence \(x_{1:n}\), we define the query \(q_i = Q x_i\) with a weight matrix \(Q\). Then for each token \(x_j\) in the sequence \([x_1, x_2, \dots, x_n]\), we define the key \(k_j = K x_j\) and the value \(v_j = V x_j\) with two more weight matrices.
So what we basically do is take our element \(x_i\) and look within its own sequence (it's like looking around the room of the house we are currently standing in) with the help of the K, Q, V matrices and a Softmax.
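To make the QKV lookup concrete, here is a minimal sketch of single-head scaled dot-product self-attention in PyTorch; the weight-matrix names `W_q`, `W_k`, `W_v` and the toy dimensions are my own, following the notation above rather than any reference implementation.

```python
import torch
import torch.nn.functional as F

def self_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    x:   [n, d]  sequence of n token vectors
    W_q, W_k, W_v: [d, d] weight matrices defining queries, keys, values
    """
    q = x @ W_q            # queries: what each token is looking for
    k = x @ W_k            # keys:    what each token offers for matching
    v = x @ W_v            # values:  what each token actually returns
    d = q.shape[-1]
    scores = q @ k.T / d ** 0.5          # [n, n] similarity of every query to every key
    weights = F.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v                   # weighted average of the values

# tiny usage example
n, d = 5, 16
x = torch.randn(n, d)
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
out = self_attention(x, W_q, W_k, W_v)   # shape [5, 16]
```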
- Positional representation
In the self-attention operation, there's no built-in notion of order. There are two things to consider: 1. the representation of \(x\) is not position-dependent; it's just \(Ew\) for whatever word \(w\), and 2. there's no dependence on position in the self-attention operation itself.
Therefore, to represent positions in self-attention, we use vectors that are already position-dependent as inputs: we add an embedded representation of the position of a word (\(P\)) to its word embedding, as sketched below. Another way is to change the self-attention operation itself by adding a linear bias, but that is a bit more complex, so we will use positional encodings here.
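As one common concrete choice, here is a sketch of the sinusoidal position representation from the original paper; the function name `sinusoidal_positions` is mine (and assumes an even \(d\)), and a learned embedding matrix \(P\) would be added to the word embeddings in exactly the same way.

```python
import torch

def sinusoidal_positions(n, d):
    """Sinusoidal position representations P of shape [n, d] (assumes d is even)."""
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)   # [n, 1] positions 0..n-1
    i = torch.arange(0, d, 2, dtype=torch.float32)            # even dimension indices
    angles = pos / torch.pow(10000.0, i / d)                  # [n, d/2]
    P = torch.zeros(n, d)
    P[:, 0::2] = torch.sin(angles)                            # even dims get sine
    P[:, 1::2] = torch.cos(angles)                            # odd dims get cosine
    return P

# add position information to the (embedded) input sequence
n, d = 5, 16
x = torch.randn(n, d)                 # stand-in for the embedded tokens
x = x + sinusoidal_positions(n, d)    # position-dependent input to self-attention
```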
- Future masking
To stop the current token from looking at future tokens, we use masking: the attention scores for future tokens are set to \(-\infty\), so they receive zero weight after the Softmax and contribute no information.
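A minimal sketch of future (causal) masking, assuming an \([n, n]\) matrix of attention scores as above; entries for positions \(j > i\) are set to \(-\infty\) before the Softmax so they receive exactly zero weight.

```python
import torch
import torch.nn.functional as F

n = 5
scores = torch.randn(n, n)                        # stand-in for q_i . k_j / sqrt(d) for every pair (i, j)
mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)  # True above the diagonal = future positions
scores = scores.masked_fill(mask, float("-inf"))  # future positions get -inf
weights = F.softmax(scores, dim=-1)               # rows still sum to 1; future weights are exactly 0
```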
➤ Transformer Block

A Transformer block includes a feedforward layer, layer normalization, and residual connections, together with multihead attention.
Computation inside transformer block
| Steps | Description |
|---|---|
| XS = X + SelfAttention(X) | Self-attention with a residual connection; the input X has shape [N, d] |
| XL = LayerNorm(XS) | Apply layer normalization |
| XF = FFN(XL) | Apply the feedforward network |
| XA = XF + XL | Add the two outputs (residual connection) |
| HO = LayerNorm(XA) | Apply layer normalization to get the final block output |
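Translating the table into code, here is a rough sketch of one (post-norm) Transformer block in PyTorch; PyTorch's built-in `nn.MultiheadAttention` stands in for the multihead attention described next, and the hyperparameters are arbitrary.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One post-norm Transformer block following the table above."""
    def __init__(self, d, n_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))

    def forward(self, x):                      # x: [batch, N, d]
        xs = x + self.attn(x, x, x)[0]         # XS = X + SelfAttention(X)
        xl = self.norm1(xs)                    # XL = LayerNorm(XS)
        xf = self.ffn(xl)                      # XF = FFN(XL)
        xa = xf + xl                           # XA = XF + XL  (residual add)
        return self.norm2(xa)                  # HO = LayerNorm(XA)

block = TransformerBlock(d=16, n_heads=4, d_ff=64)
out = block(torch.randn(2, 5, 16))             # [2, 5, 16], same shape as the input
```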
- Multihead self attention

In multihead attention, each of the self-attention heads is provided with its own set of key, query, and value weight matrices. The outputs from each of the heads are concatenated and then projected back to d, producing an output of the same size as the input, so the attention can be followed by layer norm and feedforward, and the layers can be stacked.
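A minimal sketch of multihead self-attention, assuming d is divisible by the number of heads; each head works on its own slice of the projected queries, keys, and values, and the concatenated head outputs are projected back to d. The class and variable names are mine.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d, n_heads):
        super().__init__()
        assert d % n_heads == 0
        self.n_heads, self.d_head = n_heads, d // n_heads
        self.W_q = nn.Linear(d, d, bias=False)   # per-head Q, K, V packed into one matrix each
        self.W_k = nn.Linear(d, d, bias=False)
        self.W_v = nn.Linear(d, d, bias=False)
        self.W_o = nn.Linear(d, d, bias=False)   # output projection back to d

    def forward(self, x):                        # x: [n, d]
        n, d = x.shape
        # project, then split the last dimension into (heads, d_head)
        q = self.W_q(x).view(n, self.n_heads, self.d_head).transpose(0, 1)  # [heads, n, d_head]
        k = self.W_k(x).view(n, self.n_heads, self.d_head).transpose(0, 1)
        v = self.W_v(x).view(n, self.n_heads, self.d_head).transpose(0, 1)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5               # [heads, n, n]
        out = F.softmax(scores, dim=-1) @ v                                 # [heads, n, d_head]
        out = out.transpose(0, 1).reshape(n, d)                             # concatenate the heads
        return self.W_o(out)                                                # project back to d

mha = MultiHeadSelfAttention(d=16, n_heads=4)
out = mha(torch.randn(5, 16))                    # [5, 16], same size as the input
```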
- Feedforward layer
It's common to apply a feed-forward network independently to each word representation after self-attention. The feedforward layer contains N position-wise networks, one at each position. Each is a fully-connected 2-layer network, i.e., one hidden layer and two weight matrices. The weights are the same for each position, but the parameters are different from layer to layer. Unlike attention, the hidden dimension of the feed-forward network is substantially larger than the model dimension, and since the computation is independent for each position, a lot of the computation and parameters can be applied efficiently in parallel.
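As a sketch, the position-wise feed-forward network in PyTorch; the hidden size `d_ff` (here 4×d, as in the original paper) is an assumption.

```python
import torch
import torch.nn as nn

d, d_ff = 16, 64                     # hidden dimension d_ff is substantially larger than d
ffn = nn.Sequential(
    nn.Linear(d, d_ff),              # first weight matrix (up to the hidden layer)
    nn.ReLU(),
    nn.Linear(d_ff, d),              # second weight matrix (back to the model dimension)
)

x = torch.randn(5, d)                # 5 positions
out = ffn(x)                         # applied independently (and in parallel) to each position
```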
- Layer normalization
Layer normalization (layer norm) is one of many forms of normalization that can be used to reduce uninformative variation in the activations at a layer, providing a more stable input to the next layer. Layer norm is a variation of the standard score, or z-score, from statistics applied to a single vector in a hidden layer. The input to layer norm is a single vector, for a particular token position i, and the output is that vector normalized. Thus layer norm takes as input a single vector of dimensionality d and produces as output a single vector of dimensionality d.
To compute layer norm:
- computes statistics across the activations at a layer to estimate the mean and variance of the activations
- normalizes the activations with respect to those estimates, while optionally learning (as parameters) an elementwise additive bias and multiplicative gain by which to re-scale the normalized activations in a predictable way (see the sketch below).
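A minimal sketch of layer norm applied to a single d-dimensional vector, with a learned elementwise gain (`gamma`) and bias (`beta`); the function and variable names are mine.

```python
import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize a single vector x of dimension d, then rescale and shift it."""
    mean = x.mean()                              # statistics across the activations of the layer
    var = x.var(unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)   # z-score of each activation
    return gamma * x_hat + beta                  # learned elementwise gain and bias

d = 16
x = torch.randn(d)
gamma, beta = torch.ones(d), torch.zeros(d)      # initialised to the identity transform
out = layer_norm(x, gamma, beta)                 # same dimensionality d as the input
```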
- Residual connection
Residual connections are connections that pass information from a lower layer to a higher layer without going through the intermediate layer. Allowing information from the activation going forward and the gradient going backwards to skip a layer improves learning and gives higher level layers direct access to information from lower layer. Residual connections can be implemented by simply adding a layer's input vector to its output vector before passing it forward.
➤ Logit scaling with Softmax
The dot product part comes from the fact that we're computing dot products \(q_i^{\top} k_j\). The intuition behind scaling is that, when the dimensionality \(d\) of the vectors we are dotting grows large, the dot product of even random vectors grows roughly as \(\sqrt{d}\), which pushes the Softmax toward very peaked distributions. So, we divide the dot products by \(\sqrt{d}\) to stop this growth.
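A quick numerical sanity check of that intuition (a sketch, not from the references): for random unit-variance vectors, the standard deviation of the dot product grows as \(\sqrt{d}\), and dividing by \(\sqrt{d}\) keeps it near 1.

```python
import torch

for d in (16, 256, 4096):
    q = torch.randn(10000, d)
    k = torch.randn(10000, d)
    dots = (q * k).sum(dim=-1)               # 10,000 random dot products in dimension d
    print(d, dots.std().item(), (dots / d ** 0.5).std().item())
# the std of the raw dot products grows as sqrt(d); the scaled version stays near 1
```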
➤ Transformer Encoder

A Transformer Encoder takes a single sequence \(w_{1:n}\) and performs no future masking, so every token (even the first) can see the whole sequence when building its representation. It embeds the sequence with \(E\) to form the input \(x_{1:n}\), adds the position representation, and then applies a stack of independently parameterized Encoder Blocks, each consisting of multihead attention with Add & Norm and feed-forward with Add & Norm. The output of each Block is the input to the next.
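A rough sketch of this pipeline using PyTorch's built-in encoder layers as a stand-in for the blocks described above; the vocabulary size, depth, and learned position embedding `P` are arbitrary assumptions.

```python
import torch
import torch.nn as nn

d, vocab, max_len = 16, 1000, 512
E = nn.Embedding(vocab, d)                   # token embedding matrix E
P = nn.Embedding(max_len, d)                 # learned position representations
layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, dim_feedforward=64, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=4)   # stack of independently parameterized blocks

w = torch.randint(0, vocab, (1, 7))          # a single sequence w_{1:n} of token ids
x = E(w) + P(torch.arange(7)).unsqueeze(0)   # embed and add the position representation
h = encoder(x)                               # no mask passed -> every token attends to the whole sequence
```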
➤ Transformer Decoder

Now, to solve the problem with the encoder (since we are trying to build an autoregressive model), we use the Transformer Decoder, which applies future masking at each self-attention layer, as we saw above.
- Cross attention
As the name suggests, unlike self-attention (recall looking within the same room of the house), which looks within its own sequence, cross-attention uses one sequence, the encoder output, to define the keys and values, and another sequence (from the other room, the decoder) to define the queries, generating an intermediate representation of the output sequence.
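A minimal sketch of cross-attention: the decoder states supply the queries, while the encoder output supplies the keys and values. The function and weight-matrix names are mine, following the earlier notation.

```python
import torch
import torch.nn.functional as F

def cross_attention(dec, enc, W_q, W_k, W_v):
    """dec: [m, d] decoder states (the 'other room', supplying queries)
       enc: [n, d] encoder output (supplying keys and values)."""
    q = dec @ W_q                           # queries from the decoder sequence
    k = enc @ W_k                           # keys from the encoder output
    v = enc @ W_v                           # values from the encoder output
    d = q.shape[-1]
    weights = F.softmax(q @ k.T / d ** 0.5, dim=-1)   # [m, n]: each decoder position attends over the encoder
    return weights @ v                      # [m, d]

m, n, d = 4, 7, 16
dec, enc = torch.randn(m, d), torch.randn(n, d)
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
out = cross_attention(dec, enc, W_q, W_k, W_v)   # shape [4, 16]
```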
Example
References
1. https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1244/readings/cs224n-self-attention-transformers-2023_draft.pdf
2. https://web.stanford.edu/~jurafsky/slpdraft/10.pdf
3. https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1244/readings/cs224n-2019-notes05-LM_RNN.pdf
4. https://arxiv.org/abs/1706.03762