**Transformers** are a major milestone in artificial intelligence research: deep feed-forward neural network architectures that rely on self-attention and can model long-range dependencies between the items of an input sequence. Since Vaswani et al. [3] introduced the architecture in 2017, its success in industry and academic research has prompted a large number of variants (a.k.a. X-formers). Originally proposed for natural language processing (NLP), Transformers have since been adopted in computer vision (CV), audio and speech processing, chemistry, and life sciences, where they achieve state-of-the-art (SOTA) performance on many tasks. In this article, I explain the transformer architecture through its underlying math, Python code implementations, and visualizations of different layers. End-to-end examples are available in the TransformerX library repository on GitHub.

Transformers build on lower-level concepts such as attention mechanisms and the terminology of encoder-decoder models. Therefore, I start with a brief summary of these ideas.

Attention is the allocation of a cognitive resource scheme with limited processing power [1].

The general idea behind attention, as proposed by Bahdanau et al. [2], is that at each translation step the model searches for the most relevant information located at different positions in the input sequence. It then generates the next target word based on 1) the context vector built from these relevant positions and 2) the previously generated words.

Attention mechanisms can be classified into various categories based on several criteria, such as:

**The softness of attention:**

1. Soft
2. Hard
3. Local
4. Global

**Forms of input feature:**

1. Item-wise
2. Location-wise

**Input representation:**

1. Co-attention
2. Self-attention
3. Distinctive attention
4. Hierarchical attention

**Output representation:**

1. Multi-head
2. Single output
3. Multi-dimensional


The base Transformer architecture [3] consists of two main building blocks: an encoder and a decoder. The encoder generates a sequence of embedding vectors 𝒁 = (𝒛₁, …, 𝒛ₙ) from the input representation sequence (𝒙₁, …, 𝒙ₙ) and passes it to the decoder, which generates the output sequence (𝒚₁, …, 𝒚ₘ) one element at a time. At each step, the decoder receives 𝒁 together with the previously generated outputs, hence the model is auto-regressive.

## 3.1. Encoder and decoder components

Similar to sequence-to-sequence models, the Transformer uses an encoder-decoder architecture.

## 3.1.1. Encoder

The encoder is a stack of 𝑵 identical layers (𝑵 = 6 in the original paper), each consisting of two sub-layers: a multi-head self-attention block and a simple position-wise fully connected feed-forward network (FFN). To enable a deeper model, the authors wrap each of the two sub-layers in a residual connection followed by layer normalization. The output of each sub-layer is therefore *LayerNorm(𝒙 + Sublayer(𝒙))*, where *Sublayer(𝒙)* is the function implemented by the sub-layer itself. All sub-layers, as well as the embeddings, produce outputs of dimension 𝒅_model = 512.

Implementation of a Transformer encoder block:
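As a minimal, dependency-free sketch of the structure described above (not the TransformerX code), the block below stacks encoder layers built from a self-attention sub-layer and a position-wise FFN, each wrapped as *LayerNorm(𝒙 + Sublayer(𝒙))*. The simplifications are assumptions made for brevity: attention is single-head with no learned projections, layer normalization has no gain or bias, the sizes are toy values, and all function names are illustrative.

```python
import math
import random

def matmul(A, B):
    """Matrix product of nested lists: (n x k) times (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head scaled dot-product attention with Q = K = V = X."""
    d = len(X[0])
    X_t = [list(col) for col in zip(*X)]
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in matmul(X, X_t)]
    return matmul(weights, X)

def ffn(X, W1, b1, W2, b2):
    """Position-wise feed-forward network: ReLU(X W1 + b1) W2 + b2."""
    H = [[max(0.0, h + b) for h, b in zip(row, b1)] for row in matmul(X, W1)]
    return [[o + b for o, b in zip(row, b2)] for row in matmul(H, W2)]

def layer_norm(x, eps=1e-6):
    """Normalize one position to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(X, Y):
    """LayerNorm(x + Sublayer(x)) for every position, where Y = Sublayer(X)."""
    return [layer_norm([a + b for a, b in zip(x, y)]) for x, y in zip(X, Y)]

def encoder_layer(X, ffn_params):
    X = add_and_norm(X, self_attention(X))        # sub-layer 1: self-attention
    return add_and_norm(X, ffn(X, *ffn_params))   # sub-layer 2: position-wise FFN

def encoder(X, layers):
    """Apply a stack of layers; `layers` holds one FFN parameter tuple per layer."""
    for ffn_params in layers:
        X = encoder_layer(X, ffn_params)
    return X

rng = random.Random(0)

def rand(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

d_model, d_ff, n_layers = 4, 8, 2   # toy sizes; the paper uses 512, 2048 and N = 6
layers = [(rand(d_model, d_ff), [0.0] * d_ff, rand(d_ff, d_model), [0.0] * d_model)
          for _ in range(n_layers)]
X = rand(3, d_model)                # 3 input positions
Z = encoder(X, layers)              # 3 x d_model, same shape as the input
```

Because every sub-layer keeps the model dimension, layers can be stacked freely and the output 𝒁 has one vector per input position.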

## 3.1.2. Decoder

In addition to the two sub-layers used in the encoder, each decoder layer has a third sub-layer that applies multi-head attention over the outputs of the encoder stack. As in the encoder, each sub-layer is wrapped in a residual connection followed by layer normalization. The self-attention sub-layer is also modified (masked) to prevent positions from attending to subsequent positions. Combined with offsetting the output embeddings by one position, this masking guarantees that the prediction for position 𝒊 can depend only on the known outputs at positions before 𝒊.

Implementation of a Transformer decoder block:
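The following is a minimal sketch of one decoder layer under the same kind of simplifying assumptions (single-head attention without learned projections, layer normalization without gain or bias, toy sizes, illustrative names). It chains masked self-attention, attention over the encoder outputs, and a position-wise FFN. It then checks the auto-regressive property by changing the last target position and comparing the earlier outputs.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention; with causal=True query i only sees keys j <= i."""
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    weights = []
    for i, row in enumerate(matmul(Q, K_t)):
        scaled = [s / math.sqrt(d_k) if (not causal or j <= i) else float("-inf")
                  for j, s in enumerate(row)]
        weights.append(softmax(scaled))
    return matmul(weights, V)

def layer_norm(x, eps=1e-6):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(X, Y):
    return [layer_norm([a + b for a, b in zip(x, y)]) for x, y in zip(X, Y)]

def ffn(X, W1, b1, W2, b2):
    H = [[max(0.0, h + b) for h, b in zip(row, b1)] for row in matmul(X, W1)]
    return [[o + b for o, b in zip(row, b2)] for row in matmul(H, W2)]

def decoder_layer(Y, Z, ffn_params):
    Y = add_and_norm(Y, attention(Y, Y, Y, causal=True))  # masked self-attention
    Y = add_and_norm(Y, attention(Y, Z, Z))               # attention over encoder outputs
    return add_and_norm(Y, ffn(Y, *ffn_params))           # position-wise FFN

rng = random.Random(0)

def rand(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

d_model, d_ff = 4, 8
params = (rand(d_model, d_ff), [0.0] * d_ff, rand(d_ff, d_model), [0.0] * d_model)
Z = rand(5, d_model)   # stand-in for the encoder outputs (5 source positions)
Y = rand(3, d_model)   # shifted target embeddings (3 target positions)
out = decoder_layer(Y, Z, params)

# Changing only the last target position must leave the earlier outputs untouched.
Y_changed = Y[:2] + [[9.0, -9.0, 9.0, -9.0]]
out_changed = decoder_layer(Y_changed, Z, params)
```

The first two rows of `out` and `out_changed` coincide, because masked positions receive zero attention weight and the remaining sub-layers act on each position separately.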

## 3.2. Modules in the transformer

Next, I will discuss the elemental components that comprise the original transformer architecture.

- Attention modules
- Position-wise feed-forward networks
- Residual Connection and Normalization
- Positional encoding

## 3.3. Attention modules

The Transformer combines the Query-Key-Value (QKV) concept from information retrieval with attention mechanisms, in two forms:

- Scaled dot-product attention
- Multi-head attention

## 3.3.1. Scaled dot-product attention

The scaled dot-product attention is formulated as:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{D_k}}\right) V = A V \tag{1}
$$

where 𝑸 ∈ ℝ^(𝑵×𝐷𝑘), 𝑲 ∈ ℝ^(𝑴×𝐷𝑘), and 𝑽 ∈ ℝ^(𝑴×𝐷𝑣) are representation matrices. 𝑵 is the number of queries and 𝑴 the number of keys (or values); 𝐷𝑘 is the dimension of the queries and keys, and 𝐷𝑣 the dimension of the values. The matrix 𝑨 in eq. 1 is usually called the attention matrix. The authors chose dot-product attention over additive attention, which computes the compatibility function with a feed-forward network with a single hidden layer, because it is faster and more space-efficient in practice thanks to highly optimized matrix multiplication routines. Nonetheless, the dot product has a substantial drawback for large values of 𝐷𝑘: the dot products grow large in magnitude and push the softmax function into regions where its gradients are extremely small. To counteract this vanishing-gradient issue, the dot products of the queries and keys are divided by the square root of 𝐷𝑘, which is why it is called scaled dot-product attention.

Implementation of a dot-product attention block:
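A minimal, dependency-free sketch of scaled dot-product attention: the scores 𝑸𝑲ᵀ are divided by √𝐷𝑘, a row-wise softmax turns them into the attention matrix 𝑨, and the output is 𝑨𝑽. The nested-list representation, the function names, and the toy inputs are assumptions chosen for illustration.

```python
import math

def matmul(A, B):
    """Matrix product of nested lists: (n x k) times (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q: N x Dk, K: M x Dk, V: M x Dv. Returns (output N x Dv, attention matrix N x M)."""
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_t)                                        # N x M
    A = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(A, V), A

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # N = 3 queries, Dk = 2
K = [[1.0, 0.0], [0.0, 1.0]]               # M = 2 keys
V = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]     # M = 2 values, Dv = 3
out, A = scaled_dot_product_attention(Q, K, V)
```

Each row of `A` sums to 1. The third query matches both keys equally, so its output is the average of the two value rows.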

## 3.3.2. Multi-head attention

Instead of a single attention function, the Transformer uses multiple attention heads. It linearly projects the original 𝐷𝑚-dimensional queries, keys, and values (𝐷𝑚 = d_model) to 𝐷𝑘, 𝐷𝑘, and 𝐷𝑣 dimensions, respectively, *h* times with different learned linear projections. The attention function (eq. 1) is then computed on each of these projections in parallel, yielding *h* outputs of dimension 𝐷𝑣. The model concatenates them and projects the result back to a 𝐷𝑚-dimensional representation:

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O} \tag{2}
$$

where

$$
\mathrm{head}_i = \mathrm{Attention}\!\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right)
$$

The projections are the matrices 𝑾𝑸ᵢ ∈ ℝ^(𝐷𝑚×𝐷𝑘), 𝑾𝑲ᵢ ∈ ℝ^(𝐷𝑚×𝐷𝑘), 𝑾𝑽ᵢ ∈ ℝ^(𝐷𝑚×𝐷𝑣), and 𝑾𝒐 ∈ ℝ^(h𝐷𝑣×𝐷𝑚).

This process enables the Transformer to jointly attend to different representation subspaces and positions. To make it more tangible, for a specific adjective, one head might capture the intensity of the adjective, while another one might attend to its negativity and positivity.

Implementation of multi-head attention:
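A minimal sketch of multi-head attention in plain Python: each head projects the queries, keys, and values with its own matrices, runs scaled dot-product attention, and the concatenated head outputs are projected by an output matrix. The random weights, the toy sizes (model size 8, two heads, per-head size 4), and all names are assumptions for illustration; biases are omitted.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on nested lists."""
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(Q, K_t)]
    return matmul(weights, V)

def multi_head_attention(Q, K, V, heads, W_o):
    """heads: list of (W_q, W_k, W_v) projection triples, one per head."""
    outputs = [attention(matmul(Q, W_q), matmul(K, W_k), matmul(V, W_v))
               for W_q, W_k, W_v in heads]
    # Concatenate the h head outputs along the feature axis: N x (h * d_v).
    concat = [[v for head in outputs for v in head[i]] for i in range(len(Q))]
    return matmul(concat, W_o)

rng = random.Random(0)

def rand(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

d_model, h = 8, 2             # model size and number of heads
d_k = d_v = d_model // h      # per-head query/key and value sizes
heads = [(rand(d_model, d_k), rand(d_model, d_k), rand(d_model, d_v)) for _ in range(h)]
W_o = rand(h * d_v, d_model)

X = rand(5, d_model)                              # 5 positions
out = multi_head_attention(X, X, X, heads, W_o)   # self-attention, 5 x d_model
```

Setting the per-head size to `d_model // h`, as the original paper does, keeps the total cost close to that of a single head with full dimensionality.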

As can be seen, multi-head attention has three hyperparameters that determine the tensor dimensions:

- The number of attention heads
- Model size (embedding size): the length of the embedding vector
- Query, key, and value size: the output sizes of the linear layers that produce the query, key, and value matrices

## 3.4. Attention variants in the Transformer

The original Transformer paper uses attention in three different ways, which differ in how the queries, keys, and values are fed into the attention function.

- Self-attention
- Masked Self-attention (autoregressive or causal attention)
- Cross-attention

## 3.4.1. Self-attention

All keys, queries, and values come from the same sequence. In the Transformer encoder, this sequence is the output of the previous encoder layer, which allows each position in the encoder to attend to all positions of that previous layer simultaneously, i.e. 𝑸 = 𝑲 = 𝑽 = 𝑿 (previous encoder outputs).

## 3.4.2. Masked Self-attention (autoregressive or causal attention)

Unlike in the encoder, in the self-attention of the decoder each query is restricted to the key-value pairs at its own and preceding positions in order to maintain the auto-regressive property. This can be implemented by masking the invalid positions, i.e. setting the corresponding unnormalized attention scores (before the softmax) to negative infinity: 𝑨ᵢⱼ = −∞ if 𝒊 < 𝒋.
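The masking step can be sketched in a few lines: scores with 𝒊 < 𝒋 are replaced by −∞ before the softmax, so the corresponding attention weights become exactly zero. The score values and function names below are made up for illustration.

```python
import math

def causal_mask(scores):
    """Keep scores[i][j] for j <= i and replace the rest with -inf."""
    return [[s if j <= i else float("-inf") for j, s in enumerate(row)]
            for i, row in enumerate(scores)]

def softmax(row):
    m = max(row)                      # finite, because the diagonal is never masked
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

scores = [[0.5, 1.0, -0.3],
          [0.2, 0.1, 0.9],
          [1.5, -0.7, 0.4]]
A = [softmax(row) for row in causal_mask(scores)]
```

The first query can only attend to itself, so its row of `A` is `[1.0, 0.0, 0.0]`. The last query is unmasked and attends to all three positions.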

## 3.4.3. Cross-attention

This type of attention obtains its queries from the previous decoder layer, whereas the keys and values come from the encoder outputs. It is essentially the encoder-decoder attention used in sequence-to-sequence models. In other words, cross-attention combines two different embedding sequences of the same embedding dimension, deriving its queries from one sequence and its keys and values from the other. Assume *S1* and *S2* are two embedding sequences: cross-attention obtains its keys and values from *S1* and its queries from *S2*, then calculates the attention scores and produces a result sequence with the length of *S2*. In the case of the Transformer, the keys and values are derived from the encoder and the queries from the outputs of the previous decoder layer.

It is worth mentioning that the two input embedding sequences can come from different modalities (e.g. text, image, or audio).
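The difference between self-attention and cross-attention lies only in which sequences supply the queries, keys, and values. A minimal sketch with two toy sequences follows; the values of *S1* and *S2* and the helper names are assumptions, and the learned projections are omitted.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on nested lists."""
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(Q, K_t)]
    return matmul(weights, V)

S1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]   # 4 positions: keys and values
S2 = [[1.0, 2.0], [-1.0, 0.0]]                           # 2 positions: queries

cross = attention(S2, S1, S1)      # cross-attention: one output row per query in S2
self_out = attention(S1, S1, S1)   # self-attention: Q = K = V = S1
```

`cross` has the length of *S2*, while each of its rows is a weighted average of the rows of *S1*.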

## 3.5. Position-wise FFN

In every encoder and decoder layer, the attention sub-layers are followed by a position-wise fully connected feed-forward network. It is applied to each position separately and identically; its parameters, however, differ from layer to layer. It consists of two linear layers with a ReLU activation function in between, which is equivalent to two convolutions with kernel size 1:

$$
\mathrm{FFN}(x) = \mathrm{ReLU}(x W_1 + b_1)\, W_2 + b_2 \tag{3}
$$

where *x* is the previous layer's output, and 𝑾₁ ∈ ℝ^(𝐷_model × 𝐷𝑓), 𝑾₂ ∈ ℝ^(𝐷𝑓 × 𝐷_model), 𝒃₁ ∈ ℝ^𝐷𝑓, 𝒃₂ ∈ ℝ^𝐷_model are trainable parameters. The inner dimension 𝐷𝑓 is generally set to be larger than 𝐷_model (in the case of the Transformer, 𝐷_model = 512 and 𝐷𝑓 = 2048).

Implementation of position-wise FFN:
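A minimal sketch of the position-wise FFN, i.e. two linear maps with a ReLU in between, applied to every row (position) of the input. The small hand-picked weights (model size 2, inner size 3) and the function names are assumptions for illustration.

```python
def matmul(A, B):
    """Matrix product of nested lists: (n x k) times (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def position_wise_ffn(X, W1, b1, W2, b2):
    """FFN(x) = ReLU(x W1 + b1) W2 + b2, applied to every position of X."""
    hidden = [[max(0.0, h + b) for h, b in zip(row, b1)] for row in matmul(X, W1)]
    return [[o + b for o, b in zip(row, b2)] for row in matmul(hidden, W2)]

W1 = [[1.0, -1.0, 0.5],
      [0.0, 2.0, -1.0]]        # D_model x D_f = 2 x 3
b1 = [0.0, 0.0, 0.5]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]              # D_f x D_model = 3 x 2
b2 = [0.1, -0.1]

X = [[1.0, 2.0], [-1.0, 0.0]]  # two positions
out = position_wise_ffn(X, W1, b1, W2, b2)

# Positions do not interact, so processing one row at a time gives the same result.
row_by_row = [position_wise_ffn([x], W1, b1, W2, b2)[0] for x in X]
```

For these inputs the hidden activations are `[1, 3, 0]` and `[0, 1, 0]`, so `out` is approximately `[[1.1, 2.9], [0.1, 0.9]]`.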

## 3.6. Residual connection and normalization

Wrapping each module in a residual connection makes deeper architectures trainable by mitigating vanishing and exploding gradients. The Transformer therefore employs a residual connection around each module, followed by layer normalization. For an encoder layer, this can be formulated as follows:

- 𝒙′ = LayerNorm(SelfAttention(𝑿) + 𝑿)
- 𝒙 = LayerNorm(FFN(𝒙′) + 𝒙′)

Implementation of residual connection and normalization:
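A minimal sketch of the "add and normalize" step, LayerNorm(Sublayer(𝑿) + 𝑿). Layer normalization standardizes each position over its feature dimension and then applies a gain and a bias. The `gamma` and `beta` names, the toy `halve` sub-layer, and the input values are assumptions for illustration.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-6):
    """Standardize one position over its features, then scale by gamma and shift by beta."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def residual_block(X, sublayer, gamma, beta):
    """LayerNorm(Sublayer(X) + X), applied position by position."""
    return [layer_norm([s + x for s, x in zip(s_row, x_row)], gamma, beta)
            for s_row, x_row in zip(sublayer(X), X)]

def halve(X):
    """Toy stand-in for an attention or FFN sub-layer."""
    return [[0.5 * v for v in row] for row in X]

X = [[1.0, 2.0, 3.0, 4.0],
     [2.0, 0.0, -2.0, 0.0]]
gamma, beta = [1.0] * 4, [0.0] * 4     # identity gain and zero bias
out = residual_block(X, halve, gamma, beta)
```

With identity gain and zero bias, every output row has mean close to 0 and variance close to 1, whatever the scale of the sub-layer output.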

## 3.7. Positional encoding

The authors of the Transformer used an interesting idea to inject a sense of ordering into the input tokens, since the model has no recurrence or convolution. Absolute or relative positional information can be used to convey the order of the inputs, and the encoding can be either learned or fixed. Because the positional encodings are added to the input embeddings, both must have the same dimensions. The sum is fed into the bottoms of the encoder and decoder stacks. Vaswani et al. [3] used fixed positional encodings based on sine and cosine functions; they also experimented with learned positional embeddings [4] and found that the two versions produced nearly identical results.

Let 𝑿 be the input representation that contains *n* tokens of *d*-dimensional embeddings. The positional encoding produces 𝑿 + 𝑷, where 𝑷 is a positional embedding matrix of the same size. The element in the 𝒊-th row and the (2𝒋)-th or (2𝒋+1)-th column is:

$$
p_{i,\,2j} = \sin\!\left(\frac{i}{10000^{2j/d}}\right) \tag{4}
$$

and

$$
p_{i,\,2j+1} = \cos\!\left(\frac{i}{10000^{2j/d}}\right) \tag{5}
$$

In the positional embedding matrix P, the rows represent the tokens’ positions in the sequence and the columns denote the different positional encoding dimensions.
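A minimal sketch that builds the matrix 𝑷 for the fixed sinusoidal encoding and adds it to a toy input. Even columns hold sin(𝒊 / 10000^(2𝒋/d)) and odd columns hold the matching cosine. The function name, the assumption that *d* is even, and the constant toy embeddings are illustrative choices.

```python
import math

def positional_encoding(n, d):
    """Return the n x d sinusoidal positional encoding matrix P (d assumed even)."""
    P = [[0.0] * d for _ in range(n)]
    for i in range(n):                # row i: position in the sequence
        for j in range(d // 2):       # column pair (2j, 2j + 1) shares one frequency
            angle = i / (10000 ** (2 * j / d))
            P[i][2 * j] = math.sin(angle)
            P[i][2 * j + 1] = math.cos(angle)
    return P

n, d = 3, 4
P = positional_encoding(n, d)

X = [[0.1] * d for _ in range(n)]     # toy input embeddings
X_pos = [[x + p for x, p in zip(x_row, p_row)] for x_row, p_row in zip(X, P)]
```

Position 0 is encoded as alternating zeros and ones (sin 0 and cos 0), and later column pairs oscillate more slowly as 𝒋 grows.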

I have depicted the differences between 4 columns in the matrix 𝑷 in the following visualization. Notice the distinct frequencies for different columns.

## 3.7.1. Absolute positional information

In this type of positional encoding, the frequency decreases along the encoding dimension. By way of example, consider the following binary encodings: the least significant bits (right side) flip most frequently as the number increases, while more significant bits flip less often, i.e. the most significant bit is the most stable.

0 **->** 000

1 **->** 001

2 **->** 010

3 **->** 011

4 **->** 100

5 **->** 101

6 **->** 110

7 **->** 111
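The flip frequencies in the table above can be counted directly. The short sketch below regenerates the 3-bit codes and counts, for each bit position, how often the bit changes between consecutive numbers; the function name is illustrative.

```python
def bit_flip_counts(n_bits, count):
    """Return the binary codes of 0..count-1 and the number of flips per bit position."""
    codes = [format(i, f"0{n_bits}b") for i in range(count)]
    flips = [sum(a[k] != b[k] for a, b in zip(codes, codes[1:]))
             for k in range(n_bits)]
    return codes, flips

codes, flips = bit_flip_counts(3, 8)
# flips is ordered from the most significant bit (left) to the least significant (right)
```

Here `flips` is `[1, 3, 7]`: the leftmost bit changes once, the middle bit three times, and the rightmost bit at every step. The sinusoidal encoding follows the same pattern with continuous values.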

## 3.7.2. Relative positional information

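Besides absolute positional information, the above positional encoding also makes it easy for a model to learn to attend by relative positions. For any fixed offset 𝛿, the positional encoding at position 𝒊+𝛿 can be obtained by a linear projection of the encoding at position 𝒊. Let *Ψ* = 1/(10000^(2𝒋/d)); any pair of values from eq. 4 and eq. 5 can then be linearly projected to the pair at position 𝒊+𝛿 for any fixed offset 𝛿:

$$
\begin{bmatrix} \cos(\delta\Psi) & \sin(\delta\Psi) \\ -\sin(\delta\Psi) & \cos(\delta\Psi) \end{bmatrix}
\begin{bmatrix} p_{i,\,2j} \\ p_{i,\,2j+1} \end{bmatrix}
=
\begin{bmatrix} \sin\big((i+\delta)\Psi\big) \\ \cos\big((i+\delta)\Psi\big) \end{bmatrix}
=
\begin{bmatrix} p_{i+\delta,\,2j} \\ p_{i+\delta,\,2j+1} \end{bmatrix}
$$

The 2×2 projection matrix depends only on the offset 𝛿 and the column index 𝒋, not on the position 𝒊.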
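This linear-projection property can be checked numerically. The sketch below assumes the sinusoidal encoding (sine in column 2𝒋, cosine in column 2𝒋+1) and applies the 2×2 matrix with rows (cos 𝛿Ψ, sin 𝛿Ψ) and (−sin 𝛿Ψ, cos 𝛿Ψ) to the pair at position 𝒊. The function names and the chosen indices are illustrative.

```python
import math

def pe_pair(i, j, d):
    """Sinusoidal encoding pair (column 2j, column 2j + 1) at position i."""
    psi = 1.0 / (10000 ** (2 * j / d))
    return math.sin(i * psi), math.cos(i * psi)

def shift(pair, delta, j, d):
    """Project the pair at position i to position i + delta with a 2 x 2 matrix."""
    psi = 1.0 / (10000 ** (2 * j / d))
    s, c = pair
    cos_d, sin_d = math.cos(delta * psi), math.sin(delta * psi)
    return cos_d * s + sin_d * c, -sin_d * s + cos_d * c

shifted = shift(pe_pair(3, 1, 8), delta=5, j=1, d=8)   # start at i = 3, move by 5
target = pe_pair(8, 1, 8)                              # encoding at position 8
```

`shifted` and `target` agree up to floating-point error, and the matrix used in `shift` never looks at the starting position.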

[1] J. R. Anderson. Cognitive Psychology and Its Implications. Worth Publishers, 2005.

[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. ICLR, 2015.

[3] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.

[4] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122v2, 2017.

[5] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

[6] Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc V. Le. Massive exploration of neural machine translation architectures. CoRR, abs/1703.03906, 2017.

[7] Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099v2, 2017.