Learn about Architecture, Training details, and Results of DALL-E.

In this article, I will explain in detail the architecture, results, and optimization techniques of OpenAI’s DALL-E model.

DALL-E, a groundbreaking model introduced by OpenAI in 2021, revolutionized generative AI by bridging the gap between textual descriptions and image synthesis. The model, whose name fuses Salvador Dalí and Pixar’s WALL-E, harnesses the power of transformers to generate intricate and imaginative images from textual prompts. In this article, we delve into DALL-E’s inner workings, its training details, and the dataset that fueled its creative prowess. DALL-E uses a transformer to model text and image tokens as a single stream of data. Two problems arise with this approach:

- Pixels as image tokens take up too much memory
- Likelihood objectives prioritize short-range dependencies between pixels

The solution to both problems is a two-stage training process:

- Learning the Visual Codebook
- Learning the Prior Distribution

The first stage of training involves learning the visual codebook vectors. DALL-E uses a discrete variational autoencoder (dVAE) for this step. The dVAE is a variant of the traditional variational autoencoder (VAE) that operates in a discrete latent space. It is similar to VQ-VAE, but it assigns inputs to codebook entries through a relaxed categorical distribution rather than a hard nearest-neighbor lookup. This architecture is particularly suited to data with categorical or symbolic attributes.

The encoder of the dVAE maps the input data to discrete codes in the latent space. Its architecture transforms the input into probabilities over discrete categories, employing activation functions such as softmax (or its Gumbel-Softmax relaxation) so that the encoded probabilities reflect the distribution of discrete attributes present in the data. The encoder begins with an input layer that receives the raw data, which is then passed through a series of hidden layers whose activations model the data distribution in the latent space. The final layer of the encoder produces the discrete code probabilities: each neuron in this layer corresponds to a distinct category in the latent space, and its activation represents the likelihood that the input belongs to that category.
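To make the encoder's final step concrete, here is a minimal numpy sketch of mapping an input to a probability distribution over codebook categories. The single linear layer and all weights are hypothetical stand-ins; the real dVAE encoder is a deep convolutional network, and only the codebook size (8192) matches DALL-E.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Toy stand-in for the dVAE encoder: one linear layer mapping a
# flattened image patch to logits over K discrete codebook categories.
K = 8192                                 # codebook size used by DALL-E
patch = rng.standard_normal(64)          # hypothetical flattened patch
W = rng.standard_normal((K, 64)) * 0.01  # hypothetical encoder weights

logits = W @ patch
probs = softmax(logits)                  # probabilities over categories
code = int(np.argmax(probs))             # most likely codebook index
```

Each entry of `probs` plays the role of one output neuron's activation: the likelihood of the patch belonging to that codebook category.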

The decoder network in the dVAE is responsible for reconstructing the original data from the generated discrete codes. Its architecture mirrors the encoder’s structure but operates in reverse, aiming to transform discrete probabilities back into data points. The discrete latent space is the crux of the dVAE’s architecture. It comprises distinct categories, each corresponding to a meaningful attribute or feature present in the data. The discrete codes generated by the encoder serve as indices that point to specific categories in this space. This representation enables the model to capture categorical and symbolic information, making it interpretable and valuable for various applications.
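The decoder's use of the discrete latent space as an index can be sketched in a few lines. Everything here is a toy stand-in: the codebook embeddings and the single linear reconstruction layer are hypothetical, whereas the real decoder is a deep convolutional network.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the dVAE decoder: the discrete code produced by the
# encoder indexes into a learned codebook embedding table, and a linear
# map then reconstructs the (flattened) patch.
K, D, P = 8192, 16, 64                   # categories, embedding dim, patch size
codebook = rng.standard_normal((K, D))   # hypothetical codebook embeddings
W_dec = rng.standard_normal((P, D))      # hypothetical decoder weights

code = 1234                              # a discrete code from the encoder
embedding = codebook[code]               # index into the discrete latent space
reconstruction = W_dec @ embedding       # back to the data space
```

The key point is that `code` is just an integer index: all the categorical information lives in which row of the codebook it selects.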

DALL-E uses the Gumbel-Softmax relaxation technique, which addresses the challenge of sampling discrete variables in a continuous and differentiable manner. The relaxation is based on two main components: the Gumbel distribution and the softmax function.

The Gumbel distribution is a continuous probability distribution often used to model the randomness of discrete choices. It has two parameters, a location μ and a scale β, which control the position and spread of the distribution. The Gumbel distribution is known for its “extreme value” property, and samples from it can be transformed to simulate the sampling of discrete values. The softmax function transforms a vector of real values into a probability distribution: it ensures that the outputs are non-negative and sum to 1, so they represent the probabilities of different categories.
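Sampling from the Gumbel distribution is simple via the inverse transform: if u is uniform on (0, 1), then μ − β·log(−log(u)) is Gumbel(μ, β). A quick numpy check (parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_gumbel(shape, mu=0.0, beta=1.0):
    # Inverse-transform sampling: if u ~ Uniform(0, 1), then
    # mu - beta * log(-log(u)) follows a Gumbel(mu, beta) distribution.
    u = rng.uniform(low=1e-12, high=1.0, size=shape)
    return mu - beta * np.log(-np.log(u))

g = sample_gumbel((200_000,))
# The mean of Gumbel(0, 1) is the Euler-Mascheroni constant, ~0.5772.
mean = g.mean()
```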

The Gumbel-Softmax relaxation technique involves two steps. In the first step, Gumbel-distributed random variables are sampled for each category. These random variables introduce a form of randomness that simulates the discrete nature of the variable. In the second step, the sampled Gumbel variables are transformed using the softmax function. This transforms the Gumbel-distributed values into a probability distribution over the categories.
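The two steps above can be sketched directly in numpy: add Gumbel(0, 1) noise to the logits, then apply softmax. The example logits are hypothetical; τ is the temperature discussed next.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0):
    # Step 1: sample Gumbel(0, 1) noise for each category and add it
    # to that category's logit.
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    noisy = (logits + gumbel) / tau
    # Step 2: softmax turns the noisy logits into a probability vector,
    # a continuous, differentiable relaxation of a one-hot sample.
    z = noisy - noisy.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, 0.1])   # hypothetical category logits
sample = gumbel_softmax(logits, tau=1.0)
```

Because both steps are differentiable in the logits, gradients can flow through the sampling operation during training.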

The temperature parameter (often denoted as τ) in the Gumbel-Softmax relaxation affects the degree of “softness” in the relaxation. A higher temperature results in softer samples that are closer to a uniform distribution over the categories, introducing more randomness. Conversely, a lower temperature produces harder samples that approach one-hot categorical samples, leading to less randomness and more deterministic selections. The temperature annealing schedule involves adjusting the temperature parameter over the course of training. Typically, the temperature starts high and is gradually reduced during training.
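The effect of τ is easy to see by applying a tempered softmax to fixed logits (no Gumbel noise, to isolate the temperature). The annealing schedule below is a hypothetical exponential decay, not the paper's exact schedule; the endpoint of 1/16 matches the value DALL-E anneals toward.

```python
import numpy as np

def tempered_softmax(logits, tau):
    # Softmax of logits / tau: large tau flattens, small tau sharpens.
    z = logits / tau
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, 0.1])
soft = tempered_softmax(logits, tau=10.0)   # high tau: near-uniform
hard = tempered_softmax(logits, tau=0.1)    # low tau: near one-hot

def annealed_tau(step, tau_start=1.0, tau_end=1.0 / 16, decay=1e-4):
    # Hypothetical exponential annealing schedule: start soft, end hard.
    return max(tau_end, tau_start * np.exp(-decay * step))
```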

This step uses a Transformer architecture. To learn the prior distribution, DALL-E employs a dataset of text–image pairs. During training, the transformer takes the caption of the image as input and learns to output codebook vectors in an autoregressive fashion: it predicts a distribution over the next token, a token is sampled, and the process repeats until 1024 image tokens have been produced. These tokens, arranged as a 32 × 32 grid, are then fed into the decoder trained during the first stage, which generates a new image.
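The autoregressive sampling loop can be sketched as follows. The `next_token_logits` function is a hypothetical stand-in for the transformer; only the vocabulary size and the 32 × 32 grid match DALL-E.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 8192       # image-token vocabulary size
GRID = 32          # 32 x 32 = 1024 image tokens

def next_token_logits(tokens):
    # Hypothetical stand-in for the transformer: in DALL-E this would
    # attend over the caption's text tokens and all previously
    # generated image tokens; here we just return random logits.
    return rng.standard_normal(VOCAB)

tokens = []
for _ in range(GRID * GRID):
    logits = next_token_logits(tokens)
    z = logits - logits.max()
    probs = np.exp(z) / np.exp(z).sum()
    tokens.append(int(rng.choice(VOCAB, p=probs)))  # sample next token

grid = np.array(tokens).reshape(GRID, GRID)  # handed to the dVAE decoder
```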

The transformer in DALL-E uses BPE encoding to convert lowercased captions into up to 256 text tokens, with a vocabulary size of 16,384. For image tokens, a 32 × 32 grid is used, with a vocabulary size of 8192. Training minimizes a normalized cross-entropy loss in which the image loss is weighted by ⅞ and the text loss by ⅛. With 64 attention layers of 62 attention heads each, the transformer architecture is built to capture intricate relationships within the data. This robust model comprises a total of 12 billion parameters, enabling its impressive generative capabilities.
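The ⅞/⅛ loss weighting amounts to a simple weighted sum of two cross-entropy terms. The per-token logits below are random placeholders; only the vocabulary sizes and the weights come from the paper.

```python
import numpy as np

def cross_entropy(logits, target):
    # Cross-entropy of a single target index under softmax(logits).
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

rng = np.random.default_rng(0)

# Hypothetical per-token logits; DALL-E's text vocabulary is 16,384
# tokens and its image vocabulary is 8192 tokens.
text_loss = cross_entropy(rng.standard_normal(16_384), target=7)
image_loss = cross_entropy(rng.standard_normal(8_192), target=3)

# The paper weights the image loss by 7/8 and the text loss by 1/8.
total_loss = (7 / 8) * image_loss + (1 / 8) * text_loss
```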

The objective function that DALL-E optimizes is the maximization of an evidence lower bound on the joint likelihood of the model distribution over images, captions, and latent tokens.

In this objective, x refers to images, y to captions, and z to the latent image tokens. The joint distribution over images, captions, and latent tokens factorizes into the distribution over images given captions and latent tokens, and the joint distribution over captions and latent tokens. This factorization yields the evidence lower bound that is maximized during training.
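Written out in the paper's notation, with q_φ the dVAE encoder distribution, p_θ the dVAE decoder, p_ψ the transformer prior, and β a KL weight larger than 1, the bound reads:

```latex
\ln p_{\theta,\psi}(x, y) \;\ge\;
\mathbb{E}_{z \sim q_{\phi}(z \mid x)}
\Big( \ln p_{\theta}(x \mid y, z)
      \;-\; \beta \, D_{\mathrm{KL}}\big(q_{\phi}(y, z \mid x),\, p_{\psi}(y, z)\big) \Big)
```

Stage one maximizes the bound with respect to φ and θ (the dVAE), and stage two maximizes it with respect to ψ (the transformer prior).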

Now, we will discuss a few techniques which are used by the DALL-E model to make the process more efficient and effective.

- Reduce Memory: Parameters, optimizer moments, and activations are stored in 16-bit precision, which lets training run without exhausting GPU memory. Training involves a vast number of arithmetic operations, and the 16-bit floating-point format also speeds these up, making training faster and more efficient. However, very small gradient values can underflow to zero in 16-bit formats; DALL-E uses per-resblock gradient scaling to keep these small values representable and the calculations accurate.
- Training Techniques: Gradients are synchronized across GPUs and machines using the reduce-scatter and all-reduce collective communication operations.
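The underflow problem and the scaling remedy can be demonstrated in a few lines of numpy. The gradient value and scale factor here are illustrative, and this global loss-scaling sketch is a simplification of the paper's per-resblock scheme.

```python
import numpy as np

# Gradients below the smallest fp16 subnormal (~6e-8) underflow to
# zero. Loss scaling multiplies the loss (and hence every gradient)
# by a large constant before the backward pass, then divides it back
# out in higher precision.
tiny_grad = 1e-8                        # a gradient too small for fp16

unscaled = np.float16(tiny_grad)        # underflows to 0.0
scale = 1024.0                          # hypothetical scale factor
scaled = np.float16(tiny_grad * scale)  # now representable in fp16
recovered = float(scaled) / scale       # unscale in full precision
```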

The training datasets used by DALL-E are Wikipedia images and YFCC100M++. The filters applied to clean the data include removing short captions, non-English captions, captions consisting only of dates, and images with extreme aspect ratios. The evaluation datasets are MS-COCO and CUB-200. Different samples generated from the DALL-E model are shown below.
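A filtering pass of this kind can be sketched as a simple predicate over caption–image pairs. The thresholds and the ASCII-ratio heuristic for non-English text are illustrative assumptions, not the paper's exact rules.

```python
import re

def keep_pair(caption, width, height):
    # Hypothetical re-implementation of the filters described above:
    # drop short captions, mostly non-ASCII captions (a crude proxy
    # for non-English text), date-like captions, and images with
    # extreme aspect ratios. Thresholds are illustrative.
    if len(caption.split()) < 2:
        return False
    if sum(c.isascii() for c in caption) / max(len(caption), 1) < 0.9:
        return False
    if re.fullmatch(r"[\d\s/.-]+", caption):      # e.g. "12/05/2014"
        return False
    aspect = max(width, height) / min(width, height)
    return aspect <= 2.0

keep_pair("a cat sitting on a windowsill", 640, 480)   # True
keep_pair("12/05/2014", 640, 480)                      # False
```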

The FID and Inception Score of DALL-E in comparison to AttnGAN, DM-GAN, and DF-GAN are shown below. DALL-E performs considerably better than these GANs on MS-COCO, although it fares worse on the specialized CUB dataset.

Hence we can conclude that large amounts of data and a large model can produce impressive results and zero-shot capabilities. Training such a large model is hard: to deal with memory constraints, the model has to be split across devices, and many low-level optimizations, described in the paper, are required to deal with bottlenecks.

In conclusion, this article explains the architectural details of OpenAI’s DALL-E, a model that later played a role in the development of DALL-E 2 and DALL-E 3. For implementation details, see GitHub.

**Thank you for reading!**