Explaining the popular GAN min-max game and the Total Loss of the model
Generative Adversarial Networks (GANs) have become very popular in the world of Artificial Intelligence, and especially within the computer vision field. With the introduction of the scientific article “Generative Adversarial Nets” by Ian J. Goodfellow et al., a powerful new strategy emerged for developing generative models, and many studies and research projects have followed, leading to applications we see nowadays such as DALL-E 2 or GLIDE (both were developed using diffusion models, a more recent paradigm for generative modeling; nevertheless, GANs remain widely used and capable of solving multiple problems).
But since everything starts at the beginning, in this article I would like to explain the meaning and reasoning behind the original GAN optimization function that everyone has heard about, and how it differs from the Total Loss function of the model (note that many other variations have been created since, depending on the model’s purpose).
Generative Adversarial Networks are a class of deep learning frameworks that have been designed as generative models with the purpose of generating new complex data (outputs) such as images or audio that have never existed before.
To train a GAN we only need a set of data (images, audio…) that we want to replicate or imitate, and the network will figure out how to create new data that looks like the examples in our dataset.
In other words, we give the model some examples as inputs to “get inspired” by, and total freedom to generate new outputs.
This training process, where we feed the network the inputs X without accompanying them with any labels (or desired outputs), is called Unsupervised Learning.
The GAN architecture is formed by two networks that compete with each other (hence the name “Adversarial Networks”). Commonly we refer to these networks as the Generator (G) and the Discriminator (D). The Generator’s task is to learn the function that generates data starting from random noise, while the Discriminator has to decide whether the generated data is “real” or not (here “real” means that the data belongs to the examples of our dataset), so that we can measure the performance of the model and adjust its parameters. Both networks are trained and learn at the same time.
There are many different variations and modifications of GAN training. However, if we follow the original paper, the vanilla GAN training loop is the following:
for number of training iterations do:
- Generate m examples (images, audio…) from the sample distribution (i.e. random noise z) that we will denote: G(z)
- Take m examples from the training dataset: x
- Mix all the examples (generated and from the training dataset) and feed them to the Discriminator D. The output of D will be between 0 and 1, where 0 means the example is fake and 1 means it is real
- Measure the Discriminator Loss function and adjust its parameters
- Generate new m examples G’(z)
- Feed G’(z) to the Discriminator. Measure the Generator Loss function and adjust the parameters.
Note. More recent approaches to GAN training measure Generator Loss and adjust its parameters along with the Discriminator in the 4th step, skipping 5 and 6 and saving time and computer resources.
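To make the steps above concrete, here is a minimal numerical sketch of the loss computations inside one training iteration. The “networks” are hypothetical stand-in functions (a fixed sigmoid score for D, a fixed linear map for G), not trained models, and the parameter-adjustment steps are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def discriminator(x):
    # Hypothetical "Discriminator": a sigmoid of the input score,
    # standing in for a trained network. Outputs a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def generator(z):
    # Hypothetical "Generator": a fixed linear map from noise to data space.
    return 0.5 * z + 1.0

m = 4
z = rng.normal(size=m)               # sample random noise z...
fake = generator(z)                  # ...and generate m fake examples G(z)
real = rng.normal(loc=2.0, size=m)   # take m examples from the "dataset"

# Discriminator Loss on the mixed batch (Binary Cross-Entropy form):
d_real = discriminator(real)
d_fake = discriminator(fake)
d_loss = -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

# Generator Loss on freshly generated examples (non-saturating form):
fake2 = generator(rng.normal(size=m))
g_loss = -np.mean(np.log(discriminator(fake2)))

print(d_loss, g_loss)  # both are non-negative scalars
```

In a real implementation each loss would be followed by a gradient step on the corresponding network’s parameters.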
If you read the original GAN paper, you will get to the following expression, which defines the optimization function of the model:

min_G max_D V(G, D) = E(log(D(x))) + E(log(1 − D(G(z))))

where E denotes the expected value, taken over the data distribution in the first term and over the noise distribution in the second.
Note. The above formula is the Optimization function, i.e. the expression that both networks (Generator and Discriminator) try to optimize. In this case, G wants to minimize it while D aims to maximize it. However, this is not the total Loss function of the model, which tells us about its performance.
To understand the min-max game we have to think about what measures the performance of our model, so that the networks can optimize it. As the GAN architecture is formed by two networks trained simultaneously, we have to compute two metrics: the Generator Loss and the Discriminator Loss.
Discriminator Loss Function
According to the training loop described in the paper, the Discriminator receives a batch of m examples from the dataset and another m examples from the Generator, and outputs a number ∈ [0,1] that is the probability of the input data belonging to the dataset distribution (i.e. the probability of the data being “real”).
We already know which examples are real and which ones are generated before feeding them to the Discriminator (the examples x coming from the dataset are real, and the Generator’s outputs G(z) are generated), and thus we can assign each a label: y = 0 (generated), y = 1 (real).
Now we can train the Discriminator as a common binary classifier using the Binary Cross-Entropy loss function:

Dloss = −(1/2m) ∑ [ y(k)·log(D(k)) + (1 − y(k))·log(1 − D(k)) ]

where the sum runs over the 2m examples k of the mixed batch.
However, since this is a binary classifier, the summation alternates terms:
– When input is real, label y = 1 → summation ∑ = log(D(k))
– When input is generated, label y = 0 → summation ∑ = log(1-D(k))
Thus, we can rewrite the expression in a simpler form:

Dloss = −(1/2m) [ ∑ log(D(x_i)) + ∑ log(1 − D(G(z_i))) ]

with the first sum running over the m real examples x_i and the second over the m noise samples z_i.
As we know, the Discriminator wants to minimize its loss, i.e. to find argmin Dloss. However, we can take the negative sign out of the formula: instead of minimizing the expression, the Discriminator must now maximize it:

max_D (1/2m) [ ∑ log(D(x_i)) + ∑ log(1 − D(G(z_i))) ]
Finally, we can operate on the terms: each average over m samples approximates an expected value, and constant factors do not change the location of the maximum:

(1/m) ∑ log(D(x_i)) ≈ E(log(D(x))),  (1/m) ∑ log(1 − D(G(z_i))) ≈ E(log(1 − D(G(z))))
And rewrite the expression:

max_D V(G, D), with V(G, D) = E(log(D(x))) + E(log(1 − D(G(z))))
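We can verify numerically that the label-based Binary Cross-Entropy and the split two-term form give the same value. The Discriminator outputs below are made-up toy probabilities, not the output of a real network:

```python
import numpy as np

# Discriminator outputs for m real and m generated examples (toy values)
d_real = np.array([0.9, 0.7, 0.8])   # D(x),    labels y = 1
d_fake = np.array([0.2, 0.4, 0.1])   # D(G(z)), labels y = 0

# Label-based Binary Cross-Entropy over the mixed batch of 2m examples
preds = np.concatenate([d_real, d_fake])
labels = np.concatenate([np.ones(3), np.zeros(3)])
bce = -np.mean(labels * np.log(preds) + (1 - labels) * np.log(1 - preds))

# Split form: the y = 1 terms keep log(D(k)), the y = 0 terms keep log(1 - D(k))
split = -0.5 * (np.mean(np.log(d_real)) + np.mean(np.log(1 - d_fake)))

print(abs(bce - split))  # ~0: the two expressions are identical
```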
On the other hand, the objective of the Generator is to fool the Discriminator. Therefore, the Generator must do the opposite of the Discriminator and find the minimum of V(G, D).
Now we can put both objectives (the Discriminator’s and the Generator’s) together and obtain the final expression:

min_G max_D V(G, D) = E(log(D(x))) + E(log(1 − D(G(z))))
(Tadaa!) We have finally obtained the Optimization Function. However, as I said before, this is not the Total Loss function, which tells us about the overall performance of our model. Before getting there, we first need to compute the Generator Loss:
Generator Loss Function
Looking back at the Optimization function, we see that the Generator only participates in the second term of the expression, E(log(1 − D(G(z)))), while the first one remains constant with respect to G. Therefore, the Generator Loss function that it tries to minimize is:

Gloss = E(log(1 − D(G(z))))
However, we haven’t finished yet. As explained in the original paper, “Early in learning, when G is poor, D can reject samples with high confidence because they are clearly different from the training data.” That is, in the early stages of training it is very easy for the Discriminator to tell real and generated images apart, since the Generator hasn’t learned yet. In this case, log(1 − D(G(z))) saturates, since D(G(z)) ∼ 0 and the gradient that reaches the Generator becomes very small.
In order to avoid this situation, the researchers propose the following: “Rather than training G to minimize log(1 − D(G(z))) we can train G to maximize log D(G(z))”.
This is the same as saying that instead of training the Generator to minimize the probability of the image being fake, it is going to maximize the probability of the image being real.
Essentially both optimization approaches are the same, as we can see in the graph:
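A quick numerical check of the saturation argument: when D(G(z)) ≈ 0, the derivative of log(1 − d) with respect to d is tiny, while the derivative of log(d) is huge. This is pure calculus on the two loss curves, no network involved:

```python
# Let d = D(G(z)) ~ 0: the Discriminator confidently rejects the sample
d = 1e-4

# d/dd log(1 - d) = -1 / (1 - d): almost no learning signal when d is near 0
grad_saturating = -1.0 / (1.0 - d)

# d/dd log(d) = 1 / d: a strong signal exactly where G needs it most
grad_nonsaturating = 1.0 / d

print(grad_saturating, grad_nonsaturating)  # ≈ -1.0 vs 10000.0
```

This is why the non-saturating objective gives the Generator useful gradients early in training.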
So for our purposes, the Generator loss function that we will use is:

Gloss = E(log(D(G(z)))), which the Generator tries to maximize.
Note. In practice, when coding the Generator loss function, the negative form of the above formula is commonly used: instead of maximizing the function, the goal is to minimize it, which facilitates the adjustment of parameters with libraries such as TensorFlow. This is also important for understanding the Total Loss function, which is the next section.
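In code, this negative form, minimizing −E(log D(G(z))), is exactly Binary Cross-Entropy computed against all-ones labels, which is why frameworks commonly express it through their BCE loss. A plain-NumPy sketch with made-up Discriminator outputs:

```python
import numpy as np

# Discriminator outputs on generated samples (toy values, not a real network)
d_fake = np.array([0.2, 0.4, 0.1])

# Negative form of the Modified Generator Loss: minimize -E[log D(G(z))]
g_loss = -np.mean(np.log(d_fake))

# The same value, written as BCE with target labels y = 1 for every sample
labels = np.ones_like(d_fake)
bce_against_ones = -np.mean(labels * np.log(d_fake)
                            + (1 - labels) * np.log(1 - d_fake))

print(abs(g_loss - bce_against_ones))  # ~0: identical losses
```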
In the article, the loss formulas of each component (Generator and Discriminator) have been described, as well as the optimization function of the model. But, how can we measure the overall performance of the model?
Just looking at the Optimization function isn’t a good measurement because, as we have already seen, the Optimization function is a modification of the Discriminator Loss function, and thus it doesn’t reflect the Generator’s performance (although the Generator Loss function derives from it, that function only accounts for the performance of the Discriminator).
On the other hand, one could think of adding both Loss functions (Discriminator and Generator) and, although this is a good intuition, we need to take into account some nuances:
1. Both individual loss functions must aim to be minimized or maximized.
Otherwise, the addition will reflect a higher or lower error than it should.
For instance, let’s take the Optimization Function, which D wants to maximize:

V(G, D) = E(log(D(x))) + E(log(1 − D(G(z))))
and the first Generator Loss function, which G aims to minimize:

Gloss = E(log(1 − D(G(z))))
When D is doing poorly, the value it tries to maximize is low, and when G is doing great, its loss is also low. Adding them yields a low overall value, suggesting that both networks are doing a great job, although we know one of them isn’t.
Moreover, if one of the losses aims to be minimized and the other one maximized, we wouldn’t know if a high error is good or bad.
Note. If we use Loss functions that aim to be maximized, it might sound counterintuitive to call it “Error”, since the higher the “Error” the better performance. However, we can also transform it using a logarithmic scale such that log(1+”Error”)
2. For building a Total Loss function, the individual Losses must be in the same range of values
Let’s now take as an example the first Discriminator Loss that we talked about (the Binary Cross-Entropy), written with expected values:

Dloss = −E(log(D(x))) − E(log(1 − D(G(z))))
And the previous Generator Loss function used in the last point:

Gloss = E(log(1 − D(G(z))))
Now both functions satisfy the condition of aiming to be minimized. However, the Discriminator Loss is in the range [0, +∞) while the Generator Loss outputs values in (−∞, 0]. Adding these two functions is the same as subtracting the Generator’s term from the Discriminator Loss, so we would be saying that the overall loss is the Discriminator Loss without the effect of the Generator (i.e. −E(log(D(x))), where E denotes the expected value), and that is not correct.
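We can see this cancellation numerically. With any toy probabilities (the values below are made up), the BCE Discriminator Loss is non-negative, this Generator Loss is non-positive, and their sum loses the Generator’s contribution entirely:

```python
import numpy as np

d_real = np.array([0.9, 0.6, 0.8])  # D(x) on real examples (toy values)
d_fake = np.array([0.3, 0.2, 0.5])  # D(G(z)) on generated examples

d_loss = -np.mean(np.log(d_real)) - np.mean(np.log(1 - d_fake))  # in [0, +inf)
g_loss = np.mean(np.log(1 - d_fake))                             # in (-inf, 0]

total = d_loss + g_loss
# The generated-sample terms cancel, leaving only -E[log D(x)]
leftover = -np.mean(np.log(d_real))
print(abs(total - leftover))  # ~0: the Generator's effect vanished
```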
Nevertheless, there is one more combination to try. What if we add the first Discriminator Loss and the negative form of the Modified Generator Loss?

Total Loss = −E(log(D(x))) − E(log(1 − D(G(z)))) − E(log(D(G(z))))
(hurray!) This is the GAN total Loss function. However, in case you don’t believe me, let’s check if it satisfies the properties.
✅ 1. We know that Dloss is intended to be minimized and that the negative form of Modified Generator Loss is also intended to be minimized.
✅ 2. Dloss outputs values in the range [0, +∞), and turns out that the negative Modified Generator Loss also maps values into that same range.
Thus, we are adding two errors of the same category and therefore computing the Total Loss function of our model.
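As a final numeric check that the Total Loss satisfies both properties, again with toy Discriminator outputs rather than a trained network:

```python
import numpy as np

d_real = np.array([0.9, 0.6, 0.8])  # D(x) on real examples (toy values)
d_fake = np.array([0.3, 0.2, 0.5])  # D(G(z)) on generated examples

d_loss = -np.mean(np.log(d_real)) - np.mean(np.log(1 - d_fake))
neg_mod_g_loss = -np.mean(np.log(d_fake))   # negative Modified Generator Loss

total_loss = d_loss + neg_mod_g_loss

# Property 1: both pieces aim to be minimized.
# Property 2: both live in the same range [0, +inf), and so does their sum.
print(d_loss >= 0, neg_mod_g_loss >= 0, total_loss >= 0)  # True True True
```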
Summarizing the main key points of this article:
- GAN Optimization function (also called min-max game) and the Total Loss of the model are different concepts:
min-max Optimization ≠ Total Loss
- The origin of the Optimization function comes from the Binary Cross-Entropy (which in turn is the Discriminator Loss) and from which also derives the Generator Loss function.
- The Generator Loss function is modified in practice so that the logarithm does not saturate, and this modification is also useful for computing the Total Loss function of the model.
- The Total Loss function = Dloss + Gloss. However, not all the formulas can be used and we need to take into account two key points:
– Both individual loss functions must aim to be minimized or maximized.
– The individual Losses must be in the same range of values
I hope you enjoyed the article and that it helped you. Please feel free to leave a comment; any feedback or corrections are welcome.