An introduction to information theory
The universe is overflowing with information. Everything we say, hear, think and see is an information. Everything wraps an information in it, that must follow a certain rule no matter the format. But sometimes we want to process these information, whether to compare it to another or to quantify how much information a signal has from any format. We need a measure, we need a metric.
In 1948, Claude Shannon founded what is known now as information theory. The goal was to transmit a signal from a sender to a receiver efficiently. The concept is simple but genius.
We will try to build this concept together, and reach the famous formula.
Let’s suppose that your friend has picked a city in a set of shuffled cities where the number of countries in each continent is equal and the number of cities in each country is equal. You should guess this city. Your friend will help you with a statement each time. We’ll try to quantify the information included in each statement.
First, he told you, “It’s a city”. This does not provide us with any information. We already know that we should guess a city. So information here should be zero. Next, he told you “it’s a city in Europe”. This provides us with information, but in reality we have 5 continents each equally likely, so it’s a useful information but not very helpful to guess the answer correctly. So this statement contains a low information. Finally, he told you “it’s the capital of France”. This provides us the exact information we are looking for. So it needs to have a big value.
Let’s suppose now that your friend have a die with 2 colors distributed as following: 5 faces are green and one is red. Since there is higher probability of having a green face, it would not be very surprising if you had it. On the other hand, it would be surprising if you picked a red face.
We have now, learned a natural idea about this information theory through these examples. So:
The more probable the event is the less informative is. And the more surprising the event is the more informative is.
Since we have the statement above, it’s tempting to just use the inverse of probability as the measure we want to reach. It’s a very good attempt, however we have at least one problem here. Imagine we have a certain event (p = 1), like the fact that you need to choose a city. We want the information to be 0. But the inverse of it’s probability is 1. So the simple solution here is to add a negative logarithmic function to keep the positive output and its monotony. This will make -log(1) = 0 which is what we need. Another thing why we use the log is that it’s additive. In fact, information of having a green face in the first roll and having red in the second roll is the sum of information of having the green face with the information of having the green face.
Remark: We can see that information for the less probable event is bigger
Another approach to see the information theory is how Shannon introduced it. Historically an antique receiver can send or receive 0 or 1. This is still used in computers. In this way information can be encoded in a bit. Therefore, n-bits contains our information and n-bits can have 2^n possibilities. This means the probability of one combination of n-bits is 1/(2^n). So:
This is where our information measure comes and this is where the base-2 log comes from.
Information = n means the sender have reduced the uncertainty for the receiver by 2^n
Now that we defined the information with a mathematical formula we want to know how much information we‘re getting on average for a certain random variable. This quantity is called entropy. To be practical we will talk only about discrete variables since we want to use this in machine learning use case.
Since we can easily have a good interpretation of information after the last section, we can see the entropy as the average of those information for a certain use case including for a random variable. In fact, if X follows a probability distribution P with a probability mass function p(x). We measure the expected amount of information H(X) through:
For example if we have a coin with a 0.9 to have head and 0.1 to have tail. Then the entropy is:
Entropy has two famous properties:
- if X~P and we want to estimate its distribution P by a distribution Q then
Note: We can prove the property above with Jensen inequality
With equality if and only if P=Q
- if X~P the maximum information for our X with k-class is when its Distribution is uniform among all possible k values and we have:
Note: We can prove the property above with Lagrange optimisation method
Now that we have defined the information measure for one random variable, we would love to have the same measure between two variables X, Y. We want to compare what information X and Y have together with information in each separately. Do they have a common information ?
We can define the information in both variables with joint entropy, and it’s the same formula but with the joint probability as following:
This gives us the information of both variables. We can interpret it as the amount of information contained in the pair (X, Y). We have:
If X and Y are independent then the sum of their entropy equals their joint entropy.
Often the joint entropy is not a goal on its own. In practice, in machine learning for example, we deal with the probability of an output given a certain input. In the NLP field you will compute the probability of a class Y given a text input X, the same goes for computer vision problems. So we need a definition of an information given another. This is very analogous with conditional probability. We have:
This has simple interpretation: The information contained in Y given X, is the information contained in X, Y together minus the information contained in X.
We have answered what is the information contained together for a certain pair of random variables. But now we want to know what is the mutual information that they have. Earlier we have mentioned that the sum of information of two random variables is equal or greater than the information in both together, so why is that? To tackle this question, we’ll have two cases:
- X and Y are independent: It’s safe then to say that X and Y don’t have a mutual information, because H(X, Y) = H(X) + H(Y)
- X and Y are dependent: since H(X, Y) < H(X) + H(Y), then there’s a sort of redundant information that is counted once we’re computing their information together, this information is called mutual information and yes it’s the difference between the sum of information and the information of both. We can also interpret it in different way depending on each of the following equalities:
We can summarize entropy and mutual information with this figure
In an Euclidian space, we can use norms to compute the distance between two points. We would like to do the same when dealing with distributions. In information theory, there’s a way to do so based on what we have learned till now. It’s called Kullback-Leibler divergence or relative entropy and it’s defined by:
Suppose that we have a random variable X that follows the probability distribution P, and we want to estimate P by another probability distribution Q. Kullback-Leibler divergence tells us how good this estimation was.
This metric is always non negative and it equals zero when the distribution Q is exactly P and more importantly it is NOT symmetric so D(P||Q)!=D(Q||P)
If there exists an event x with probability p(x) > 0 and q(x) = 0 then D(P||Q)=∞
From the above formula we can provide an interpretation analogously with entropy since it has an expected value of a logarithmic form. If the probability for an event from P is large but the probability for the same event in Q is small there is a large divergence. If a probability from P is small and the probability form Q is large there is also a large divergence but not as large as the first due the logarithm properties. Hence the non negativity property.
The show you were waiting for will begin now. We will use all what we learned in information theory to reach the famous cross entropy we use in machine learning.
Say we want to solve a machine learning problem with binary classification. Classes are 0 and 1 and our neural network is parametrised by θ.
So maximising the log likelihood is the same as minimising the cross entropy. If we look closely to the formula we can see the same entropy pattern with the logarithmic form, so we can understand our formula from an information theory point of view.
Let’s say that Y follows the distribution P and our prediction follows the distribution Q. We have:
So basically cross entropy is just the sum of information contained in P with the Kullback-Leibler divergence between P and Q. So to minimize the cross entropy comes down to minimize the divergence between the prediction distribution and the true label distribution.
Note: in machine learning H(P) = 0 since p(y) = 0 or 1
This goes the same for multiclass classification with a small change in the formula and we have:
In this article we have introduced many definitions to reach the cross entropy loss function we use in machine learning. The important points to retain are:
- Mathematically, information is the negative log of probability
- The more probable the event is the less informative is. And the more surprising the event is the more informative is.
- Entropy is the average information for a certain random variable
- Joint entropy is the average information for two random variables together.
- Conditional entropy is the average information of a variable given another.
- Mutual information is the average information two variables have in common.
- Kullback-Leibler divergence or relative entropy is a simple metric in information theory to measure how far a distribution is from another.
- Cross entropy between two variables in machine learning is just the Kullback-Leibler divergence between the prediction and label.