The math behind two of the most widely used activation functions in Machine Learning
The Sigmoid and SoftMax functions are activation functions used in Machine Learning, and more specifically in the field of Deep Learning, for classification tasks.
Activation function: a function that transforms the weighted sum of a neuron's inputs so that the output is non-linear.
Note. The sigmoid function is also called the Logistic Function, since it was first introduced together with the Logistic Regression algorithm.
Both functions take a value X from the range of the real numbers R and output a number between 0 and 1 that represents the probability of X belonging to a certain class.
Notation: P(Y=k|X=x) is read as “The probability of Y being k given the input X being x”.

But if both functions perform the same transformation (i.e. do the same thing), what is the difference between them?
Sigmoid is used for binary classification methods where we only have 2 classes, while SoftMax applies to multiclass problems. In fact, the SoftMax function is an extension of the Sigmoid function.
Therefore, the input and output of both functions are slightly different: Sigmoid receives just one input and outputs a single number that represents the probability of belonging to class 1 (remember that we only have 2 classes, so the probability of belonging to class 2 = 1 − P(class 1)). SoftMax, on the other hand, is vectorized, meaning that it takes a vector with as many entries as there are classes and outputs another vector where each component represents the probability of belonging to the corresponding class.
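To make the shape difference concrete, here is a minimal NumPy sketch; the scores used are illustrative assumptions, not values from the article:

```python
import numpy as np

# Binary case (Sigmoid): a single score in, a single probability out
x = 0.8                               # illustrative score
p_class1 = 1.0 / (1.0 + np.exp(-x))   # sigmoid(x)
p_class2 = 1.0 - p_class1
print(p_class1, p_class2)             # two probabilities that add up to 1

# Multiclass case (SoftMax): one score per class in, one probability per class out
scores = np.array([2.0, 1.0, 0.1])    # illustrative scores, one per class
probs = np.exp(scores) / np.exp(scores).sum()
print(probs, probs.sum())             # the entries add up to 1
```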

We already know what each function does and in which cases to use them. The only thing left is the mathematical formulation (More math notation!)
Sigmoid function
Imagine our model outputs a single value X that can take any value from the real numbers X ∈ (-∞,+∞) and we want to transform that number into a probability P ∈ [0,1] that represents the probability of belonging to the first class (we just have 2 classes).
However, to solve this problem we must think in the opposite way. How do I transform a probability P ∈ [0,1] into a value X ∈ (-∞,+∞)?
Although it seems illogical, the solution lies in horse betting (Mathematicians have always liked games).
In horse betting, there is a commonly used term called odds [1]. When we state that the odds of horse number 17 winning the race are 3/8, we are actually saying that out of 11 races the horse is expected to win 3 of them and lose 8. Mathematically, the odds can be seen as the ratio between the probability of an event happening and the probability of it not happening, and they are expressed as:

odds = P / (1 − P)

In the example, P(win) = 3/11 and P(lose) = 8/11, so odds = (3/11) / (8/11) = 3/8.

The odds can take any positive value and therefore have no ceiling restriction, odds ∈ [0, +∞). However, if we take the log of the odds, we find that its range changes to (−∞, +∞). The log of the odds is called the logit function:

logit(P) = log( P / (1 − P) ), with logit(P) ∈ (−∞, +∞)

Finally, the function that we were looking for, i.e. the Logistic function or SIGMOID FUNCTION, is the inverse of the logit (it maps values from the range (−∞, +∞) back into [0, 1]). Inverting x = log( P / (1 − P) ) gives e^x = P / (1 − P), and solving for P yields P = e^x / (1 + e^x).

Thus obtaining the formula:

sigmoid(x) = 1 / (1 + e^(−x))
Where X denotes the input (in the case of neural networks the input is the weighted sum of the last neuron, usually represented by z = x1·w1 + x2·w2 + … + xn·wn)
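As a rough sketch of how this looks in code, here is a minimal NumPy version; the inputs, weights and probability values below are arbitrary illustrative assumptions, not taken from the article:

```python
import numpy as np

def sigmoid(z):
    # Inverse of the logit: maps z in (-inf, +inf) to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Weighted sum of the last neuron: z = x1·w1 + x2·w2 + ... + xn·wn
x = np.array([0.5, -1.0, 2.0])        # inputs (illustrative)
w = np.array([0.3, 0.8, -0.5])        # weights (illustrative)
z = np.dot(x, w)

p_class1 = sigmoid(z)
print(z, p_class1, 1.0 - p_class1)    # P(class 1) and P(class 2)

# Sanity check: sigmoid really is the inverse of the logit
p = 0.73                              # illustrative probability
print(sigmoid(np.log(p / (1 - p))))   # recovers ~0.73
```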
SoftMax function
On the other hand, we’ve seen that SoftMax takes a vector as input. This vector has the same dimension as the number of classes we have. We will call it X (although another common notation in neural networks is Z, where each element of the vector is the output of the penultimate layer).

X = (x1, x2, …, xK), where K is the number of classes

Same as with the Sigmoid function, the input belongs to the real values (in this case each of the vector entries), xi ∈ (−∞, +∞), and we want to output a vector where each component is a probability P ∈ [0, 1]. Moreover, the output vector must be a probability distribution over all the predicted classes, i.e. all the entries of the vector must add up to 1. This restriction can be translated as: each input must belong to one class and only one.
We can think of X as the vector that contains the logits of P(Y=i|X) for each of the classes, since the logits can be any real number (here i represents the class number). Remember that logit ∈ (−∞, +∞).

X = ( logit(P(Y=1|X)), logit(P(Y=2|X)), …, logit(P(Y=K|X)) )

However, unlike in the binary classification problem, we cannot simply apply the Sigmoid function. The reason is that, when applying Sigmoid element-wise, we obtain isolated probabilities, not a probability distribution over all the predicted classes, and therefore the output vector’s elements don’t add up to 1 [2].
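To see the problem concretely, here is a minimal sketch; the logit vector is an arbitrary illustrative choice:

```python
import numpy as np

logits = np.array([0.8, -1.2, 2.5])       # illustrative logits, one per class

# Applying Sigmoid element-wise gives isolated per-class probabilities...
isolated = 1.0 / (1.0 + np.exp(-logits))

# ...but they do not form a probability distribution: the entries don't add up to 1
print(isolated, isolated.sum())           # sum is about 1.85 here, not 1
```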

To convert X into a probability distribution, we can apply the exponential function and obtain the odds ∈ [0, +∞):

odds_i = e^(xi)  (since xi = log( P / (1 − P) ), we have e^(xi) = P / (1 − P), i.e. the odds)

After that, we can see that the odds are a monotonically increasing function of the probability: when the probability increases, the odds increase as well, growing exponentially with respect to the logit [2].
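A few arbitrarily chosen probabilities illustrate the monotone relationship between probability, logit and odds:

```python
import numpy as np

# Illustrative probabilities: as p grows, the logit and the odds grow with it
for p in [0.1, 0.5, 0.9, 0.99]:
    logit = np.log(p / (1 - p))
    odds = np.exp(logit)              # equals p / (1 - p)
    print(f"p={p:<4}  logit={logit:+.2f}  odds={odds:.2f}")
```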

Therefore, we can use the odds (or their equivalent, exp(logit)) as a score to predict the probability, since the higher the odds, the higher the probability.
Finally, we can just normalize the result by dividing by the sum of all the odds, so that the range changes from [0, +∞) to [0, 1] and we make sure that the sum of all the elements equals 1, thus building a probability distribution over all the predicted classes:

SoftMax(X)_i = e^(xi) / ( e^(x1) + e^(x2) + … + e^(xK) )

Now, if we take the same example as before, we see that the output vector is indeed a probability distribution and that all its entries add up to 1.
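Putting the two steps (exponentiation and normalization) together, here is a minimal SoftMax sketch. The article’s original example values are not reproduced here, so the input vector is an arbitrary illustrative choice (the same vector used in the element-wise Sigmoid sketch above):

```python
import numpy as np

def softmax(x):
    # Step 1: exponentiate the logits to obtain the odds
    odds = np.exp(x)
    # Step 2: normalize by the sum of all the odds
    return odds / odds.sum()

logits = np.array([0.8, -1.2, 2.5])   # illustrative logits, one per class
probs = softmax(logits)

print(probs)        # approximately [0.151, 0.020, 0.828]
print(probs.sum())  # 1.0 (up to floating-point precision): a probability distribution
```

In practice, implementations usually subtract max(x) from the logits before exponentiating to avoid numerical overflow; this shift does not change the resulting probabilities.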
