It is common practice to normalize data before feeding it into a neural network’s input layer. Normalization is not restricted to the input layer, however: we can also normalize the outputs of the hidden layers. Once normalized, these outputs serve as the inputs to the next hidden layer, so the hidden layers benefit from normalization just as the input layer does. This technique is called **batch normalization**.

**The advantages of implementing batch normalization:**

- **Address covariate shift**: As input values pass through multiple hidden layers, they are multiplied by weights (scaled) and shifted by biases. Because those weights and biases are updated during training, the distribution of the inputs to each layer keeps changing. This change in a layer’s input distribution is known as the **covariate shift** problem. By normalizing the inputs of the hidden layers, batch normalization reduces this problem.
- **Accelerate learning and improve convergence times**: After normalization, the values fall within a narrow range, so they are small and on a comparable scale. This speeds up the learning (convergence) of neural networks.
- **Reduce the problem of vanishing gradients**: This issue can occur when the network has many layers. During training, gradients are propagated backward through the layers as a product of many factors. When those factors are smaller than 1, the product shrinks with every layer, until it is too small to produce a meaningful weight update (in the extreme case, it underflows to zero in floating-point arithmetic). This is the vanishing gradient problem. Because batch normalization keeps the hidden-layer outputs in a consistent range, it helps stop these values from shrinking layer after layer, and so mitigates vanishing gradients.
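To make the scale drift concrete, here is a minimal sketch in plain Python. It is a toy, not a real network: each "layer" only scales by a fixed weight and shifts by a fixed bias, and the names `layer` and `standardize`, along with the weight `0.5` and bias `0.1`, are assumptions chosen for illustration. It shows only how the spread of forward values changes with depth, with and without normalization after each layer.

```python
import random
import statistics


def layer(values, weight, bias):
    """Toy layer: scale every value by a weight and shift it by a bias."""
    return [weight * v + bias for v in values]


def standardize(values, eps=1e-3):
    """Rescale a batch to roughly zero mean and unit variance."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mu=mean)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


random.seed(0)
batch = [random.gauss(0.0, 1.0) for _ in range(64)]

plain, normalized = batch, batch
for _ in range(10):
    # Hypothetical fixed weight and bias; a real network learns these.
    plain = layer(plain, weight=0.5, bias=0.1)
    normalized = standardize(layer(normalized, weight=0.5, bias=0.1))

plain_std = statistics.pstdev(plain)
normalized_std = statistics.pstdev(normalized)
print(f"spread after 10 layers, no normalization:   {plain_std:.5f}")
print(f"spread after 10 layers, with normalization: {normalized_std:.5f}")
```

Without normalization, the spread is halved at every layer and ends up close to zero. With normalization after each layer, it stays close to 1 regardless of depth.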

Data is passed through the layers of a neural network in batches. Batch normalization gets its name from the fact that its calculations are performed across a batch.
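One consequence of computing the statistics across a batch is that the normalized value of an instance depends on the other instances in the same batch. The short sketch below illustrates this, assuming scalar instances and a hypothetical helper named `normalize_batch` that subtracts the batch mean and divides by the batch standard deviation.

```python
import statistics


def normalize_batch(batch, eps=1e-3):
    """Normalize each instance using the mean and variance of its own batch."""
    mean = statistics.fmean(batch)
    var = statistics.pvariance(batch, mu=mean)
    return [(x - mean) / (var + eps) ** 0.5 for x in batch]


batch_a = [1.0, 2.0, 3.0]
batch_b = [3.0, 4.0, 5.0]

norm_a = normalize_batch(batch_a)
norm_b = normalize_batch(batch_b)

# The value 3.0 is the largest instance in batch_a but the smallest in batch_b,
# so it is mapped above zero in one batch and below zero in the other.
print(norm_a[2], norm_b[0])
```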

Let’s consider one mini-batch $B$ with $m$ instances.

**Input:** Values of $x$ over the mini-batch, $B = \{x_1, \dots, x_m\}$

The steps involved in carrying out batch normalization on a mini-batch:

1. The algorithm begins by computing the mean of the mini-batch.

$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i$$

Here, $\mu_B$ represents the mini-batch mean, $m$ the number of instances in the mini-batch, and $x_i$ each input instance.

2. Then the algorithm computes the mini-batch variance.

$$\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$$

Here, $\sigma_B^2$ is the mini-batch variance, $\mu_B$ is the mini-batch mean, $m$ is the number of instances in the mini-batch, and $x_i$ represents each input instance.

3. After that, each individual instance of the mini-batch is standardized.

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

Here, $\hat{x}_i$ represents each normalized instance. $\epsilon$ is a very small number (e.g. 0.001) that is added to avoid division by zero in cases where $\sigma_B^2$ is zero.

4. The normalized instances are then multiplied (scaled) by $\gamma$ (gamma) and shifted by $\beta$ (beta). Gamma and beta are new parameters that can be learned by the network.

$$y_i = \gamma \hat{x}_i + \beta$$

Here, $y_i$ is the output of the batch normalization process.
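The four steps above can be sketched in plain Python as follows. This is a minimal illustration for a mini-batch of scalar instances, not a production implementation: the function name `batch_norm` is an assumption, and gamma and beta are passed in as fixed numbers, whereas a real network would learn them during training.

```python
import math

EPS = 1e-3  # small constant from step 3, guarding against division by zero


def batch_norm(xs, gamma=1.0, beta=0.0, eps=EPS):
    """Apply the four batch-normalization steps to a mini-batch of scalars.

    xs    : list of input instances x_1 .. x_m
    gamma : scale parameter (learned in a real network, fixed here)
    beta  : shift parameter (learned in a real network, fixed here)
    """
    m = len(xs)
    # Step 1: mini-batch mean.
    mu = sum(xs) / m
    # Step 2: mini-batch variance (divides by m, not m - 1).
    var = sum((x - mu) ** 2 for x in xs) / m
    # Step 3: normalize each instance.
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]
    # Step 4: scale by gamma and shift by beta.
    ys = [gamma * xh + beta for xh in x_hat]
    return ys, mu, var


batch = [2.0, 4.0, 6.0, 8.0]
ys, mu, var = batch_norm(batch)
print(mu, var)  # mean 5.0, variance 5.0
print(ys)

# With gamma = sqrt(var + eps) and beta = mu, step 4 undoes step 3,
# so the layer can reproduce its original inputs if that is what training favors.
restored, _, _ = batch_norm(batch, gamma=math.sqrt(var + EPS), beta=mu)
print(restored)
```

The last call hints at why gamma and beta are useful: normalization does not force the layer's outputs to be zero-mean and unit-variance, because the network can choose whatever scale and shift work best for it.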

- Batch normalization provides a solution to the problem of *covariate shift*.
- With batch normalization, convergence can occur rapidly, which dramatically decreases network training time.
- Batch normalization drastically reduces the *vanishing gradient* or *exploding gradient* problem.
- Batch normalization plays the role of a regularizer, preventing neural networks from overfitting the training data.

**Conclusion**

In deep neural networks with many hidden layers, normalizing the inputs alone only ensures a stable start to training over the first few iterations. As the normalized input values are propagated through many hidden layers, the distribution of each layer’s inputs (the hidden unit values) keeps changing (**covariate shift**), and the resulting values can become too small for computer hardware to represent accurately (*vanishing gradient*). So, we need to normalize the outputs of the hidden layers too in order to avoid these problems.

We’ve reached the end of this article. I hope you found the information useful. ❤

Thanks for reading…