
Learning Rate

The learning rate is a hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated. Choosing the learning rate is challenging as a value too small may result in a long training process that could get stuck, whereas a value too large may result in learning a sub-optimal set of weights too fast or an unstable training process.

Finding a good learning rate can be tricky. If you set it way too high, training may actually diverge. If you set it too low, training will eventually converge to the optimum, but it will take a very long time. If you set it slightly too high, it will make progress very quickly at first, but it will end up dancing around the optimum, never really settling down. If you have a limited computing budget, you may have to interrupt training before it has converged properly, yielding a suboptimal solution, as you can see in Figure 1.

You can, however, do better than a constant learning rate: start with a high learning rate and reduce it once training stops making fast progress. This lets you reach a good solution faster than you would with the best constant learning rate. There are many strategies for reducing the learning rate during training; these strategies are called learning schedules.

**Learning Schedules**

In this article, we will discuss four types of learning schedules:

1- Power scheduling

2- Exponential scheduling

3- Piecewise constant scheduling

4- Performance scheduling

**Power scheduling**

Set the learning rate to a function of the iteration number t: η(t) = η0 / (1 + t/s)^c.

The initial learning rate η0 , the power c (typically set to 1) and the steps s are hyperparameters. The learning rate drops at each step, and after s steps it is down to η0 / 2. After s more steps, it is down to η0 / 3. Then down to η0 / 4, then η0 / 5, and so on. As you can see, this schedule first drops quickly, then more and more slowly. Of course, this requires tuning η0 , s (and possibly c).
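The decay described above can be sketched as a small Python function. This is a minimal illustration, not a framework API; the default values for η0, s, and c are arbitrary choices for the example.

```python
def power_schedule(t, eta0=0.01, s=20, c=1):
    """Power scheduling: eta0 / (1 + t/s)**c.

    With c = 1, the rate halves after s steps (eta0/2),
    then drops to eta0/3 after 2s steps, eta0/4 after 3s, ...
    """
    return eta0 / (1 + t / s) ** c
```

For example, `power_schedule(0)` returns η0 itself, and `power_schedule(20)` (one full period of s steps, with c = 1) returns η0 / 2.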

**Exponential scheduling**

Set the learning rate to η(t) = η0 · 0.1^(t/s).

The learning rate will gradually drop by a factor of 10 every s steps. While power scheduling reduces the learning rate more and more slowly, exponential scheduling keeps slashing it by a factor of 10 every s steps.
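This schedule can likewise be written as a one-line function; again the default hyperparameter values are illustrative only.

```python
def exponential_schedule(t, eta0=0.01, s=20):
    """Exponential scheduling: eta0 * 0.1**(t/s).

    The rate is divided by 10 every s steps, at the same
    relative pace throughout training.
    """
    return eta0 * 0.1 ** (t / s)
```

After s steps the rate is η0 / 10, after 2s steps η0 / 100, and so on, in contrast to power scheduling, whose relative decay slows down over time.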

**Piecewise constant scheduling**

Use a constant learning rate for a number of epochs, such as η0 = 0.1 for 5 epochs, then a smaller rate, such as η1 = 0.001 for the next 50 epochs, and so on. Although this approach can be very effective, it requires some trial and error to determine the right sequence of learning rates and how long to use each one.
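A piecewise constant schedule is just a lookup on the current epoch. The sketch below uses the example boundaries from the text (0.1 for the first 5 epochs, then 0.001); in practice you would tune both the rates and the boundaries.

```python
def piecewise_constant_schedule(epoch):
    """Piecewise constant scheduling with the boundaries from the text:
    0.1 for epochs 0-4, then 0.001 from epoch 5 onward.
    """
    return 0.1 if epoch < 5 else 0.001
```

More boundaries can be added with further `elif` branches or a sorted list of (boundary, rate) pairs.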

**Performance scheduling**

Measure the validation error every N steps (just as for early stopping), and reduce the learning rate by a constant factor when the error stops dropping.

This approach is simple, and when you save the model, the learning rate and its schedule (including its state) are saved along with it, so you don't have to reinitialize them each time.
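The stateful nature of performance scheduling can be sketched as a small class that tracks the best validation error seen so far. The `patience` and `factor` hyperparameters, and the class name itself, are illustrative assumptions, not an established API.

```python
class PerformanceScheduler:
    """Reduce the learning rate by `factor` whenever the validation
    error has not improved for `patience` consecutive measurements."""

    def __init__(self, eta0=0.1, factor=0.5, patience=2):
        self.lr = eta0
        self.factor = factor
        self.patience = patience
        self.best = float("inf")  # best validation error seen so far
        self.wait = 0             # measurements since the last improvement

    def update(self, val_error):
        """Call every N steps with the latest validation error."""
        if val_error < self.best:
            self.best = val_error
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr *= self.factor
                self.wait = 0
        return self.lr
```

The state (`best`, `wait`, and the current rate) is exactly what must be saved with the model so that training can resume where it left off.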

A 2013 paper by Andrew Senior et al. compared the performance of some of the most popular learning schedules when training deep neural networks for speech recognition using momentum optimization. The authors concluded that, in this setting, both performance scheduling and exponential scheduling performed well. They favored exponential scheduling because it was easy to tune and converged slightly faster to the optimal solution.

**References**

Understand the dynamics of learning rate

Hands-On Machine Learning with Scikit-Learn and TensorFlow

An Empirical Study of Learning Rates in Deep Neural Networks for Speech Recognition
