A deep dive into Batch, Stochastic, and Mini-Batch Gradient Descent algorithms using Python
Gradient descent is a popular optimization algorithm that is used in machine learning and deep learning models such as linear regression, logistic regression, and neural networks. It uses first-order derivatives iteratively to minimize the cost function by updating model coefficients (for regression) and weights (for neural networks).
In this article, we will delve into the mathematical theory of gradient descent and explore how to perform calculations using Python. We will examine various implementations including Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent, and assess their effectiveness on a range of test cases.
While following the article, you can check out the Jupyter Notebook on my GitHub for complete analysis and code.
Before a deep dive into gradient descent, let’s first go through the loss function.
The terms loss and cost are used interchangeably to describe the error in a prediction. A loss value indicates how far a prediction is from the actual value, and the loss function aggregates all the loss values from multiple data points into a single number.
As you can see in the image below, the model on the left has high loss, whereas the model on the right has low loss and fits the data better.
The loss function (J) is used as a performance measurement for prediction algorithms and the main goal of a predictive model is to minimize its loss function, which is determined by the values of the model parameters (i.e., θ0 and θ1).
For example, linear regression models frequently use squared loss to compute the loss value, and mean squared error (MSE) is the loss function that averages all squared losses.
The linear regression model works behind the scenes by going through several iterations to optimize its coefficients and reach the lowest possible mean squared error.
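To make the squared loss and MSE concrete, here is a minimal NumPy sketch; the data points and parameter values (θ0 = 0.5, θ1 = 1.8) are illustrative assumptions, not numbers from this article.

```python
import numpy as np

# Illustrative data: a single feature X and target values y
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Assumed parameter values for the intercept (theta0) and coefficient (theta1)
theta0, theta1 = 0.5, 1.8

# Predictions of the linear model: y_pred = theta0 + theta1 * x
y_pred = theta0 + theta1 * X

# Squared loss per data point, then MSE as their average
squared_losses = (y - y_pred) ** 2
mse = squared_losses.mean()
print(f"MSE: {mse:.4f}")
```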
What is Gradient Descent?
The gradient descent algorithm is usually described with a mountain analogy:
⛰ Imagine yourself standing atop a mountain, with limited visibility, and you want to reach the ground. While descending, you’ll encounter slopes and pass them using larger or smaller steps. Once you’ve reached a slope that is almost level, you’ll know that you’ve arrived at the lowest point. ⛰
In technical terms, the gradient refers to these slopes. When the slope is zero, it may indicate that you’ve reached a function’s minimum or maximum value.
At any given point on a curve, the steepness of the slope can be determined by a tangent line — a straight line that touches the curve at that point (red lines in the image above). Similar to the tangent line, the gradient of a point on the loss function is calculated with respect to the parameters, and a small step is taken in the opposite direction to reduce the loss.
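Written as a formula, each iteration applies the standard gradient descent update rule, shown here in textbook notation rather than as a formula from this article; η is the learning rate, a hyperparameter discussed later:

```latex
\theta_j \leftarrow \theta_j - \eta \, \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_j}, \qquad j \in \{0, 1\}
```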
To summarize, the process of gradient descent can be broken down into the following steps:
- Select a starting point for the model parameters.
- Determine the gradient of the cost function with respect to the parameters and continually adjust the parameter values through iterative steps to minimize the cost function.
- Repeat step 2 until the cost function no longer decreases or the maximum number of iterations is reached.
We can examine the gradient calculation for the previously defined cost (loss) function. Although we are using linear regression with a single intercept (θ0) and coefficient (θ1), this reasoning extends to regression models with several variables.
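As a hedged sketch of what this calculation could look like in code, the snippet below computes the partial derivatives of the MSE with respect to θ0 and θ1 and repeatedly applies the update rule (plain batch gradient descent). The function and variable names are assumptions for illustration, not this article's own implementation.

```python
import numpy as np

def gradient_step(X, y, theta0, theta1, learning_rate=0.01):
    """One batch gradient descent step for MSE on a simple linear model."""
    n = len(X)
    error = theta0 + theta1 * X - y  # prediction minus actual value

    # Partial derivatives of MSE with respect to theta0 and theta1
    grad_theta0 = (2 / n) * error.sum()
    grad_theta1 = (2 / n) * (error * X).sum()

    # Take a small step in the opposite direction of the gradient
    theta0 -= learning_rate * grad_theta0
    theta1 -= learning_rate * grad_theta1
    return theta0, theta1

# Repeat until the maximum number of iterations is reached (the stopping rule above)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
theta0, theta1 = 0.0, 0.0
for _ in range(1000):
    theta0, theta1 = gradient_step(X, y, theta0, theta1)
print(f"theta0 ≈ {theta0:.3f}, theta1 ≈ {theta1:.3f}")
```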
💡 Sometimes, the point that has been reached may only be a local minimum or a plateau. In such cases, the model needs to continue iterating until it reaches the global minimum. Reaching the global minimum is unfortunately not guaranteed, but with a proper number of iterations and a well-tuned learning rate, we can increase the chances.
The `learning_rate` is the hyperparameter of gradient descent that defines the size of the learning step. It can be tuned using hyperparameter tuning techniques.
- If the `learning_rate` is set too high, it could result in a jump that produces a loss value greater than the starting point. A high `learning_rate` might cause gradient descent to diverge, leading it to continually obtain higher loss values and preventing it from finding the minimum.
- If the `learning_rate` is set too low, it can lead to a lengthy computation process where gradient descent iterates through numerous rounds of gradient calculations to reach convergence and discover the minimum loss value.
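To illustrate both failure modes, here is a minimal sketch that runs the same batch gradient descent loop from earlier with three different `learning_rate` values and compares the resulting loss. The data, the rate values, and the helper name are illustrative assumptions, not an experiment from this article.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def run_gradient_descent(learning_rate, n_iterations=1000):
    """Run batch gradient descent and return the final MSE."""
    theta0, theta1, n = 0.0, 0.0, len(X)
    for _ in range(n_iterations):
        error = theta0 + theta1 * X - y
        theta0 -= learning_rate * (2 / n) * error.sum()
        theta1 -= learning_rate * (2 / n) * (error * X).sum()
        if not (np.isfinite(theta0) and np.isfinite(theta1)):
            return float("inf")  # diverged: the loss keeps growing
    return float(((theta0 + theta1 * X - y) ** 2).mean())

# Too low: barely moves; moderate: converges; too high: diverges
for lr in (1e-5, 0.05, 1.0):
    print(f"learning_rate={lr}: final MSE = {run_gradient_descent(lr):.6f}")
```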
The size of each learning step also depends on the slope of the curve, which means that as we approach the minimum point and the slope flattens, the learning steps naturally become smaller.
When using a low learning rate, progress will be slow but steady, whereas a high learning rate may produce rapid initial progress at the risk of overshooting the minimum and never settling at the lowest point.
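One way to observe this shrinking-step behaviour, under the same illustrative setup as the sketches above, is to print the magnitude of the update (learning rate times gradient) over the iterations:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
theta0, theta1, lr, n = 0.0, 0.0, 0.05, len(X)

for i in range(1, 201):
    error = theta0 + theta1 * X - y
    grad0 = (2 / n) * error.sum()
    grad1 = (2 / n) * (error * X).sum()
    theta0 -= lr * grad0
    theta1 -= lr * grad1
    if i in (1, 10, 50, 200):
        # The update magnitude shrinks as the slope flattens near the minimum
        step_size = np.hypot(lr * grad0, lr * grad1)
        print(f"iteration {i:3d}: step size = {step_size:.6f}")
```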