Gradient descent is a crucial part of machine learning. It’s an algorithm used to help a computer “learn” one step at a time. Conceptually, it’s like going down a hill until you reach a flat valley. We’ll go over where the hills and valleys come from, and how this actually helps a computer learn.
This article is part of a six-part series. It is recommended that you read the articles in order, but feel free to skip any if you already know the material.
Gradient descent is made up of two terms: gradient, and descent.
Gradient is comparable to slope, but applicable to more than just two dimensions.
Descent explains what you’re doing with that multidimensional slope — you’re using it to travel “downward” until you reach a flat point — a local minimum.
This method is applied to something called a “cost function” in machine learning, which measures how far off your model’s predictions are on data it hasn’t seen before. As the name implies, a higher cost is bad, so gradient descent is used to get the cost as low as possible.
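As a concrete sketch, here is what a tiny cost function might look like in Python. The linear model, the data, and the parameter values below are hypothetical, made up purely for illustration; the cost used here is mean squared error, one common choice:

```python
# A minimal sketch of a cost function: mean squared error (MSE).
# Model, data, and parameters are hypothetical examples.
def predict(theta, x):
    """A simple linear model: y = theta[0] + theta[1] * x."""
    return theta[0] + theta[1] * x

def cost(theta, xs, ys):
    """Average squared gap between predictions and the true targets."""
    errors = [(predict(theta, x) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

xs, ys = [0, 1, 2], [1, 3, 5]   # data generated by y = 1 + 2x
print(cost([1, 2], xs, ys))     # perfect parameters: cost is 0.0
print(cost([0, 0], xs, ys))     # bad parameters: much higher cost
```

Gradient descent’s whole job is to nudge the parameters until this cost is as low as it can get.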
Before going into any details, it’s important that you have an understanding of any math terminology that may be used throughout this article.
If you’ve taken Precalculus, you can skip the following subsection.
Delta (Δ) — how much the value of something changes, calculated by final value minus initial value.
Tangent — a line that “hugs” the graph of a function, touching it at a single point, at least in the nearby area. This line is unique: “tipping” it at that point would make it intersect the graph twice in the local region.
Slope — how “steep” a line is, measured by Δy / Δx
Parameters — values that you input into a function to get an output
Vector — something which has both direction and magnitude. Note that “direction” isn’t limited to just 2D or 3D.
Note that vector notation contains the magnitude (i.e. “strength”) in each direction: for example, the vector ⟨3, 4⟩ has a magnitude of 3 in the x-direction and 4 in the y-direction.
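Both slope and vector magnitude are easy to compute; here is a short Python sketch using made-up points and components:

```python
import math

# Slope of the line through two hypothetical points: Δy / Δx.
x1, y1 = 1.0, 2.0
x2, y2 = 3.0, 8.0
slope = (y2 - y1) / (x2 - x1)   # (8 - 2) / (3 - 1) = 3.0

# A 2D vector with magnitude 3 in the x-direction and 4 in the y-direction;
# its overall magnitude comes from the Pythagorean theorem.
vx, vy = 3.0, 4.0
magnitude = math.hypot(vx, vy)  # sqrt(3**2 + 4**2) = 5.0
print(slope, magnitude)
```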
Concept of Dimensions
A function that has n parameters can be graphed in n+1 dimensions — the extra dimension carries the output values of the function.
Hyperplane — in an n-dimensional function, a tangent hyperplane can be drawn which has n-1 dimensions. This is similar to how a 2D graph has a 1D hyperplane (tangent line), or how a 3D graph has a 2D hyperplane (regular plane). This concept is demonstrated below.
Continuous — a function that does not have any holes, jumps, etc. In 2D, think of it as something that you can draw without lifting your pencil. In 3D, think of it as a piece of fabric with no rips or holes.
Differentiability — a function is differentiable if you can draw a tangent hyperplane on it. For example, a differentiable 2D function means that you should be able to draw a tangent 1D line at any x-value. For a function to be differentiable, it must be continuous (no gaps or jumps) and have no cusps.
Derivative — the slope of a tangent line at a given point in a function. Think of it as how “steep” the function is at that point.
IMPORTANT: For an expression that you’re trying to find the derivative of, you can take the derivative of each term separately and add them together. This key knowledge will be useful in future articles.
Partial derivative — this is very similar to a derivative, except it’s used for functions with more than one parameter. The variable at the bottom next to the ∂ represents which parameter you’re taking the derivative with respect to.
Nabla/Del (∇) — this is a vector that contains the partial derivatives with respect to all the parameters of a function. Mathematically, it’s called the gradient of the function, which can be thought of as a multidimensional slope. Since it represents how “steep” the function is in every axis direction, put together it points in the direction of steepest ascent, which is why gradient descent travels against it.
Minimum — the parameters of a function that lead to the lowest possible output value
Local minimum — a “bowl” in a function. While there are “walls” on all sides containing it that point upward, this may not be the lowest point of the function — it’s simply the lowest point in its local region.
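The derivative, partial derivative, and gradient definitions above can be sketched numerically with the central-difference approximation; the functions below are hypothetical examples chosen so the answers are easy to check by hand:

```python
# Numerical derivatives via the central-difference approximation.
def derivative(f, x, h=1e-6):
    """Slope of the tangent line of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

def gradient(f, point, h=1e-6):
    """Vector of partial derivatives: nudge one parameter at a time,
    holding every other parameter fixed."""
    grad = []
    for i in range(len(point)):
        up, down = list(point), list(point)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

f = lambda x: x ** 2                 # derivative is 2x
print(derivative(f, 3.0))            # ≈ 6.0

# The derivative of a sum is the sum of the term-by-term derivatives:
s = lambda x: x ** 2 + 5 * x         # derivative is 2x + 5
print(derivative(s, 3.0))            # ≈ 11.0

g = lambda p: p[0] ** 2 + 3 * p[1]   # ∂g/∂θ₁ = 2θ₁, ∂g/∂θ₂ = 3
print(gradient(g, [1.0, 5.0]))       # ≈ [2.0, 3.0]
```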
J(θ₁, θ₂, θ₃, …) represents the cost function. We’ll get into what this actually is in later articles — for now, just know it’s a function whose output we want to minimize.
θ represents all the parameters that we plug into J. Essentially, it’s a shorthand form of writing (θ₁, θ₂, θ₃, …, θₙ).
θᵢ represents the value of the i-th parameter that we plug into J.
α, γ, and η all are used by convention to represent the step size, or learning rate. For this article, we will use η.
All the definitions above are great and all, but what the heck is gradient descent? Well, much like the name implies, it uses the gradient to find the direction and magnitude of the steepest descent and takes a step that way. Then, it simply repeats this until it reaches a flat “valley”: a local minimum. But how exactly does this work?
Equation and Explanation
Mathematically, the equation for gradient descent is the following:

θ ← θ − η∇J(θ)

In other words, the new parameters are the old parameters minus the learning rate times the gradient of the cost function. What exactly does this do? Let’s break it down one step at a time.
First, we take the gradient of something called the cost function (we’ll go over what this is in more detail in later articles) to figure out which way to travel, and by how much. Then, we multiply that by η, which represents the step size. This variable must be manually set — for now. Think of it as the “reins”: setting it too high can lead to everything spiraling out of control, whereas setting it too low will impede progress.
More visually, setting a learning rate that’s too high will lead to the algorithm being too aggressive in its steps, leading it to constantly overshoot the local minimum.
Setting a learning rate that’s too low will lead to the algorithm taking really small steps and not getting anywhere significant.
A learning rate that’s just right will help the algorithm converge to a local minimum quickly and efficiently.
As you can see by now, choosing this value is very subjective and case-dependent, but common starting values are 0.1, 0.01, 0.001, and 0.0001.
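To see these three regimes concretely, here is a sketch that runs gradient descent on the hypothetical function f(x) = x² (whose gradient is 2x and whose minimum sits at x = 0) with different learning rates:

```python
# Effect of the learning rate η on gradient descent over f(x) = x².
def descend(eta, steps=50, x=3.0):
    for _ in range(steps):
        x = x - eta * 2 * x   # θ ← θ − η · ∇J(θ), with ∇J(x) = 2x
    return x

print(descend(0.1))      # just right: ends very close to 0
print(descend(0.0001))   # too low: barely moves from the start at 3.0
print(descend(1.1))      # too high: overshoots back and forth, |x| explodes
```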
After you figure out how much to travel in what direction, you can simply update the old parameters. By now, you’re probably wondering why we subtract and not add. Let’s take a simpler example to demonstrate this concept:
In the picture above, the slope of the tangent line is positive. However, the goal is to converge to the minimum. So, we must progress in the negative direction. This same logic can be applied to negative slopes as well:
This time, we have a negative slope, and to converge to the minimum, we must go in the opposite direction.
By applying the same logic to gradients, we have to subtract to travel towards a local minimum, no matter which direction the gradient points.
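The sign logic can be sketched with the hypothetical function f(x) = (x − 2)², whose minimum sits at x = 2 and whose derivative is 2(x − 2). Starting on either side of the minimum, subtracting the slope moves us toward it:

```python
# Subtracting the derivative moves toward the minimum from both sides.
def step(x, eta=0.1):
    slope = 2 * (x - 2)      # derivative of (x - 2)**2
    return x - eta * slope   # positive slope: move left; negative: move right

x_right, x_left = 5.0, -1.0  # one start on each side of the minimum
for _ in range(100):
    x_right = step(x_right)
    x_left = step(x_left)
print(x_right, x_left)       # both approach 2.0
```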
Earlier, we saw how we had to manually choose a learning rate, and how it may take some trial and error to get a satisfactory result. For those new to machine learning, simply picking and trying out a value is good enough. However, more advanced techniques do exist. For example, the Barzilai-Borwein method continuously adjusts how “fast” the descent is using information from the previous iteration.
Qualitatively, when there is a sharp change in the gradient, the learning rate gets reduced so the descent can be more “careful”. However, when the gradient only changes slightly, larger step sizes are encouraged. This kind of adaptation often reaches a local minimum faster than a fixed learning rate. In addition, on many well-behaved problems this method converges without careful tuning, so you don’t have to agonize over learning rates that are too small or too large: you simply pick one to start with, and let the algorithm handle the rest.
To get a more mathematical intuition, consider the step size this method computes at each iteration:

ηₖ = (Δθ · Δg) / ‖Δg‖²

where Δθ is the change in the parameters since the previous iteration and Δg is the corresponding change in the gradient. We can break this equation down into three parts: the left numerator term, the right numerator term, and the denominator. The left numerator term (Δθ) measures the length of the step in each dimension. The right numerator term (Δg) measures the change in the gradient. After multiplying them and summing the vector components (a dot product), you get a combined measure of your step size and how much the gradient responded to it.
The denominator, on the other hand, takes the change in the gradient, measures its magnitude, and squares it. Put together, a large change in gradient makes the denominator outweigh the numerator and “slow down” the descent. On the other hand, if the gradient changes only slowly, the denominator stays small and pushes the learning rate higher. And if a sudden change in gradient is then encountered, the denominator once again takes over and reduces the learning rate.
In summary, this equation helps achieve a good learning rate balance at each iteration so that the function converges efficiently in all directions. In contrast, a constant learning rate just ensures that the function doesn’t overshoot in any direction and doesn’t care to adjust itself if the rate of progress speeds up or slows down.
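For the curious, here is a minimal Python sketch of this idea, using one common form of the Barzilai-Borwein step size, ηₖ = (Δθ · Δg) / ‖Δg‖². The function being minimized and the starting point are made-up examples:

```python
# Sketch of gradient descent with a Barzilai-Borwein step size.
def bb_descent(grad, theta, eta0=0.01, steps=30):
    prev_theta, prev_grad = theta[:], grad(theta)
    # The very first step uses a manually chosen learning rate.
    theta = [t - eta0 * g for t, g in zip(theta, prev_grad)]
    for _ in range(steps):
        g = grad(theta)
        d_theta = [a - b for a, b in zip(theta, prev_theta)]  # Δθ
        d_grad = [a - b for a, b in zip(g, prev_grad)]        # Δg
        denom = sum(dg * dg for dg in d_grad)                 # ‖Δg‖²
        if denom == 0:
            break                                             # converged
        eta = sum(dt * dg for dt, dg in zip(d_theta, d_grad)) / denom
        prev_theta, prev_grad = theta[:], g
        theta = [t - eta * gi for t, gi in zip(theta, g)]
    return theta

# Minimize the hypothetical cost J(θ) = θ₁² + 10·θ₂², gradient (2θ₁, 20θ₂).
grad = lambda th: [2 * th[0], 20 * th[1]]
print(bb_descent(grad, [3.0, 1.0]))   # approaches [0.0, 0.0]
```

Notice that only the very first learning rate is picked by hand; every later step size is computed from the previous iteration’s data.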
Congrats! Now you know exactly how gradient descent works, both conceptually and mathematically. However, one thing was left unmentioned so far: some functions have points that appear flat in the local region but aren’t actually local minima. Such a point is called a saddle point.
How do you detect this? Well, nothing can be determined with absolute certainty when you’re exploring such a vast space of parameter values by repeated guessing. It’s hard to know mathematically when you’re at such a point, since you’re only ever looking at your local region. There are various methods that try to overcome this problem, but it’s still an open field of research.
There are also other notable methods to do gradient descent:
AdaGrad — adapts the step size per parameter; especially effective for sparse-gradient applications such as natural language processing and computer vision
AdaDelta — an extension of AdaGrad that fixes its continually shrinking learning rate
RMSProp — works effectively on noisy problems
Adam — combines the strengths of RMSProp and AdaGrad
In the end, there is no method that “tops all.” Each situation is different and needs to be treated as such. However, for general applications, the classical form of gradient descent described in this article is more than enough.
Further Learning — Derivatives
Explaining how derivatives work is difficult, as it builds on many years of mathematics and requires a strong precalculus background. However, just because you don’t understand how it works doesn’t mean that you can’t do it yourself!
I would recommend taking a gander at Math Is Fun’s summary on this topic. Of course, if you would like to learn about it in more detail, feel free to explore other internet resources; Khan Academy is another great option.
What about partial derivatives, you ask? That’s quite easy once you know how to take a derivative. You simply treat all the variables other than the one you’re differentiating with respect to as constants. That’s it!
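As a quick sketch with a hypothetical function, you can check a hand-computed partial derivative numerically:

```python
# f(x, y) = x²·y + y³
# ∂f/∂x = 2xy        (treat y as a constant)
# ∂f/∂y = x² + 3y²   (treat x as a constant)
def f(x, y):
    return x ** 2 * y + y ** 3

def df_dx(x, y, h=1e-6):
    """Numerical ∂f/∂x: nudge x, hold y fixed."""
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

x, y = 2.0, 3.0
print(df_dx(x, y))   # ≈ 12.0
print(2 * x * y)     # hand-computed ∂f/∂x = 12.0
```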