
*This post is part of a series of articles on Machine Learning in Dart.*

In my last article, we got familiar with a concise and elegant solution to the Ordinary Least Squares problem — The Closed Form Solution.

This article will dig into another optimization algorithm — gradient descent. In the end, we will code the algorithm in the Dart programming language.

Let’s remember what our goal is: we have to find the minimum of the cost function, which in our case is the squared error. A typical graph of this function looks like a pit.

So let’s imagine that we are standing at the edge of that pit. It’s completely dark, and we have to find the bottom.

Let’s take a step backward. We can immediately feel that we are ascending, since ascending is always harder than descending. So the backward direction won’t lead us to the bottom. Let’s change direction and take some steps forward. Now we can feel that we are descending. After many steps, the pit becomes less steep. Eventually we reach the bottom, but we don’t know it yet: we take one more step forward, feel that we are starting to ascend, and step back. Now we know for sure that we are at the bottom of the pit.

We’ve just described the gradient descent algorithm. To make a computer understand it, we should first translate the intuitive description into math and then into code in a programming language.

In mathematical terms, a step means evaluating the derivative of the function at some point. The derivative, in turn, tells us how steep the function is at that point:

- a large absolute value of the derivative implies that the function is quite steep at this point;
- a small value means that the function is relatively flat at the point;
- a value equal to zero means that the point is a stationary point of the function, such as a local optimum;
- a positive sign of the derivative means that the function is ascending;
- a negative sign of the derivative means that the function is descending.
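As a concrete check of these rules (an illustrative example, not from the original article), take the parabola $f(x) = x^2$:

```latex
f(x) = x^2, \qquad f'(x) = 2x,
\qquad f'(3) = 6 > 0, \qquad f'(-3) = -6 < 0, \qquad f'(0) = 0
```

At $x = 3$ the derivative is positive (the function is ascending, so we move left), at $x = -3$ it is negative (descending, so we move right), and at $x = 0$ the derivative is zero: we are at the minimum.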

Formally, the minimum is the coordinate of the lowest point. On every iteration of the gradient descent procedure, the coordinate is updated.

Here we should come up with an update rule.

We need to move the coordinate in the direction of the minimum, but how can we determine the direction? It’s easy: if the derivative at the current point is negative, the function is descending, and we should move the coordinate to the right.

If the derivative at the point is positive, the function is ascending, and we should move the coordinate to the left.

If the derivative is zero, we are at the minimum.

The following rule describes the process best: on each iteration, subtract the derivative at the current point from the current coordinate.

In the case of a negative derivative, the subtraction moves the coordinate to the right; in the case of a positive derivative, it moves the coordinate to the left.

For more precise convergence, it’s better to scale the derivative by a coefficient. This coefficient is called the *learning rate*.
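Written out explicitly (a standard formulation; the symbol $\eta$ for the learning rate is my notation, not from the original), the rule is:

```latex
x_{new} = x_{current} - \eta \, f'(x_{current})
```

Subtracting the scaled derivative moves the coordinate to the right when $f'$ is negative and to the left when it is positive, which matches the intuition above.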

When should we stop the process? Let’s take the absolute difference between the new and the current coordinate.

If the difference is still larger than some small tolerance value, we repeat the process; otherwise, the algorithm has converged.
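The one-dimensional procedure described above can be sketched in Dart (a minimal illustration; the function name and default parameter values are my own, not from the original):

```dart
/// Minimizes a 1D function given its [derivative], starting from [start].
/// On each iteration the coordinate is shifted against the derivative,
/// scaled by the learning rate; we stop once the step becomes tiny.
double descend(
  double Function(double) derivative, {
  double start = 5.0,
  double learningRate = 0.1,
  double tolerance = 1e-9,
}) {
  var x = start;
  while (true) {
    final next = x - learningRate * derivative(x); // the update rule
    if ((next - x).abs() < tolerance) return next; // stopping criterion
    x = next;
  }
}
```

For example, `descend((x) => 2 * x)` minimizes $f(x) = x^2$ (whose derivative is $2x$) and returns a value close to zero.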

Our next challenge is figuring out what to do with a cost function of many arguments. Recall what the function is:

where `M` is the number of features, `i` is the index of a row in the feature matrix, and `j` is the column index.
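In the usual least-squares notation (assuming $N$ rows in the feature matrix, a weight vector $w$, and a target vector $y$; the formula below is a standard form, reconstructed rather than copied from the original), the squared-error cost is:

```latex
J(w) = \sum_{i=1}^{N} \Bigl( \sum_{j=1}^{M} x_{ij} \, w_j - y_i \Bigr)^{2}
```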

Let’s find an expression for the derivative with respect to every argument. Since our function is a composition of several parts, we must apply the chain rule. First, let’s find the derivative of the square:

Then let’s find the derivative of the inner part of the function:

And finally, let’s multiply both expressions:
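Putting the steps together in the same notation as the cost function above (again a standard reconstruction), the partial derivative of one row’s error with respect to $w_j$ is:

```latex
\frac{\partial}{\partial w_j} \Bigl( \sum_{k=1}^{M} x_{ik} w_k - y_i \Bigr)^{2}
  = \underbrace{2 \Bigl( \sum_{k=1}^{M} x_{ik} w_k - y_i \Bigr)}_{\text{derivative of the square}}
    \cdot \underbrace{x_{ij}}_{\text{derivative of the inner part}}
```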

Great, now we have a bunch of partial derivatives, so we need to organize them somehow. Here we come to the definition of the gradient vector:

The gradient is a vector consisting of partial derivatives.

The update rule stays the same; we just need to rewrite it in vector notation:

The detailed gradient looks like this:

But there are a lot of rows in the matrix; how can we calculate the gradient if every row results in its own gradient? It appears that we may just sum up the gradients of all rows:

So the rule now looks as follows:
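In vector notation (with $x_i$ denoting the $i$-th row of the feature matrix and $\eta$ the learning rate; this is the standard form of the rule, written out by me), the summed gradient and the update become:

```latex
\nabla J(w) = \sum_{i=1}^{N} 2 \, (x_i \cdot w - y_i) \, x_i,
\qquad
w_{new} = w_{current} - \eta \, \nabla J(w_{current})
```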

Let’s try to code the gradient descent procedure in Dart!

I’m going to use the ml_linalg package. The library simplifies working with vectors and matrices. It also helps us do the math efficiently, thanks to its SIMD-based implementation.

The procedure will be organized as a function. Let’s come up with its signature:
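As a sketch of what such a function might look like — using plain Dart lists instead of ml_linalg’s `Matrix` and `Vector` so the snippet stays self-contained; the name, signature, and default parameters here are illustrative assumptions, not the article’s final code:

```dart
/// Batch gradient descent for the least-squares cost.
/// [x] is the feature matrix (N rows by M columns), [y] the target vector.
/// Returns the learned weight vector of length M.
List<double> gradientDescent(
  List<List<double>> x,
  List<double> y, {
  double learningRate = 0.01,
  double tolerance = 1e-8,
  int maxIterations = 10000,
}) {
  final m = x.first.length;
  var w = List<double>.filled(m, 0.0);
  for (var iter = 0; iter < maxIterations; iter++) {
    // Gradient: sum over all rows of 2 * (x_i . w - y_i) * x_i
    final grad = List<double>.filled(m, 0.0);
    for (var i = 0; i < x.length; i++) {
      var prediction = 0.0;
      for (var j = 0; j < m; j++) {
        prediction += x[i][j] * w[j];
      }
      final error = prediction - y[i];
      for (var j = 0; j < m; j++) {
        grad[j] += 2 * error * x[i][j];
      }
    }
    // Update rule: w = w - learningRate * gradient
    final next =
        List<double>.generate(m, (j) => w[j] - learningRate * grad[j]);
    // Stopping criterion: the largest coordinate step is below the tolerance.
    var maxStep = 0.0;
    for (var j = 0; j < m; j++) {
      final step = (next[j] - w[j]).abs();
      if (step > maxStep) maxStep = step;
    }
    w = next;
    if (maxStep < tolerance) break; // converged
  }
  return w;
}
```

A version built on ml_linalg would replace the inner loops with `Matrix` and `Vector` operations, letting the library’s SIMD arithmetic do the heavy lifting.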
