
As a beginner in the field of machine learning, the very first method you will come across is linear regression. The reason for this is, in my opinion, that studying linear regression is a very comfortable way to get an introduction to machine learning.

So, let's start!

In this part, we will go through the process of linear regression and the ways in which we may solve it by using the **Least Squares Method** and the **Matrix Normal Equation**.

## What is Linear Regression?

Linear Regression is a statistical **supervised learning** technique to predict a quantitative variable by **forming a linear relationship** with one or more independent features.

In other words, linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory or independent variable, and the other is considered to be a dependent or target variable.

To define notation for future use, we will write x⁽ⁱ⁾ for the “input” variables (years of experience in this example), also called input features, and y⁽ⁱ⁾ for the “output” or target variable that we are trying to predict (income).

x₁, x₂, …, xₙ = features/independent variables

y = target variable

n = number of features in the training dataset

m = number of rows (training examples) in the training dataset

(x⁽ⁱ⁾, y⁽ⁱ⁾) = one record in the training dataset

h(x) = the hypothesis, parameterized by θ, whose output approximates y

Let’s start by talking about an example. How can we learn to predict the income of a person as a function of the years of experience and the number of educational qualifications he/she has?

**Defining Hypothesis**

To perform supervised learning, we must decide how we’re going to represent functions or hypotheses. The goal of regression modeling is to use these x independent measurements to determine a mathematical function, h(x), that describes the relationship between the input parameters and the output:

Income = h(Years of experience, Number of educational qualifications)

As an example, suppose we intend to approximate y as a linear function of x. Since the assumed relationship is linear, let’s construct the equation of this function/hypothesis.

The hypothesis is

h(x⁽ⁱ⁾) = (x⁽ⁱ⁾)ᵀθ, or equivalently θᵀx⁽ⁱ⁾

What is the best way to choose or learn the parameters from a training set?

For the training examples we have, one natural approach is to make h(x) close to y. To formalize this, we define a **cost function** that measures how far we are from the actual answer. It is called the “cost” function because it computes the total error, or cost, between our estimates and the target values.
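Concretely, for m training examples, the least-squares cost function can be written as (the standard form; some texts also scale it by 1/m):

```latex
J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
```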

The least mean squares (LMS) method is also known as the **Widrow-Hoff** learning rule. In addition to its use in machine learning, this technique has found practical application in areas as diverse as noise-cancelling headphones, blood-pressure control, and echo cancellation on long-distance phone lines. The approach determines the line that best fits the observed data by minimizing the sum of the squares of the vertical deviations between each data point and the line (the vertical deviation of a point is zero if and only if it lies exactly on the fitted line). Because the deviations are squared before being summed, positive and negative deviations cannot cancel each other out.

We want to choose θ so as to minimize J(θ). To do so, let’s use a search algorithm that begins with an “initial estimate” for θ and continually modifies θ to reduce J(θ) until, ideally, we settle on a value of θ that minimizes J(θ).

## Gradient Descent

Everyone gives the same analogy: what would you do if you were suddenly stranded at the top of Mount Everest and had to descend to the base? But what exactly is this problem in mathematical terms, and how can we solve it?

To be honest, “gradient” is a fancy word for **slope**, determined by the **partial derivatives of the cost function with respect to its parameters (θ)**. We perform a descent along the direction of the gradient; hence the name *“Gradient Descent.”*

**Although… Slope and Gradient aren’t exactly the same thing**

In general, the word “slope” applies when just two variables are considered.

The slope is the tangent or derivative of the curve formed by the function connecting the two variables. It quantifies the rate of change of the function f(x) with respect to x. Thus:

slope = ∂f/∂x

The gradient is akin to the slope, except that the function whose rate of change we want to measure depends on more than one variable, i.e. f = f(x, y) is multi-dimensional. In that case:

∇f = (∂f/∂x) î + (∂f/∂y) ĵ

Note that the gradient is a vector, whereas the slope is a scalar.

**Thus, the slope is the special case of the gradient operator acting on a single-dimensional function.**

**Explanation:** The partial derivative of the cost function tells us the direction. If the partial derivative/gradient is negative, the slope is downward and we should continue in that direction; if it is positive, the slope is upward and we should go in the opposite direction. To wrap your head around it: if the slope is negative, we want the parameter to increase, and if the slope is positive, we want the parameter to decrease, in order to reach the minimum.

**Learning Rate:**

Now that we have determined the direction we want to proceed in, we must choose the magnitude of our next step. This increment is referred to as the learning rate. We must pick it carefully in order to reach the global minimum.

If we move too quickly, we may overshoot the minimum and keep bouncing around the valley without ever reaching it. If we move too slowly, training may become too expensive to be feasible. Even when that’s not the case, very small learning rates make the algorithm more prone to getting stuck in a local minimum, something we’ll cover later in this post.

Once we have the gradient and the learning rate, we move in that direction, recalculate the gradient at the new place, and repeat the procedure.

**The Update rule:**
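With learning rate α, the gradient descent update for each parameter θⱼ is:

```latex
\theta_j := \theta_j - \alpha \, \frac{\partial}{\partial \theta_j} J(\theta)
```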

If the slope is negative, the second term becomes positive (because of the minus sign), so the parameter’s value rises; theta therefore shifts to the right. If the slope is positive, theta drops and moves slightly to the left. When the slope is zero, the second term is zero and theta does not change.

Now let’s solve the partial derivative of the cost function and substitute it in the update rule.

By substituting eq(2) in eq(1), we simplify the update rule to the below update equation.
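For a single training example (x⁽ⁱ⁾, y⁽ⁱ⁾), the partial derivative and the resulting LMS update work out to:

```latex
\frac{\partial}{\partial \theta_j} J(\theta) = \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\quad\Longrightarrow\quad
\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}
```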

We keep repeating the update step over the training examples (m training examples) until the cost function settles at a stable minimum value.

**Batch Gradient Descent:** This is simply gradient descent on the original cost function J(θ). At every step, the method examines each sample in the entire training set; hence the name batch gradient descent. Batch gradient descent must scan the whole training set before executing a single step, which is an expensive procedure if m is large.

**Stochastic Gradient Descent (also known as incremental gradient descent):** We repeatedly traverse the training set, and each time we encounter a training example, we update the parameters using the update rule with respect to that single example. This algorithm is stochastic gradient descent.

Typically, stochastic gradient descent approaches the minimum significantly more rapidly than batch gradient descent. Due to these factors, stochastic gradient descent is often preferable to batch gradient descent, particularly when the training set is large.
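The two variants can be sketched in plain NumPy. This is a minimal illustration, not a production implementation: the dataset is made up (a noiseless line y = 2 + 3x with an intercept column), and the learning rates and epoch counts are arbitrary choices that happen to converge here.

```python
import numpy as np

def batch_gd(X, y, lr=0.5, epochs=1000):
    """Batch gradient descent: every step uses all m training examples."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        grad = X.T @ (X @ theta - y) / m  # gradient of J over the full batch
        theta -= lr * grad
    return theta

def stochastic_gd(X, y, lr=0.05, epochs=200, seed=0):
    """Stochastic gradient descent: update after every single example."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):  # visit the examples in random order
            err = X[i] @ theta - y[i]
            theta -= lr * err * X[i]  # LMS update for one example
    return theta

# Toy data: y = 2 + 3x, with a column of ones for the intercept term.
x = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 3.0 * x

theta_batch = batch_gd(X, y)
theta_sgd = stochastic_gd(X, y)
print(theta_batch, theta_sgd)  # both close to [2, 3]
```

Note how batch GD touches all 50 examples before making one parameter update, while SGD makes 50 small updates per pass.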

**Local Minima or Global Minima ??**

Okay, it would seem that the story of gradient descent has been a fairly happy one up to this point. Well, let me spoil that for you. In gradient descent, the initialization of the parameters is random, so we might end up starting our theta at a point from which it can never reach the global minimum: it can get stuck in a local minimum. Scary, right?!

The intuition behind gradient descent always finding the global minimum for linear regression is that **the cost function of linear regression is a convex quadratic function. When minimizing a convex function, any minimum that exists is guaranteed to be the global minimum.** Since the linear least-squares cost is such a quadratic, we know it is convex.
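One way to verify the convexity directly: writing the cost in matrix form, its Hessian is XᵀX, which is positive semi-definite for any design matrix X, and a function whose Hessian is positive semi-definite everywhere is convex:

```latex
J(\theta) = \tfrac{1}{2} \| X\theta - y \|^2
\quad\Longrightarrow\quad
\nabla^2 J(\theta) = X^\top X \succeq 0
```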

Therefore, we may be certain that gradient descent will converge to the correct minimizer here. If the function we were attempting to minimize were non-convex, gradient descent might converge to a local minimum rather than the global minimum. Non-convex surfaces are hence much more challenging. This matters because several machine learning models, including neural networks, have non-convex cost functions.

Gradient descent gives one way of minimizing J(θ). Let’s talk about an alternative approach, this time performing the minimization explicitly and avoiding the iterative algorithm. In this method, we will minimize J(θ) by explicitly taking its derivatives with respect to the θⱼ’s and setting them to zero. Working in matrix notation lets us do this without pages of algebra and long tables of derivatives.

Given a training set, define the design matrix X to be the m×n matrix that contains the training examples’ input values in its rows. Let y be the m-dimensional vector of all the training set’s target values.

Now, since the hypothesis is h(x⁽ⁱ⁾) = (x⁽ⁱ⁾)ᵀθ, we can easily verify that:
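In this notation, the least-squares cost function can be written compactly as:

```latex
J(\theta) = \frac{1}{2} \left( X\theta - y \right)^\top \left( X\theta - y \right) = \frac{1}{2} \| X\theta - y \|^2
```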

Finally, to minimize J(θ), let’s find its derivative with respect to θ and equate it to zero. To minimize J(θ), we set its derivatives to zero and obtain the normal equations:
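Differentiating the matrix form of the cost with respect to θ gives:

```latex
\nabla_\theta J(\theta) = X^\top X \theta - X^\top y
```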

Setting the above equation to Zero, we will get:
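Assuming XᵀX is invertible:

```latex
X^\top X \, \theta = X^\top y
\quad\Longrightarrow\quad
\theta = \left( X^\top X \right)^{-1} X^\top y
```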

Thus, the value of θ that minimizes J(θ) is given in closed form by the equation. Now, we don’t have to do an iterative update to the parameters but rather a single-step calculation to solve for θ.
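A minimal NumPy sketch of this single-step calculation, on made-up data modeled after the income example (the true coefficients — intercept 30, 2.5 per year of experience, 5 per qualification — are invented for illustration):

```python
import numpy as np

# Synthetic income data: 100 people, two features, small Gaussian noise.
rng = np.random.default_rng(42)
m = 100
experience = rng.uniform(0, 20, m)               # years of experience
education = rng.integers(1, 5, m).astype(float)  # number of qualifications
income = 30 + 2.5 * experience + 5.0 * education + rng.normal(0, 1, m)

# Design matrix: a column of ones for the intercept, then the features.
X = np.column_stack([np.ones(m), experience, education])
y = income

# Normal equation: solve (X^T X) theta = X^T y.
# np.linalg.solve is numerically preferable to forming the inverse explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # close to [30, 2.5, 5.0]
```

No learning rate and no iteration: one linear solve recovers θ.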

That said, the normal equation method has both advantages and disadvantages.

**Advantages:**

1) The normal equation approach does not need iteration, and there is no issue with convergence failure.

2) The normal equation approach requires no choice of learning rate α.

**Disadvantages:**

1) If the dimension of the feature vector is n, computing the matrix inversion has complexity O(n³). Hence, when n is large, the normal equation method is slow.

For example, when the number of features n is small, say n < 1,000, it is faster to use the normal equation to solve for θ. But when n > 10,000, or even n > 10⁶, it is time to switch from the normal equation to the gradient method.
