The following plot shows the log loss when *y* = 1:

The log loss equals to 0 only in case of a perfect prediction (*p* = 1 and *y* = 1, or *p* = 0 and *y* = 0), and approaches infinity as the prediction gets worse (i.e., when *y* = 1 and *p* → 0 or *y* = 0 and *p* → 1).

The** cost function **calculates the average loss over the whole data set:

The cost function can be written in a vectorized form as follows:

where **y** = (*y*₁, …, *yₙ*) is a vector that contains all the labels of the training samples, and **p** = (*p*₁, …, *pₙ*) is a vector that contains all the predicted probabilities of the model for all the training samples.

This cost function is convex, i.e., it has a single global minimum. However, there is no closed-form solution for finding the optimal **w*** (due to the non-linearities introduced by the log function). Therefore, we need to use iterative optimization methods such as gradient descent in order to find the minimum.

Gradient descent is an iterative approach for finding a minimum of a function, where we take small steps in the opposite direction of the gradient in order to get closer to the minimum:

In order to use gradient descent to find the minimum of the least squares cost, we need to compute the partial derivatives of *J*(**w**) with respect to each one of the weights.

The partial derivative of *J*(**w**) with respect to any of the weights *wⱼ* is:

**Proof**:

The gradient vector can be written in vectorized form as follows:

And the gradient descent update rule is:

where *α* is a learning rate that controls the step size (0 < *α *< 1).

Note that whenever you use gradient descent, you must make sure that your data set is **normalized **(otherwise gradient descent may take steps of different sizes in different directions, which will make it unstable).

We will now implement the logistic regression model in Python from scratch, including its cost function and gradient computation, optimizing the model using gradient descent, evaluation of the model, and plotting the final decision boundary.

For the demonstration we will use the Iris data set (BSD license). The original data set contains 150 samples of Iris flowers that belong to one of three species (Setosa, Versicolor and Virginica). We will make it into a binary classification problem by using only the first two types of flowers (Setosa and Versicolor). In addition, we will use only first two features of each flower (sepal width and sepal length).

## Loading the Data Set

Let’s first import the required libraries and fix the random seed in order to get reproducible results:

`import numpy as np`

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as snsnp.random.seed(0)

Next, we load the data set:

`from sklearn.datasets import load_iris`iris = load_iris()

X = iris.data[:, :2] # Take only the first two features

y = iris.target

# Take only the setosa and versicolor flowers

X = X[(y == 0) | (y == 1)]

y = y[(y == 0) | (y == 1)]

Let’s plot the data:

`def plot_data(X, y):`

sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=iris.target_names[y], style=iris.target_names[y],

palette=['r','b'], markers=('s','o'), edgecolor='k')

plt.xlabel(iris.feature_names[0])

plt.ylabel(iris.feature_names[1])

plt.legend()

`plot_data(X, y)`

As can be seen, the data set is linearly separable, therefore logistic regression should be able to find the boundary between the two classes.

Next, we need to add a column of ones to the features matrix *X* in order to represent the bias (*w*₀):

`# Add a column for the bias`

n = X.shape[0]

X_with_bias = np.hstack((np.ones((n, 1)), X))

We now split the data set into training and test sets:

`from sklearn.model_selection import train_test_split`X_train, X_test, y_train, y_test = train_test_split(X_with_bias, y, random_state=0)

## Model Implementation

We are now ready to implement the logistic regression model. We start by defining a helper function to compute the sigmoid function:

`def sigmoid(z):`

""" Compute the sigmoid of z (z can be a scalar or a vector). """

z = np.array(z)

return 1 / (1 + np.exp(-z))

Next, we implement the cost function that returns the cost of a logistic regression model with parameters **w** on a given data set (*X*, **y**), and also its gradient with respect to **w**.

`def cost_function(X, y, w):`

""" J, grad = cost_function(X, y, w) computes the cost of a logistic regression model

with parameters w and the gradient of the cost w.r.t. to the parameters. """

# Compute the cost

p = sigmoid(X @ w)

J = -(1/n) * (y @ np.log(p) + (1-y) @ np.log(1-p)) # Compute the gradient

grad = (1/n) * X.T @ (p - y)

return J, grad

Note that we are using the vectorized forms of the cost and the gradient functions that have been shown previously.

To sanity check this function, let’s compute the cost and gradient of the model on some random weight vector:

`w = np.random.rand(X_train.shape[1])`

cost, grad = cost_function(X_train, y_train, w)print('w:', w)

print('Cost at w:', cost)

print('Gradient at w (zeros):', grad)

The output we get is:

`w: [0.5488135 0.71518937 0.60276338]`

Cost at w: 2.314505839067951

Gradient at w (zeros): [0.36855061 1.86634895 1.27264487]

## Gradient Descent Implementation

We now implement gradient descent in order to find the optimal **w*** that minimizes the cost function of the model on a given training set. The algorithm will run at most *max_iter* passes over the training set (defaults to 5000), unless the cost has not decreased by at least *tol* (defaults to 0.0001) since the previous iteration, in which case the training stops immediately.

`def optimize_model(X, y, alpha=0.01, max_iter=5000, tol=0.0001):`

""" Optimize the model using gradient descent.

X, y: The training set

alpha: The learning rate

max_iter: The maximum number of passes over the training set (epochs)

tol: The stopping criterion. Training will stop when (new_cost > cost - tol)

"""

w = np.random.rand(X.shape[1])

cost, grad = cost_function(X, y, w)for i in range(max_iter + 1):

w = w - alpha * grad

new_cost, grad = cost_function(X, y, w)

if new_cost > cost - tol:

print(f'Converged after {i} iterations')

return w, new_cost

cost = new_cost

print('Maximum number of iterations reached')

return w, cost

Normally at this point you would have to normalize your data set, since gradient descent does not work well with features that have different scales. In our specific data set normalization is not necessary since the ranges of the two features are similar.

Let’s now call this function to optimize our model:

`opt_w, cost = optimize_model(X_train, y_train)`print('opt_w:', opt_w)

print('Cost at opt_w:', cost)

The algorithm converges after 1,413 iterations and the optimal **w*** we get is:

`Converged after 1413 iterations`

opt_w: [ 0.28014029 0.80541854 -1.48367938]

Cost at opt_w: 0.28389717767222555

There are other optimizers you can use which are often faster than gradient descent, such as conjugate gradient (CG) and truncated Newton (TNC). See scipy.optimize.minimize for more details on how to use these optimizers.

## Using the Model for Predictions

Now that we have found the optimal parameters of the model, we can use it for predictions.

First, let’s write a function that gets a matrix of new samples *X* and returns their probabilities of belonging to the positive class:

`def predict_prob(X, w):`

""" Return the probability that samples in X belong to the positive class

X: the feature matrix (every row in X represents one sample)

w: the learned logistic regression parameters

"""

p = sigmoid(X @ w)

return p

The function computes the predictions of the model by simply taking the sigmoid of *Xᵗ***w **(which computes *σ*(**w***ᵗ***x**) for every row **x** in the matrix).

For example, let’s find out the probability that a sample located at (6, 2) belongs to the versicolor class:

`predict_prob([[1, 6, 2]], opt_w)`

`array([0.89522808])`

This sample has 89.52% chance of being a versicolor flower. This makes sense since this sample is located well within the area of the versicolor flowers far from the border between the classes.

On the other hand, the probability that a sample located at (5.5, 3) belongs to the versicolor class is:

`predict_prob([[1, 5.5, 3]], opt_w)`

`array([0.56436688])`

This time the probability is much lower (only 56.44%), since this sample is close to the border between the classes.