
A brief overview of the main metrics used to evaluate a machine learning model

Nowadays, it is common to see machine learning applications running on the devices we use. A banking app may use machine learning to detect possible credit card fraud, while a healthcare app may predict whether a user is likely to develop diabetes based on their exam results. When we use these applications, we do not want an inaccurate model that flags credit card fraud that does not exist or predicts diabetes in a healthy person.

Based on the considerations above, it is clear that building accurate machine learning models is very important to avoid misunderstandings and costly errors. Therefore, one of the most important steps in the pipeline for creating a new model is evaluation, which uses different techniques and metrics to assess whether a model is accurate.

The goal of this article is to present the main metrics machine learning engineers use to evaluate whether a model is good enough to be used in the real world.

Let’s start with the *confusion matrix*, a very useful tool that makes it easier to analyze a model’s performance. Before defining the confusion matrix, we need to define a few terms. For easier understanding, consider a model that classifies an image as cat or not cat.

## True Positive

A *TP* (True Positive) occurs when the model *correctly* predicts the *positive* class [1].

## True Negative

A *TN* (True Negative) occurs when the model *correctly* predicts the *negative* class.

## False Positive

An *FP* (False Positive) occurs when the model *wrongly* predicts the *positive* class.

## False Negative

An *FN* (False Negative) occurs when the model *wrongly* predicts the *negative* class.

Considering the examples above, we can now better understand the so-called “confusion matrix”. A *confusion matrix* is an NxN matrix that contains the correct and incorrect model classifications, where N is the number of classes. One axis of a confusion matrix is the label that the model predicted, and the other axis is the ground truth (correct answer) [2]. Let’s see an example:

Now, let’s see a *confusion matrix* filled in and do a little analysis:

Some information we can get from Image 5:

- The **TP** was 26, which means that for 26 “Cat” images given as input, the model predicted “Cat” correctly.
- The **TN** was 59, which means that for 59 “Not Cat” images given as input, the model predicted “Not Cat” correctly.
- The **FP** was 4, which means that for 4 “Not Cat” images given as input, the model predicted “Cat” incorrectly.
- The **FN** was 5, which means that for 5 “Cat” images given as input, the model predicted “Not Cat” incorrectly.
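As a sketch of how these four counts map onto a confusion matrix in practice, scikit-learn’s `confusion_matrix` can compute them from a handful of toy labels (the labels below are illustrative, with 1 standing for “Cat” and 0 for “Not Cat”):

```python
from sklearn.metrics import confusion_matrix

# Toy ground-truth and predicted labels (1 = "Cat", 0 = "Not Cat")
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

# With labels=[1, 0], the first row/column is the positive class:
# rows are the true class, columns are the predicted class
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn = cm[0]
fp, tn = cm[1]
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")  # TP=3 FN=1 FP=1 TN=3
```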

To define the metrics in this section, we have to remember what TP, TN, FP, and FN are. If you do not remember or skipped the last section, please read it again.

## Accuracy

Accuracy is the fraction of predictions our model got right [3]. We can compute this metric as follows:

accuracy = (number of correct predictions) / (total number of predictions)

We can compute the accuracy using the metrics we learned previously, because the number of correct predictions is TP plus TN, and the total number of predictions is the sum of all predictions made. In this case:

accuracy = (TP+TN)/(TP+TN+FP+FN)

Let’s compute the accuracy for the example of Image 5:

accuracy = (26+59)/(26+59+4+5) = 85/94 = 0.9042 = 90.42%

It means that our model got 90.42% of the test predictions right. However, we need to be careful when working with imbalanced datasets, because accuracy alone can lead to wrong conclusions about the model. Therefore, we need to use more than one metric to evaluate a model’s reliability.
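The same computation can be sketched in plain Python using the four counts from Image 5:

```python
# Confusion-matrix counts from Image 5
tp, tn, fp, fn = 26, 59, 4, 5

# Accuracy = correct predictions / all predictions
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 85/94
```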

## Precision

The *precision* is the fraction of correct positive predictions among all positive predictions. In math terms:

precision = TP/(TP+FP)

Using the example of Image 5:

precision = 26/(26+4) = 0.8667 = 86.67%

It means that when our model predicts the class “Cat”, it is right in 86.67% of cases.

## Recall

The *recall* is the fraction of correct positive predictions among all examples that are *actually* positive. In math terms:

recall = TP/(TP+FN)

Using the example of Image 5:

recall = 26/(26+5) = 0.8387 = 83.87%

It means that our model predicts the class “Cat” correctly 83.87% of the time when it evaluates a picture that really is a “Cat”.

## f1-score

It is the harmonic mean of the precision and recall of the test; with this metric you can track a single value instead of two:

f1-score = (2*precision*recall)/(precision+recall)

Using the example of Image 5:

f1-score = (2*0.8667*0.8387)/(0.8667+0.8387) = 0.8524
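The three metrics above can be checked with a few lines of Python, again using the counts from Image 5:

```python
# Confusion-matrix counts from Image 5
tp, fp, fn = 26, 4, 5

precision = tp / (tp + fp)  # 26/30
recall = tp / (tp + fn)     # 26/31

# f1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```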

A very important topic to cover here is the *ROC curve* (receiver operating characteristic curve), a graph showing the performance of a classification model as we vary the classification threshold [4]. Before showing the graph, we need to define two more metrics:

## True Positive Rate

The TPR (True Positive Rate) is a synonym of *recall*, therefore:

TPR = recall = TP/(TP+FN)

If you have questions, please go back to the section “Recall”.

## False Positive Rate

The FPR (False Positive Rate) is the fraction of wrong positive predictions among all examples that are *actually* negative. In math terms:

FPR = FP/(FP+TN)

## ROC

Now we can talk about the *ROC curve* again. This curve plots TPR vs. FPR at different classification thresholds [4]. The following figure shows a typical ROC curve:
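As a sketch of how the curve points are obtained in practice, scikit-learn’s `roc_curve` and `roc_auc_score` compute TPR/FPR pairs and the area under the curve from predicted probabilities; the synthetic dataset below is purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem, just for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# roc_curve needs scores/probabilities, not hard class predictions
scores = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)  # one (FPR, TPR) pair per threshold
auc = roc_auc_score(y_test, scores)
print(f"AUC: {auc:.3f}")
```

Plotting `tpr` against `fpr` reproduces the kind of figure shown above.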

As an example, we are going to build a classification model using the scikit-learn framework and the *digits* dataset. First of all, let’s create a model that classifies an image of a digit.

```python
# Importing the libraries and the dataset
import numpy as np
import pandas as pd
from sklearn import datasets

digits = datasets.load_digits()

# Let's split the dataset into train and test
from sklearn.model_selection import train_test_split

x = digits.data
Y = digits.target
x_train, x_test, Y_train, Y_test = train_test_split(x, Y, test_size=0.3, random_state=50)

# Creating the model
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=10000).fit(x_train, Y_train)
y_pred = model.predict(x_test)
```

To compute the metrics with scikit-learn, we can use the following code:

```python
# Accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(Y_test, y_pred)

# Precision
from sklearn.metrics import precision_score
precision = precision_score(Y_test, y_pred, average='micro')

# Recall
from sklearn.metrics import recall_score
recall = recall_score(Y_test, y_pred, average='micro')

# f1-score
from sklearn.metrics import f1_score
f1 = f1_score(Y_test, y_pred, average='micro')

print(f'Accuracy: {accuracy}\nPrecision: {precision}\nRecall: {recall}\nf1-score: {f1}')
```
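scikit-learn can also produce the full confusion matrix and a per-class summary in one call; the sketch below rebuilds the same digits model so the snippet runs on its own:

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
x_train, x_test, Y_train, Y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=50
)
model = LogisticRegression(max_iter=10000).fit(x_train, Y_train)
y_pred = model.predict(x_test)

# 10x10 matrix: one row/column per digit class
cm = confusion_matrix(Y_test, y_pred)
print(cm)

# Per-class precision, recall, and f1-score
print(classification_report(Y_test, y_pred))
```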

As we saw, it is very important to evaluate the metrics of a trained model, and this can be done easily using a machine learning framework.

[2] https://developers.google.com/machine-learning/glossary#confusion_matrix

[3] https://developers.google.com/machine-learning/crash-course/classification/accuracy

[4] https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
