In this article, I’ll show you when to use which of the above metrics.
Before we begin, you might think that the accuracy metric is missing here. I have already written an article about the accuracy metric and when it definitely should not be used; feel free to check it out.
Personally, I believe that a real-world scenario helps to better understand the relevance of a metric, so let’s create one. We will assume that we are training a model to predict breast cancer. To make things a bit more realistic, here are some negative and positive samples (although we won’t use them in this article):
Suppose that our model in the test set achieves the following results:
If this tabular presentation of classification results is new to you, you may want to take a look at the Confusion Matrix, which can also be helpful in evaluating the performance of a model.
The table shows that two of the positive samples were correctly predicted as positive (tp = true positives) and seven of the positive samples were incorrectly predicted as negative (fn = false negatives). In addition, one negative sample was predicted as positive (fp = false positives), while 9,990 negative samples were predicted as negative (tn = true negatives).
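To keep the four counts straight, here is a minimal Python sketch that stores them and derives the totals we will need later. The variable names are only illustrative; the numbers are the ones from the table.

```python
# Counts as described in the text (variable names are illustrative).
tp, fn = 2, 7      # actual positives: predicted positive / predicted negative
fp, tn = 1, 9_990  # actual negatives: predicted positive / predicted negative

actual_positives = tp + fn      # all samples that really are positive
actual_negatives = fp + tn      # all samples that really are negative
predicted_positives = tp + fp   # all samples the model flagged as positive
total = tp + fn + fp + tn       # size of the test set

print(actual_positives, actual_negatives, predicted_positives, total)
# prints: 9 9991 3 10000
```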
What do you think, is this model suitable for use in practice?
I bet your answer is no. But if we consider only the accuracy, which is 99.92% ((2 + 9,990) / 10,000), our model appears to perform very well. (If you want to learn more about why accuracy is inappropriate in this scenario, please read my article about it.) This is where other metrics like precision, recall and F1 score are helpful. So let’s look at them one by one.
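As a quick sanity check on the accuracy figure quoted above, the following small Python sketch recomputes it from the four counts. The variable names are illustrative.

```python
# Confusion-matrix counts from the test set described above.
tp, fn, fp, tn = 2, 7, 1, 9_990

total = tp + fn + fp + tn
accuracy = (tp + tn) / total  # share of all predictions that were correct

print(f"accuracy: {accuracy:.2%}")
# prints: accuracy: 99.92%
```

Note that the 9,990 true negatives dominate the numerator, which is why the 7 missed cancer cases barely affect the result.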
Since we have trained a model to predict breast cancer, we obviously want to evaluate the value of the model’s predictions. One way to do this is to examine how many of the positively predicted breast cancer cases were actually positive. The resulting ratio is called the precision of the model. Formally, precision is defined as follows:
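In the usual notation, using the abbreviations tp and fp introduced above, the definition reads:

$$
\text{Precision} = \frac{tp}{tp + fp}
$$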
If you want, you can also think of the precision as the accuracy of the positive breast cancer predictions, which gives the following formula:
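Written out in words (the wording of the terms is my own paraphrase of the standard definition):

$$
\text{Precision} = \frac{\text{number of correctly predicted positive samples}}{\text{number of all samples predicted as positive}}
$$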
Another important measure for evaluating the performance of the model is how many of the 9 actual positive cases in the test set were predicted to be positive. Whereas precision measures the quality of the model’s positive predictions, recall measures how extensively the model covers or detects all actual positive cases. Formally, recall is defined as follows:
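In the usual notation, using the abbreviations tp and fn introduced above, the definition reads:

$$
\text{Recall} = \frac{tp}{tp + fn}
$$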
When I saw the definition above for the first time, I got confused by all the true positives, false negatives, etc. If you feel the same way, I hope the following definition of recall helps you understand it:
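Written out in words (again, the wording of the terms is my own paraphrase of the standard definition):

$$
\text{Recall} = \frac{\text{number of correctly predicted positive samples}}{\text{number of all actual positive samples}}
$$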
To better understand the difference between the two metrics, we will substitute both formulas with the results of our model on the test set:
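The substitution can be sketched in a few lines of Python. This assumes the standard definitions precision = tp / (tp + fp) and recall = tp / (tp + fn); the variable names are illustrative.

```python
# Counts from the test set: true positives, false negatives, false positives.
tp, fn, fp = 2, 7, 1

precision = tp / (tp + fp)  # correct positive predictions / all positive predictions
recall = tp / (tp + fn)     # correct positive predictions / all actual positives

print(f"precision = {tp}/{tp + fp} = {precision:.2f}")
print(f"recall    = {tp}/{tp + fn} = {recall:.2f}")
# prints:
# precision = 2/3 = 0.67
# recall    = 2/9 = 0.22
```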
As we can see, the precision of our model is much better than the recall. This is because 2 out of 3 positive predictions of the model were correct. On the other hand, out of the 9 positive samples, our model predicted only 2 as positive, resulting in low recall.
Considering the breast cancer classification, the two metrics can be interpreted as follows: We are quite confident that if our model predicts breast cancer, it will actually be correct (Precision). But we are not confident that the model will predict all actual breast cancer cases (Recall).
Ideally, we train a model that achieves both good precision and good recall. However, it is also possible for a model to have, for example, perfect recall and low precision (which would happen in our scenario if the model made a positive prediction every time; consider why) and vice versa. Thanks to the F-Score, we can evaluate both precision and recall in a single metric:
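In the commonly used notation, with P for precision, R for recall and a positive weight β (these symbols are a notational choice on my part), the F-score is usually written as:

$$
F_\beta = \frac{(1 + \beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}
$$

For β = 1 this reduces to the harmonic mean of precision and recall, $F_1 = \frac{2 \cdot P \cdot R}{P + R}$.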
Usually, beta is set to 1, which means that precision and recall have the same influence on the metric. For this reason, we often implicitly talk about the F1 score instead of the F score. However, you can choose a different value for beta depending on your use case. For example, if you want the recall to have half the impact compared to the precision, set beta to 0.5. Conversely, the recall can be twice as impactful compared to the precision if beta is set to 2.
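To see the effect of beta on our example, here is a minimal sketch. The function name `f_beta` and the zero-handling convention are my own choices, and it assumes the usual F-beta definition given in the docstring.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0  # convention chosen here to avoid dividing by zero
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall of the breast cancer model from above.
precision, recall = 2 / 3, 2 / 9

for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: F = {f_beta(precision, recall, beta):.3f}")
```

With these inputs the score is roughly 0.476 for beta = 0.5, 0.333 for beta = 1 and 0.256 for beta = 2: the more weight recall receives, the more the low recall pulls the score down.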
As can be seen in the next visualisation, the F1-Score becomes high when both Recall and Precision are high. If only one of the metrics has a high value and the other a low value, the F1-Score is also low.
Or to put it in Nancy Chinchor’s words:
The F-measure is higher if the values of recall and precision are more towards the center of the precision-recall graph than at the extremes and their sums are the same. […] a system which has recall of 50% and precision of 50% has a higher F-measure than a system which has recall of 20% and precision of 80%. This behavior is exactly what we want from a single measure.
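Chinchor’s two example systems can be checked numerically. The sketch below assumes beta = 1, i.e. the F1 score as the harmonic mean of precision and recall; the helper name `f1` is illustrative.

```python
def f1(precision: float, recall: float) -> float:
    """F1 score: harmonic mean of precision and recall (both assumed > 0)."""
    return 2 * precision * recall / (precision + recall)


balanced = f1(precision=0.5, recall=0.5)  # both values in the middle
skewed = f1(precision=0.8, recall=0.2)    # same sum, but at the extremes

print(f"balanced: {balanced:.2f}")
print(f"skewed:   {skewed:.2f}")
# prints:
# balanced: 0.50
# skewed:   0.32
```

Both systems have the same sum of precision and recall (1.0), yet the balanced one scores noticeably higher.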
As a last step, let’s calculate the F1 Score for our cancer classification task:
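Plugging in the precision (2/3) and recall (2/9) of our model, and assuming beta = 1, the calculation works out as follows:

$$
F_1 = \frac{2 \cdot P \cdot R}{P + R} = \frac{2 \cdot \frac{2}{3} \cdot \frac{2}{9}}{\frac{2}{3} + \frac{2}{9}} = \frac{\frac{8}{27}}{\frac{8}{9}} = \frac{1}{3} \approx 0.33
$$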
We can see that the F1-Score of our model is quite low. Moreover, the F1-Score is closer to the recall value (0.22) than to the precision value (0.67). We have already seen this behaviour of the F1-Score in the visualisation above.
Let’s wrap things up.
In general, precision, recall, and F-score measure the quality of the predicted positives of a classification system. In contrast, other metrics such as accuracy measure the quality of all predicted classes (both positives and negatives). In addition, we have seen that precision and recall are two different measures that allow us to examine different aspects of the positive predictions of a classification system. Finally, the F-score provides a way to combine recall and precision in a single metric. Such a single number is often desired, which is why the F1-score is so popular.