When it comes to evaluating the performance of a binary classification model, precision and recall are two important metrics that are commonly used.
There are many great posts on the topic of Precision and Recall. This post will dive a little deeper into these metrics, along with the related metrics F-score and Average Precision, and illustrate some visual ways to think about them.
The formulas for calculating precision and recall are the following:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)
where True/False refers to whether the model guessed the classification correctly or not and Positives/Negatives refers to the binary value (either 0 or 1) we are trying to predict.
A nice graphic on the wikipedia page gives a visualization of this formula.
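To make the formulas concrete, here is a small, purely illustrative sketch (the labels and predictions below are made up) that counts true positives, false positives, and false negatives directly:

```python
# Toy labels and predictions (hypothetical, for illustration only)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0]

# Count true positives, false positives, false negatives
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)  # TP / (TP + FP)
recall = tp / (tp + fn)     # TP / (TP + FN)
print(precision, recall)    # 0.75 0.75
```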
Only knowing one without the other is not very helpful.
- A simple model that guesses 1 for every data point will have the highest possible recall score of 1, no matter how many false positives it makes
- A model that guesses 1 only for the single data point it is most confident about can achieve a precision of 1 while its recall is close to 0
Typically people use the F-score or Average Precision, which incorporate both Precision and Recall, to evaluate a model’s performance. In an imbalanced classification problem we will want to weight Precision and Recall accordingly:
- If returning a false positive is much more consequential than a false negative (for instance a model that detects whether a user is a bot) we will want to weight Precision much higher
- If returning a false negative is much more consequential than a false positive (for instance a model that detects cancer) we will want to weight Recall much higher
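One common way to encode this weighting is the F-beta score (discussed further below). Here is a quick sketch, using made-up precision and recall values, of how beta shifts the balance:

```python
def f_beta(precision, recall, beta):
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Hypothetical scores: high precision, low recall
p, r = 0.9, 0.5

# beta > 1 weights Recall more heavily, beta < 1 weights Precision more heavily
print(f_beta(p, r, beta=2))    # pulled toward the (lower) recall
print(f_beta(p, r, beta=0.5))  # pulled toward the (higher) precision
```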
Now I’ll discuss how to visualize Precision and Recall.
Let us consider the following situation. We are working with a binary classification problem and have just trained a model that outputs a confidence value between 0 (label 0) and 1 (label 1). We wish to use precision and recall to assess the performance of this model.
Here is the data with the model predictions and actual labels.
The first thing we do is sort our data by model prediction values.
Here is a bar chart of the predicted values of the sorted data
Now we are ready to visualize Precision and Recall. To do this we choose a threshold value such that if the model prediction is above the threshold our prediction is 1, otherwise the prediction is 0.
Here is a visualization of our predictions with a threshold = .5.
Yellow row indices indicate data above the threshold, i.e. our prediction is 1. Green rows mean our prediction is correct while red rows mean our prediction is incorrect. Red rows above the threshold are type 1 errors (false positives) while red rows below the threshold are type 2 errors (false negatives).
Note that if we choose a threshold of 0 then our model is just predicting 1 for every data point and a threshold of 1 would mean our model is always predicting 0.
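A minimal sketch of this thresholding step, using made-up scores and labels rather than the article’s actual table:

```python
# Hypothetical sorted model scores and their true labels
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

def predict(scores, threshold):
    # Predict 1 when the score is above the threshold, else 0
    return [1 if s > threshold else 0 for s in scores]

print(predict(scores, 0.5))  # threshold of .5
print(predict(scores, 0.0))  # every point predicted 1
print(predict(scores, 1.0))  # every point predicted 0
```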
So how does Precision and Recall show up in this image? Here are the formulas in terms of the different colored rows.
What happens is that as we slide the threshold value to lower or higher values, it swaps the color of each row it passes, red to green and green to red. This affects the precision and recall scores. Here we slide the threshold from .4 to .8:
Average Precision. Area under the Precision-Recall curve.
We can plot Precision and Recall as functions of our threshold and we get the following blocky looking plot. Each vertical jump corresponds to when a row was flipped in color while the flat portions are when changing the threshold did not flip a row.
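This sweep can be sketched in a few lines of Python (again with made-up scores and labels). Each threshold yields one (precision, recall) pair, and the pair only changes when the threshold crosses one of the scores, which is exactly why the plot is blocky:

```python
# Hypothetical sorted model scores and their true labels
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30]
labels = [1,    1,    0,    1,    0,    1]

def pr_at(threshold):
    # Precision and recall for predictions at a given threshold
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(p and t for p, t in zip(preds, labels))
    fp = sum(p and not t for p, t in zip(preds, labels))
    fn = sum((not p) and t for p, t in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Sliding the threshold only changes the result when it crosses a score
for thr in [0.0, 0.4, 0.5, 0.65, 0.8]:
    print(thr, pr_at(thr))
```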
The next insight is that we can cut out the middle-man, the threshold values, and just plot the (Recall, Precision) pairs that show up. Now we are looking at the scatterplot version of the Precision-Recall plots found on scikit-learn’s page.
Here is the corresponding Precision-Recall plot using scikit-learn’s PrecisionRecallDisplay. Note that instead of connecting the dots with straight lines, it draws these box (step) connections. The area under this curve is the Average Precision.
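Average Precision can also be computed by hand as the step-wise sum of (R_n − R_{n−1}) · P_n over the sorted scores, which matches the definition scikit-learn’s average_precision_score uses. A sketch with made-up data:

```python
# Hypothetical model scores and true labels
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30]
labels = [1,    1,    0,    1,    0,    1]

n_pos = sum(labels)  # total positives in the data
tp = fp = 0
ap = 0.0
prev_recall = 0.0
# Walk the data from highest score to lowest, accumulating the step areas
for s, y in sorted(zip(scores, labels), reverse=True):
    if y == 1:
        tp += 1
    else:
        fp += 1
    precision = tp / (tp + fp)
    recall = tp / n_pos
    ap += (recall - prev_recall) * precision  # (R_n - R_{n-1}) * P_n
    prev_recall = recall
print(ap)
```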
F-score. Why Use Harmonic Mean?
The F-score is the harmonic mean of Precision and Recall:

F1 = 2 · Precision · Recall / (Precision + Recall)

This simplifies in terms of the colored row counts above as

F1 = 2 · TP / (2 · TP + FP + FN)
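A quick numeric check (with arbitrary counts) that the harmonic-mean form and the row-count form agree:

```python
# Arbitrary counts of true positives, false positives, false negatives
tp, fp, fn = 3, 1, 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1_harmonic = 2 * precision * recall / (precision + recall)
# Same quantity written directly in row counts
f1_counts = 2 * tp / (2 * tp + fp + fn)
print(f1_harmonic, f1_counts)  # 0.75 0.75
```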
This stackoverflow post has some very good explanations as to why we use the harmonic mean. Reasons such as
- It punishes extreme values (e.g. Precision = 0 or Recall = 0)
- Algebraic compatibility due to matching numerators of Precision and Recall
A visual explanation is that harmonic means show up when taking average slopes or rates of change. It is the average rate obtained when travelling the same distance at two different rates.
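For example, travelling two legs of equal distance at different speeds: the average speed over the whole trip works out to exactly the harmonic mean of the two leg speeds (numbers below are arbitrary):

```python
# Two legs of equal (hypothetical) distance at two different rates
d = 60.0
r1, r2 = 30.0, 60.0

total_time = d / r1 + d / r2      # time spent on each leg
avg_rate = 2 * d / total_time     # total distance over total time

harmonic_mean = 2 * r1 * r2 / (r1 + r2)
print(avg_rate, harmonic_mean)    # 40.0 40.0
```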
In our situation, Precision and Recall can be interpreted as two different ‘rates’ as we increase the size of our dataset (gather more samples). Think of this as adding a bunch of new rows to our colored table of data above.
- Precision is the rate of new true positives over the new rows our model placed above the threshold.
- Recall is the rate of true positives added over the total positives being added.
F-score can be interpreted as the average rate of Precision and Recall when each achieves the same number of true positives.
Similarly the weighted version known as the F-beta score (found here on wikipedia),

Fβ = (1 + β²) · Precision · Recall / (β² · Precision + Recall),

can be interpreted as the average slope when achieving β² times as many true positives along the Recall slope compared to the Precision slope.
As discussed earlier, there are many good resources already out there for learning about Precision and Recall.
The goal of this article is to build some mathematical intuition about these metrics.