Part 1: Learn about calibrating machine learning models to obtain sensible and interpretable probabilities as outputs
Despite the plethora of blogs one can find today about fancy machine learning and deep learning models, I could not find many resources on model calibration and its importance. What I found even more surprising was that model calibration can be critical for some use cases, and yet it is not talked about enough. Hence, I will write a four-part series delving into calibrating models. Here is what you can expect to learn by the end of the series.
- What is model calibration and why it is important
- When to and When NOT to calibrate models
- How to assess whether a model is calibrated (reliability curves)
- Different techniques to calibrate a Machine Learning model
- Model calibration in low-data settings
- Calibrating multi-class classifiers
- Calibrating modern Deep Learning Networks in PyTorch
- Calibrating regressors
In today’s blog, we will be looking at the first four highlighted points.
Let’s consider a binary classification task and a model trained on this task. Without any calibration, the model’s outputs cannot be interpreted as true probabilities. For instance, if a cat/dog classifier outputs a prediction value of 0.4 for an example being a dog, this value cannot be interpreted as a probability. To interpret the output of such a model in terms of a probability, we need to calibrate the model.
Surprisingly, most models are not calibrated out of the box, and their prediction values often tend to be under- or over-confident. An over-confident model, for example, predicts values close to 0 and 1 in many cases where it should not be doing so.
Interpreting the output of a non-calibrated and calibrated model
To better understand why we need model calibration, let’s revisit the previous example with an output value of 0.4. Ideally, what we would want this value to represent is the following: if we were to take 10 such pictures, and the model classified them as dogs with probabilities around 0.4, then in reality 4 of those 10 pictures would actually be dog pictures. This is exactly how we should interpret outputs from a calibrated model.
However, if the model is not calibrated, then we should not expect this score to mean that 4 out of the 10 pictures will actually be dog pictures.
The whole reason we calibrate models is that we want the outputs to make sense when interpreted as standalone probabilities. However, for some use cases, such as a model that ranks titles of news articles by quality, we just need to know which title scored the highest if our policy is to select the best title. In this case, calibrating the model does not make much sense.
Let’s say we want to classify whether a fire alarm triggers correctly. (We will work through this in code today.) Such a task is critical in the sense that we want to thoroughly understand our model’s predictions and improve the model so that it is sensitive to true fires. Let’s say we run the model on two examples, and it scores the chances of a fire as 0.3 and 0.9. For an uncalibrated model, this does not mean that the second example is likely to result in an actual fire three times as often as the first one.
Moreover, after deploying this model and receiving some feedback, we now contemplate improving our smoke detectors and sensors. Running some simulations using our new model, we see that the previous examples now score 0.35 and 0.7.
Say improving our system costs 200 thousand US dollars. We want to know whether we should invest this amount of money for a change in score of 0.05 and 0.2 for each example respectively. For an uncalibrated model, comparing these numbers would not make any sense, and hence we won’t be able to correctly estimate whether the investment will lead to tangible gains. But if our model were calibrated, we could settle this dilemma through an expert-guided, probability-based investigation.
Often model calibration is critical for models in production that are being improved through continual learning and feedback.
Now that we know why we should calibrate our model (if needed), let’s find out how to identify whether our model is calibrated.
Those who directly want to skip to the code can access it here.
Today, we will look at a smoke detection dataset from Kaggle. To read more about the covariates and the types of smoke detectors, check out the description page of the dataset on Kaggle. We will try calibrating a LightGBM model on this data, since gradient-boosted tree models are usually not calibrated out of the box.
The reliability curve is a nice visual method for identifying whether or not our model is calibrated. First we create bins from 0 to 1. Then we place each data point into a bin according to its predicted output. For instance, if we bin our data in intervals of 0.1, we will have 10 bins between 0 and 1. Say we have 5 data points in the first bin, i.e. 5 points (0.05, 0.05, 0.02, 0.01, 0.02) whose model predictions lie between 0 and 0.1. On the X axis we plot the average of these predictions, i.e. 0.03, and on the Y axis we plot the empirical probability, i.e. the fraction of data points whose ground truth equals 1. Say 1 of our 5 points has the ground truth value 1. In that case our y value will be 1/5 = 0.2, so the coordinates of our first point are [0.03, 0.2]. We do this for all the bins, connect the points to form a line, and then compare this line to the line y = x to assess the calibration. When the dots are above this line, the model is under-predicting the true probability; when they are below it, the model is over-predicting the true probability.
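The binning procedure above can be sketched in a few lines of numpy. This is a minimal illustration using the toy numbers from the worked example, not the code used on the real dataset:

```python
import numpy as np

def reliability_points(y_true, y_prob, n_bins=10):
    """Return (mean prediction, fraction of positives) for each non-empty bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    xs, ys = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # include the right edge only for the final bin
        mask = (y_prob >= lo) & ((y_prob < hi) if hi < 1.0 else (y_prob <= hi))
        if mask.any():
            xs.append(y_prob[mask].mean())  # x: average predicted probability
            ys.append(y_true[mask].mean())  # y: empirical probability
    return np.array(xs), np.array(ys)

# the worked example from the text: five points in the first bin, one positive
x, y = reliability_points([1, 0, 0, 0, 0], [0.05, 0.05, 0.02, 0.01, 0.02])
print(x, y)  # approximately [0.03] [0.2]
```

Plotting these points against the line y = x gives the reliability curve.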
We can construct this plot using Sklearn and it looks like the plot below.
As you can see, the model is over-confident until about 0.6 and then under-predicts around 0.8.
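For reference, sklearn exposes this computation through `calibration_curve`, which returns exactly the per-bin averages described above. The labels and scores below are illustrative toy values:

```python
import numpy as np
from sklearn.calibration import calibration_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7])

# prob_true: fraction of positives per bin (y axis)
# prob_pred: mean predicted probability per bin (x axis)
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=2, strategy="uniform")
print(prob_pred, prob_true)  # [0.2625 0.75  ] [0.25 1.  ]
```

Plotting `prob_pred` against `prob_true` (sklearn also provides `CalibrationDisplay` for this) reproduces the reliability curve.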
However, the Sklearn plot has a few flaws and hence I prefer using the plots from Dr. Brian Lucena’s ML-insights package.
This package shows you confidence intervals around the data points and also how many data points fall in each bin, so you can choose custom bin intervals accordingly. As we will also see, models are sometimes so over-confident that they predict values very close to 0 or 1; for these cases the package has a handy logit-scaling feature to show what’s happening around probabilities very close to 0 or 1.
Here is the same plot as the one above created using Ml-insights.
As you can see, we also see the histogram distribution of the data points in each bin along with the confidence interval.
Quantitatively Assessing Model Calibration
According to what I have gathered from reading some of the literature in this area, there is no perfect method for capturing model calibration error. Metrics such as Expected Calibration Error (ECE) are often used in the literature, but as I have found (and as you can see in my notebook and code), ECE varies wildly with the number of bins you select and hence isn’t always foolproof. I will discuss this metric in more detail in the more advanced calibration blogs in the future. You can read more about ECE in this blog here; I would strongly suggest you go through it.
A metric I use here, based on Dr. Lucena’s blogs, is the traditional log loss. The simple intuition is that log loss (or cross-entropy) penalizes models that are too overconfident when making wrong predictions, or whose predictions differ significantly from the true probabilities. You can read more about quantitative model calibration in this notebook.
To summarize, we would expect a calibrated model to have a lower log-loss than one that is not calibrated well.
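To see that penalty in action, here is a minimal numpy sketch of binary log loss with made-up predictions (not actual outputs from our model):

```python
import numpy as np

def log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy: heavily penalizes confident mistakes."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

y_true = [1, 0, 1, 0]
overconfident = [0.99, 0.99, 0.99, 0.01]  # second prediction is badly wrong
hedged = [0.80, 0.40, 0.80, 0.20]         # same mistake, far less confident
print(log_loss(y_true, overconfident))  # larger loss
print(log_loss(y_true, hedged))         # smaller loss
```

The single confident mistake dominates the overconfident model's loss, which is exactly the behavior we want from a calibration metric.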
Splitting the Data
Before we do ANY calibration, it is important to understand that we cannot calibrate our model and then test the calibration on the same dataset. Hence, to avoid data leakage, we first split the data into three sets: train, validation, and test.
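A three-way split like this can be done with two calls to sklearn's `train_test_split`. The 60/20/20 proportions and the synthetic stand-in data below are illustrative, not the split used in the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical feature matrix and binary labels standing in for the real data
rng = np.random.default_rng(42)
X = rng.random((1000, 5))
y = rng.integers(0, 2, 1000)

# 60% train, 20% validation (for fitting the calibrator), 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp)
print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

Stratifying keeps the class balance similar across the three sets, which matters when, as here, positives are rare.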
First, this is how our uncalibrated LightGBM model performs on our data.
Platt Scaling assumes that there is a logistic relationship between the model predictions and the true probabilities.
Spoiler — This is not true in many cases.
We simply fit a logistic regression on the model’s predictions for the validation set, with the true labels of the validation set as the targets.
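A minimal sketch of that fit, using hypothetical validation scores rather than our actual model's outputs (the true probability here is deliberately non-logistic in the score, to mimic miscalibration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# hypothetical uncalibrated validation-set scores and labels;
# the true positive rate is score**2, not a logistic function of the score
rng = np.random.default_rng(0)
val_preds = rng.uniform(0, 1, 500)
val_labels = (rng.uniform(0, 1, 500) < val_preds**2).astype(int)

# Platt scaling: a one-feature logistic regression from raw scores to labels
platt = LogisticRegression()
platt.fit(val_preds.reshape(-1, 1), val_labels)

def calibrate(scores):
    """Map raw model scores to calibrated probabilities."""
    return platt.predict_proba(np.asarray(scores).reshape(-1, 1))[:, 1]

print(calibrate([0.3, 0.9]))  # a monotone remapping of the raw scores
```

Because the fitted sigmoid is monotone, Platt scaling never changes the ranking of predictions, only their scale.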
Here is how it performs.
As we can see our log-loss has definitely reduced here. Since we have many data points with model predictions close to 0, we can see the benefit of using the Ml-insights package (and its logit scaling feature) here.
This method fits a non-decreasing (isotonic) step function that maps the model’s predictions to empirical probabilities, and it works better than Platt scaling when we have enough data for it to fit. The detailed algorithm can be found here.
I used the ml-insights package to implement isotonic regression.
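The same fit can be sketched with sklearn's `IsotonicRegression` (the notebook itself uses ml-insights); the validation scores and labels below are hypothetical:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# hypothetical validation-set scores and labels, as in the Platt example
rng = np.random.default_rng(1)
val_preds = rng.uniform(0, 1, 500)
val_labels = (rng.uniform(0, 1, 500) < val_preds**2).astype(int)

# fit a non-decreasing step function from raw scores to probabilities
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(val_preds, val_labels)

calibrated = iso.predict([0.1, 0.5, 0.9])
print(calibrated)  # non-decreasing values in [0, 1]
```

Unlike Platt scaling, isotonic regression assumes no parametric shape, only monotonicity, which is why it needs more data to avoid overfitting.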
This seems to work better than Platt scaling for our data, although it would be much wiser to draw such conclusions after averaging the results of these experiments over different data splits and random seeds, or after cross-validation (as we will see in future blogs).
This algorithm was given by the author of the Ml-insights package (Brian Lucena) and can be found in this paper.
Essentially, the algorithm fits a smooth cubic spline (chosen to minimize a certain loss, as detailed in the paper for those interested in the technical nitty-gritty) to the model’s predictions on the validation set and their true labels.
Spline Calibration fares the best on our data (for this split at least).
Here is how all of them do in a single plot
A lot of contemporary literature mentions ECE as a metric to measure how well a model is calibrated.
Here is how ECE is formally calculated.
1. Choose n, the number of bins, as we did earlier.
2. For each bin, calculate the average of the model predictions of the data points belonging to that bin.
3. For each bin, calculate the fraction of true positives.
4. For each bin, take the absolute difference between the values from steps 2 and 3, and multiply it by the number of data points in that bin.
5. Add up the results of step 4 over all bins and divide by the total number of samples.
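The steps above can be sketched in a few lines of numpy. This is a minimal version for illustration, not the exact implementation my experiments use:

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE: bin-count-weighted average gap between confidence and accuracy."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(y_prob)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (y_prob >= lo) & ((y_prob < hi) if hi < 1.0 else (y_prob <= hi))
        if mask.any():
            conf = y_prob[mask].mean()  # average prediction in the bin
            acc = y_true[mask].mean()   # fraction of true positives
            ece += mask.sum() * abs(acc - conf)
    return ece / n

# worked check: two bins, each with a |accuracy - confidence| gap of 0.3
print(expected_calibration_error([0, 0, 1, 1], [0.2, 0.4, 0.6, 0.8], n_bins=2))  # ≈ 0.3
```

Note that `n_bins` appears directly in the binning, which is exactly why the metric shifts as we vary the bin count below.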
The code to calculate ECE can be found in this blog and has been used in my experiments.
However, in my case the distribution of data points across the bins was not very uniform (most data points fell into the first bin), so it is imperative to select the bins for ECE accordingly. From the algorithm, we can see how the number of bins directly affects ECE.
For instance, with only 5 bins, the uncalibrated model seemed to have a lower calibration error than all the other methods.
However when we increase the number of bins, we can see that model calibration has actually helped in our case.
In the code snippets below, this effect can be verified. Please overlook the OE (overconfidence error) metric for now, as it is not widely used in the literature.
For 5 bins we have
For 50 bins we have
For 500 bins we have
For 5000 bins we have
In today’s blog we saw what model calibration is, how to assess a model’s calibration and some metrics for doing so, explored the ml-insights package along with some methods to calibrate a model, and finally examined the fallacies of ECE.
Next time we will look into robust calibration for low-data settings, calibrating deep learning models and finally calibrating regressors.
If you liked this here are some more!
I thank Dr. Brian Lucena for his help and advice on various topics related to this blog. I also found his YouTube playlist on model calibration extremely detailed and helpful and most of my experiments are based on his videos.