*Part 1: Learn about calibrating machine learning models to obtain sensible and interpretable probabilities as outputs*

Despite the plethora of blogs one can find today about fancy machine learning and deep learning models, I could not find many resources that discussed model calibration and its importance. What I found even more surprising is that model calibration can be critical for some use cases, yet it is not talked about enough. Hence, I will write a 4-part series delving into calibrating models. Here is what you can expect to learn by the end of the series.

## Learning Outcomes

- **What is model calibration and why it is important**
- **When to and when NOT to calibrate models**
- **How to assess whether a model is calibrated (reliability curves)**
- **Different techniques to calibrate a Machine Learning model**
- Model calibration in low-data settings
- Calibrating multi-class classifiers
- Calibrating modern Deep Learning Networks in PyTorch
- Calibrating regressors

In today’s blog, we will be looking at the first four highlighted points.

Let’s consider a binary classification task and a model trained on this task. Without any calibration, the model’s outputs cannot be interpreted as true probabilities. For instance, for a cat/dog classifier, if the model outputs that the prediction value for an example being a dog is 0.4, this value cannot be interpreted as a probability. *To interpret the output of such a model in terms of a probability, we need to calibrate the model.*

Surprisingly, most models are not calibrated out of the box, and their prediction values often tend to be under- or over-confident. What this means is that they predict values close to 0 and 1 in many cases where they should not be doing so.

## Interpreting the output of a non-calibrated and calibrated model

To better understand why we need model calibration, let’s revisit the previous example, where the output value is 0.4. **Ideally, what we would want this value to represent is the fact that if we were to take 10 such pictures, for which the model predicted probabilities of around 0.4 for being a dog, then in reality 4 of those 10 pictures would actually be dog pictures.** This is exactly how we should interpret outputs from a **calibrated** model.

**However, if the model is not calibrated, then we should not expect that this score would mean that 4 out of the 10 pictures will actually be dog pictures.**

The whole reason we calibrate models is that we want the outputs to make sense when interpreted as standalone probabilities. However, in some cases, such as a model that ranks titles of news articles by quality, we only need to know which title scored the highest if our policy is to select the best title. In such a case, calibrating the model does not make much sense.

Let’s say we want to classify whether a fire alarm triggers correctly. (We will go through this in code today.) Such a task is critical in the sense that we want to thoroughly understand our model’s predictions and improve the model so that it is sensitive to true fires. Say we run the model on two examples and it scores the chances of a fire as 0.3 and 0.9. **For an uncalibrated model, this does not mean that the second example is three times as likely to result in an actual fire as the first.**

Moreover, after deploying this model and receiving some feedback, suppose we now consider improving our smoke detectors and sensors. Running some simulations with our new model, we see that the previous examples now score 0.35 and 0.7.

Say improving our system costs 200 thousand US Dollars. We want to know whether we should invest this amount of money for a change in score of 0.05 and 0.2 for each example, respectively. **For an uncalibrated model, comparing these numbers would not make any sense, and hence we would not be able to correctly estimate whether the investment would lead to tangible gains. But if our model were calibrated, we could settle this dilemma through an expert-guided, probability-based investigation.**

*Often model calibration is critical for models in production that are being improved through continual learning and feedback.*

Now that we know why we should calibrate our model (if needed), let’s find out how to identify whether our model is calibrated.

Those who directly want to skip to the code can access it here.

## The Dataset

Today, we will look at the telecom customer churn prediction dataset from Kaggle. To read more about the covariates and their types, check out the description page of the dataset on Kaggle. **We will try calibrating a LightGBM model on this data, since gradient-boosted tree models like LightGBM usually are uncalibrated out-of-the-box.**

**The dataset is officially from IBM and can be freely downloaded here. It is licensed under the Apache License 2.0, as found here.**

## Reliability Curves

The reliability curve is a nice visual method to identify whether or not our model is calibrated. First, we create bins from 0 to 1. Then we divide our data according to the predicted outputs and place it into these bins. For instance, if we bin our data in intervals of 0.1, we will have 10 bins between 0 and 1. Say we have 5 data points in the first bin, i.e., 5 points **(0.05, 0.05, 0.02, 0.01, 0.02)** whose model predictions lie between 0 and 0.1. On the X axis we plot the average of these predictions, i.e., **0.03**, and on the Y axis we plot the empirical probability, i.e., the fraction of data points with ground truth equal to 1. Say 1 of our 5 points has the ground truth value 1; then our y value is 1/5 = 0.2, and the coordinates of our first point are **[0.03, 0.2]**. We do this for all the bins and connect the points to form a line. We then compare this line to the line **y = x** and assess the calibration. **When the dots are above this line, the model is under-predicting the true probability, and if they are below the line, the model is over-predicting the true probability.**
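The binning procedure above can be sketched directly in NumPy. This is a minimal illustration of the computation (not the code from the accompanying notebook), using the toy bin from the text:

```python
import numpy as np

def reliability_points(y_true, y_prob, n_bins=10):
    """Compute (mean predicted probability, empirical frequency) per bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to a bin; clip so 1.0 falls in the last bin
    ids = np.clip(np.digitize(y_prob, bins) - 1, 0, n_bins - 1)
    xs, ys = [], []
    for b in range(n_bins):
        mask = ids == b
        if mask.sum() == 0:
            continue  # skip empty bins
        xs.append(y_prob[mask].mean())  # average predicted probability (X axis)
        ys.append(y_true[mask].mean())  # fraction of positives (Y axis)
    return np.array(xs), np.array(ys)

# The five predictions from the text, exactly one of which is a true positive
probs = np.array([0.05, 0.05, 0.02, 0.01, 0.02])
labels = np.array([0, 0, 1, 0, 0])
x, y = reliability_points(labels, probs)
print(round(float(x[0]), 2), round(float(y[0]), 2))  # 0.03 0.2
```

Plotting `x` against `y` and comparing to the diagonal gives the reliability curve described above.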

We can construct this plot using Sklearn and it looks like the plot below.
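scikit-learn exposes this computation via `sklearn.calibration.calibration_curve`, which returns the per-bin empirical frequencies and mean predicted probabilities. A minimal sketch on synthetic, deliberately over-confident scores (the data here is made up purely for illustration):

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
# True event probabilities and labels drawn from them
p = rng.uniform(0, 1, 5000)
y = (rng.uniform(0, 1, 5000) < p).astype(int)
# Push scores toward 0 and 1 to mimic an over-confident model
scores = np.clip(p + 0.3 * np.sign(p - 0.5), 0, 1)

prob_true, prob_pred = calibration_curve(y, scores, n_bins=10)
for pt, pp in zip(prob_true, prob_pred):
    print(f"mean predicted: {pp:.2f}  empirical: {pt:.2f}")
```

Plotting `prob_pred` vs. `prob_true` against the `y = x` diagonal (e.g., with matplotlib) reproduces the reliability curve; empty bins are dropped automatically.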

As you can see, the model is over-confident until about 0.6 and then under-predicts around 0.8.

However, the Sklearn plot has a few flaws, and hence **I prefer using the plots from Dr. Brian Lucena’s ML-insights package**.

This package shows you confidence intervals around the data points and also shows you how many data points fall in each bin, so you can create custom bin intervals accordingly. As we will also see, models are sometimes so over-confident that they predict values very close to 0 or 1; for such cases, the package has a handy logit-scaling feature to show what’s happening at probabilities very close to 0 or 1.

Here is the same plot as the one above, created using ML-insights.

As you can see, we also see the histogram distribution of the data points in each bin along with the confidence interval.

## Quantitatively Assessing Model Calibration

From what I have gathered while reading some of the literature in this area, there is no perfect method for capturing model calibration error. Metrics such as Expected Calibration Error (ECE) are often used in the literature, but as I have found (and as you can see in my notebook and code), ECE varies wildly with the number of bins you select and hence isn’t always foolproof. I will discuss this metric in more detail in the more advanced calibration blogs in the future. **You can read more about ECE in this blog *here*. I would strongly suggest you go through it.**
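To see the bin sensitivity concretely, here is a minimal sketch of one common binary-ECE variant: the gap between mean predicted probability and empirical frequency in each bin, weighted by bin size, evaluated at several bin counts. The synthetic data is illustrative, not from the churn dataset:

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Bin-size-weighted average gap between confidence and empirical frequency."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ids = np.clip(np.digitize(y_prob, bins) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = ids == b
        if mask.sum() == 0:
            continue
        gap = abs(y_prob[mask].mean() - y_true[mask].mean())
        ece += (mask.sum() / len(y_prob)) * gap
    return ece

rng = np.random.default_rng(42)
p = rng.uniform(0, 1, 2000)
y = (rng.uniform(0, 1, 2000) < p).astype(int)

# The exact same predictions produce a different ECE for each bin count
for n in (5, 10, 50):
    print(n, round(expected_calibration_error(y, p, n_bins=n), 4))
```

Even for these well-calibrated synthetic scores, the reported ECE shifts as the number of bins changes, which is the instability described above.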

A metric I use here, based on Dr. Lucena’s blogs, is the traditional log loss. The simple intuition is that log loss (or cross-entropy) penalises models that are over-confident when making wrong predictions, or whose predictions differ significantly from the true probabilities. You can read more about quantitative model calibration in this notebook.

*To summarize, we would expect a calibrated model to have a lower log-loss than one that is not calibrated well.*
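The over-confidence penalty is easy to see with scikit-learn’s `log_loss` on a toy example (the numbers are hand-picked and purely illustrative):

```python
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 0]

# Very confident scores, but half of the confident calls are wrong
confident_wrong = [0.05, 0.95, 0.9, 0.1]
# Hedged scores on the correct side of 0.5 for every example
hedged = [0.6, 0.4, 0.6, 0.4]

print(log_loss(y_true, confident_wrong))  # large: punished for confident mistakes
print(log_loss(y_true, hedged))           # small: modest but correct probabilities
```

The confidently wrong model incurs a much higher log loss than the hedged one, which is why a well-calibrated model tends to score lower on this metric.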

## Splitting the Data

Before we do ANY calibration, it is important to understand that we cannot calibrate our model and then test the calibration on the same dataset. Hence, to avoid data leakage, we first split the data into three sets: train, validation, and test.
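With scikit-learn, one common way to get the three sets is two successive `train_test_split` calls. The synthetic `X`/`y` below are stand-ins for the actual churn features and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (1000 rows, 5 features) for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# First carve out a held-out test set (20%)...
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
# ...then split the remainder into train (60%) and validation (20%) overall
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

Stratifying on the label keeps the class balance similar across the three sets, which matters when the positive class is rare, as is typical for churn.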