Is your model continuously performing as expected?
So you’ve built and deployed your model. Be it simple logistic regression, SVM, random forest, or the infamous deep learning.
The business users are also excited to see its impact. Whether it keeps customers around through new personalized, targeted campaigns, increases transaction volumes and sales through up-sell/cross-sell, or delivers whatever KPIs you promised.
Congrats.
For the first couple of months, everything was going great.
Then one day you check the company dashboards/reports and the KPIs are sliding back to their pre-model state, or perhaps worse. Stakeholders are bombarding you with notifications demanding answers. They are questioning your model’s performance.
The most common explanations are:
- Your model overfitted. Perhaps it did not take into consideration some key factors, like seasonality. Perhaps you did not properly sample the data.
- Late data issue. Perhaps the load balancers malfunctioned and the system did not update the data for a whole day, so either the latest reports are inaccurate or the model made its inference on incomplete data. Simply run a count(*) on the tables in question and escalate it to the IT/data engineering team. Or, it could be…
- The data itself has shifted.
This means there is a fundamental change in the data, and the model you’ve built can no longer represent the current situation of the business, be it from internal or external factors. In other words, the data the model was trained on is no longer relevant, and therefore the model is outdated.
This is very likely to happen to businesses everywhere. Look at COVID. Remember how much change it brought? Or the current hot news: inflation. Both cause customers and businesses to change the way they behave and work in significant ways.
Or let’s take a simpler, more common, and less apocalyptic example. Say you are working at a big telecommunications company. One of your competitors is offering a huge discount on their prepaid packages with generous benefits, something no other company has ever done before. It turns out your market loves this so much that they have decided to abandon you for your competitor. It’s not you, it’s them.
All of these are external factors. What about internal ones?
Well, this is definitely your company’s doing. A change in policy/management. The business is growing, or it is losing more money than it is making overall, so there are new products or cutbacks. Perhaps they soft-launched a new product variation that is very different from existing ones. Or your star retail employee of 10 years has left/retired/finally used their holiday allowance all at once, and customers are not loving the replacement’s service.
Can this be avoided? Yes, if your business captures all of these events in your data storage. Which, as you will understand if you’ve worked for any company, is extremely difficult and costly to do. So, you’ve just got to make do with what you have.
This is why it is very important to have a model monitoring practice in place BEFORE the results are sent out. Every time the model runs inference on the latest data, you need to look out for these data shifts before the results are given to business users. If all is good, the results can be blasted out. If not, and depending on the severity, you can either apply a quick fix or raise awareness that something different has happened, AS WELL as provide measurable proof.
There are three easy ways to do it:
- Descriptive statistics
A simple time series report can easily tell you whether there is a shift in your data. For instance, a dip in the monthly revenue MoM trend is a clear indicator that your business is not doing so well overall. If the business keeps doing worse every month, it’s only a matter of time until the model no longer recognizes the sales data the way it used to.
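If you want to automate that check, here is a minimal sketch in Python with pandas; the transaction_date and revenue column names are just assumptions for illustration.

```python
# Minimal sketch of a month-over-month (MoM) revenue check with pandas.
# Assumes a DataFrame with a datetime "transaction_date" column and a numeric
# "revenue" column; both names are hypothetical.
import pandas as pd

def monthly_revenue_trend(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate revenue by month and compute the MoM percentage change."""
    monthly = (
        df.assign(month=df["transaction_date"].dt.to_period("M"))
          .groupby("month")["revenue"]
          .sum()
          .to_frame("revenue")
    )
    monthly["mom_change_pct"] = monthly["revenue"].pct_change() * 100
    return monthly

# A run of consecutive negative mom_change_pct values is the kind of dip
# worth flagging before trusting the model's latest output.
```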
- Population Stability Index (PSI)
This index basically measures the population of the model’s results and how much of it has shifted between classes/groups. You take the latest data and compare it to a period when the model performed well, e.g. its training data, or the data from the month after that (when it also performed well).
Let’s say the model produces N classes/groups/categories. Or it could be binary classification like churn, in which case you can, for example, take each customer’s churn probability and bin them into N equal or non-equal groups: 0–10% as group 1, 11–20% as group 2, etc. It could also be 0–50% as group 1, 51–60% as group 2, and so on. The important thing is consistency throughout the entire process. Determining these bins can require some business acumen as well, as different bins on the same data and model can impact the monitoring metrics significantly.
Simply count how many cases/customers fall within those groups for both the training data and the latest one, and take each count as a percentage of its respective total. Then, for each bin, multiply the difference between the training percentage (DT) and the latest percentage (DL) by the natural log of their ratio: PSI = (DT − DL) × ln(DT / DL).
This is an example of the calculation; you can try to recreate the formula in Excel.
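If Excel is not your thing, a minimal Python sketch of the per-bin calculation could look like this; the example bin edges and the small floor used for empty bins are assumptions, not part of the original recipe.

```python
# Minimal PSI sketch: compare the distribution of predicted churn probabilities
# between the training period and the latest scoring period.
import numpy as np

def psi_per_bin(train_scores, latest_scores, bin_edges):
    """Return the PSI contribution of each bin; summing them gives the overall PSI."""
    train_counts, _ = np.histogram(train_scores, bins=bin_edges)
    latest_counts, _ = np.histogram(latest_scores, bins=bin_edges)

    # Percentages of each period's total: DT and DL in the formula above.
    dt = train_counts / train_counts.sum()
    dl = latest_counts / latest_counts.sum()

    # Floor empty bins to avoid division by zero and log(0).
    dt = np.clip(dt, 1e-6, None)
    dl = np.clip(dl, 1e-6, None)

    return (dt - dl) * np.log(dt / dl)

# Example usage: ten equal-width probability bins (0-10% as group 1, and so on).
# bin_edges = np.linspace(0.0, 1.0, 11)
# psi_values = psi_per_bin(train_probs, latest_probs, bin_edges)
```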

The rules of thumb for PSI are:
- At least one bin is >20%: a data shift has definitely occurred. Retrain the model. If not, then
- At least one bin is 10–20%: a slight shift has occurred. This would likely cause a bit of a drop in the model’s performance. If not, then
- All bins are <10%: no significant data shift. Carry on.
We can see that the above example already has two groups whose PSI is >20%. Therefore we need to investigate what has happened to our customers’ behavior and retrain the model.
Bear in mind that these thresholds are not fixed. It depends on how much change you and the business are willing to tolerate. For instance, in bin 3 you can clearly see there is a huge difference in absolute numbers, but the PSI is small.
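As a small illustration of how those rules of thumb could be wired into a monitoring job, here is a hedged sketch; the default thresholds are simply the fraction equivalents of the 20% and 10% figures above, and the action messages are placeholders.

```python
# Turn per-bin PSI values into the actions described above.
# 0.2 corresponds to the 20% threshold and 0.1 to the 10% threshold.
def flag_bins(psi_values, retrain_threshold=0.2, warn_threshold=0.1):
    """Return (bin_number, psi, action) tuples for bins that breach a threshold."""
    flags = []
    for bin_number, value in enumerate(psi_values, start=1):
        if value > retrain_threshold:
            flags.append((bin_number, value, "data shift: investigate and retrain"))
        elif value > warn_threshold:
            flags.append((bin_number, value, "slight shift: monitor closely"))
    return flags
```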
- Characteristic Stability Index (CSI)
If PSI determines whether there is a data shift in the population, then CSI determines which features drove it. The calculation is basically the exact same as PSI. The only difference is that we drill down into the problematic bins (in the example above, groups 5, 8, and 9) and group them further based on the features.
Let’s say you have 10 features: age, monthly expenses, outstanding bills, whatever the case may be. Bin each feature’s values accordingly and do the same calculation as you did for PSI.
The example below shows the CSI of 2 features for bin 8.


From the above example, we can speculate that for bin 8 the number of young customers has risen and the number of older ones has declined, so much so that it has caused a shift in our data for these bins. The monthly expenses for this group have also dropped significantly.
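In code, this drill-down can reuse the psi_per_bin sketch from the PSI section; the feature names, bin edges, and the train_bin8/latest_bin8 DataFrames below are purely hypothetical.

```python
# CSI is the same math as PSI, applied to a feature's values for the customers
# who fall into a problematic score bin (bin 8 in the example above).
feature_bins = {
    "age": [18, 25, 35, 45, 55, 65, 100],
    "monthly_expenses": [0, 100, 250, 500, 1000, 5000],
}

def csi_report(train_bin8, latest_bin8):
    """Per-feature CSI (sum of per-bin contributions) for one problematic score bin."""
    return {
        feature: psi_per_bin(train_bin8[feature], latest_bin8[feature], edges).sum()
        for feature, edges in feature_bins.items()
    }
```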
Descriptive statistics, PSI, and CSI are very simple and quite effective metrics for monitoring your model’s performance. But one thing that beats these metrics in determining whether the data has shifted is business and market understanding that is regularly refreshed. Always stay up to date on your business strategies, market, and customers’ needs.