One of the important responsibilities of an MLOps engineer is to monitor model performance. Data drift degrades model performance over time. Let's discuss data drift and the steps we can take to detect it in detail.
Data drift refers to changes in the data distribution over a period of time. This can occur due to:
- changes in the data collection process,
- changes in the data source,
- or changes in the business needs or goals.
Data drift can lead to poor model performance, because the model is being applied to data that is different from the data it was trained on. There are several types of data drift, including concept drift, covariate drift, prior probability shift, and virtual drift.
I have used ChatGPT, Excalidraw, GitMind, and carbon.now.sh for this article.
- Types of Data Drift
- Covariate Shift or Feature Drift or Input Drift
- Label Drift or Target Drift or Output Drift
- Prediction Drift or Model Drift
- Concept Drift or Task Drift
- Create Alert and Notify
- Python Packages for Drift Detection
- Conclusion
Covariate Shift or Feature Drift or Input Drift:
Feature drift is a change in the distribution of the input variables, or features, used to train the model.
Feature drift can occur due to the
- changes in the data collection process
- changes in the data sources which provide the inputs
- changes in the business needs.
Feature drift can lead to poor model performance because the model is being used to make predictions based on features that have changed and may no longer be relevant or accurate.
Let's say we have a model that is trained to predict the output variable Y based on the input variables X. We can represent the probability distribution of the input variables, p(X), and the probability distribution of the output variables, p(Y), as follows:
p(X): This is the probability distribution of the input variables. It describes the likelihood of different values of X occurring in the data.
p(Y): This is the probability distribution of the output variables. It describes the likelihood of different values of Y occurring in the data.
If there is feature drift in the data, this means that the distribution of the input variables has changed over time. This can be represented mathematically as follows:
- p(X1): This is the probability distribution of the input variables at time t1. It describes the likelihood of different values of X occurring in the data at time t1.
- p(X2): This is the probability distribution of the input variables at time t2. It describes the likelihood of different values of X occurring in the data at time t2.
If p(X1) and p(X2) are significantly different, this could indicate that there is feature drift in the data. This can lead to poor model performance because the model is being applied to data with different features than it was trained on.
Actions Needed to Monitor Feature drift:
- Monitor the input data: Regularly monitor the input data to detect changes in the distribution of the features. This can help you identify when feature drift has occurred, so you can take action to correct it.
- Retrain the model: If you detect feature drift, you may need to retrain the model on updated data to correct for the drift. This will ensure that the model is able to make accurate predictions based on the current distribution of the features.
- Use a drift-detection method: There are several methods available for detecting feature drift, such as the Jensen-Shannon divergence or the Kolmogorov-Smirnov test. These methods can help you detect when the distribution of the features has changed, so you can take action to correct it.
- Use domain knowledge: Utilize your domain knowledge to understand why the feature drift occurred and what steps you can take to correct it. For example, if you know that the data collection process has changed, you may need to adjust the way you collect and preprocess the data.
- Use a robust model: Consider using a model that is more robust to feature drift, such as a random forest or a gradient-boosting model. These models are less sensitive to changes in the input data, so they may be less affected by feature drift.
Methods Available to detect Feature Drift:
There are several methods available for detecting feature drift in machine learning:
- Visual inspection: One simple method is to visually inspect the input data over time to see if there are any noticeable changes in the distribution of the features. This can be done by plotting histograms or scatter plots of the data at different times and comparing them.
- Statistical tests: Statistical tests such as the Jensen-Shannon divergence or the Kolmogorov-Smirnov test can be used to compare the input data distribution at different times and detect significant differences. These tests can provide a quantitative measure of the degree of drift in the data.
- Drift detection algorithms: There are also specialized algorithms that are designed specifically for detecting drift in data. These algorithms can be used to automatically detect changes in the distribution of the features and trigger an alert or action when drift is detected.
- Model performance monitoring: Another approach is to monitor the performance of the model over time and look for significant changes in accuracy or other performance metrics. If the model’s performance begins to degrade, this could indicate that there is a drift in the data.
- Data quality checks: Regularly checking the quality of the input data can also help detect feature drift. For example, if there are sudden changes in the range or variance of the features, this could indicate that there is a drift in the data.
We can monitor the following to detect feature drift:
Numeric Features
This function computes a variety of summary statistics for each feature in the training and test data, including the mean, median, variance, missing value count, maximum value, and minimum value. It then compares these statistics between the training and test data and returns True if any of the differences are above a certain threshold. This could be used to detect numeric feature drift, as significant differences in the summary statistics between the training and test data could indicate that there is drift in the data.
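Here is a minimal sketch of such a check, assuming the training and test data are pandas DataFrames with the same numeric columns; the relative-difference threshold is illustrative:

```python
import pandas as pd

def numeric_summary_drift(train_df: pd.DataFrame, test_df: pd.DataFrame,
                          threshold: float = 0.1) -> bool:
    """Return True if any numeric summary statistic differs between the
    training and test data by more than `threshold` (relative difference)."""
    def summarize(df: pd.DataFrame) -> pd.DataFrame:
        numeric = df.select_dtypes(include="number")
        summary = numeric.agg(["mean", "median", "var", "max", "min"])
        summary.loc["missing"] = numeric.isna().sum()
        return summary

    train_stats, test_stats = summarize(train_df), summarize(test_df)
    # Relative difference per statistic and feature, guarding against division by zero.
    denominator = train_stats.abs().replace(0, 1.0)
    relative_diff = (train_stats - test_stats).abs() / denominator
    return bool((relative_diff > threshold).any().any())
```

In practice you would log which feature and statistic triggered the flag rather than returning a single boolean.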
Statistical Tests:
Jensen-Shannon:
The Jensen-Shannon divergence is a measure of the similarity between two probability distributions. It is defined as the average of the Kullback-Leibler divergences between the two distributions and a third distribution that is the average of the two. The Jensen-Shannon divergence is always non-negative and takes on a value of zero if and only if the two distributions are identical.
In the context of feature drift detection, the Jensen-Shannon divergence can be used to compare the distribution of the features in the training set to the distribution of the features in the test set. If the Jensen-Shannon divergence between these two distributions is above a certain threshold, this could indicate that there is drift in the data.
This function computes the Jensen-Shannon divergence between the feature distributions in the training and test data using the jensen_shannon_divergence() function. The Jensen-Shannon divergence is a measure of the similarity between two probability distributions, with a lower divergence indicating a higher degree of similarity. If the divergence is above a certain threshold, this could indicate that there is a drift in the data.
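A minimal sketch of such a check, using scipy's jensenshannon (which returns the Jensen-Shannon distance, the square root of the divergence) on binned histograms; the bin count and threshold are illustrative assumptions:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def jensen_shannon_drift(train_values, test_values, bins=20, threshold=0.1) -> bool:
    """Return True if the Jensen-Shannon distance between the binned
    train and test distributions exceeds `threshold`."""
    # Use common bin edges so the two histograms are directly comparable.
    edges = np.histogram_bin_edges(np.concatenate([train_values, test_values]), bins=bins)
    train_hist, _ = np.histogram(train_values, bins=edges)
    test_hist, _ = np.histogram(test_values, bins=edges)
    # jensenshannon normalizes the histograms into probability vectors internally.
    return jensenshannon(train_hist, test_hist) > threshold
```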
Two-Sample Kolmogorov-Smirnov (KS), Mann-Whitney, or Wilcoxon tests:
The Kolmogorov-Smirnov (KS) test is a non-parametric test that can be used to detect feature drift in machine learning. It is based on the idea of comparing the distribution of two samples to see if they come from the same population.
To use the KS test for feature drift detection, you would first split your data into two sets: a training set and a test set. The training set is used to train the model, while the test set is used to evaluate the model’s performance. You can then use the KS test to compare the distribution of the features in the training set to the distribution of the features in the test set. If the p-value of the test is below a certain threshold, this could indicate that there is a drift in the data.
This function computes the p-value of the two-sample KS test between the feature distributions in the training and test data using the ks_2samp() function from the scipy.stats module. The KS test is a non-parametric test that compares the distribution of two samples to see if they come from the same population. If the p-value is below a certain threshold, this could indicate that there is a drift in the data. Usually this threshold, or alpha level, is 0.05.
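A minimal sketch of such a check with scipy's ks_2samp; the 0.05 alpha level follows the text above:

```python
from scipy.stats import ks_2samp

def ks_drift(train_values, test_values, alpha=0.05) -> bool:
    """Return True if the two-sample KS test rejects the hypothesis that the
    train and test samples come from the same distribution."""
    statistic, p_value = ks_2samp(train_values, test_values)
    return p_value < alpha
```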
Wasserstein Distance:
This function computes the Wasserstein distance between the feature distributions in the training and test data using the wasserstein_distance() function from the scipy.stats module. The Wasserstein distance is a measure of the distance between two probability distributions, with a higher distance indicating a greater degree of difference between the distributions. If the distance is above a certain threshold, this could indicate that there is a drift in the data.
It is important to note that scipy's wasserstein_distance() operates on one-dimensional samples (and any sample weights must be non-negative). If your data does not meet these requirements, you may need to transform it or use a different method for detecting feature drift.
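A minimal sketch of such a check; note that the distance is expressed in the units of the feature, so the threshold here is an assumption that would need to be tuned per feature (for example, relative to the training standard deviation):

```python
from scipy.stats import wasserstein_distance

def wasserstein_drift(train_values, test_values, threshold=0.1) -> bool:
    """Return True if the Wasserstein (earth mover's) distance between the
    one-dimensional train and test samples exceeds `threshold`."""
    return wasserstein_distance(train_values, test_values) > threshold
```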
Kullback–Leibler divergence:
This function computes the Kullback-Leibler divergence between the feature distributions in the training and test data using the entropy() function from the scipy.stats module. The Kullback-Leibler divergence is a measure of the difference between two probability distributions, with a higher divergence indicating a greater degree of difference between the distributions. If the divergence is above a certain threshold, this could indicate that there is a drift in the data.
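A minimal sketch of such a check using scipy's entropy, which computes the KL divergence when given two distributions; the binning, smoothing constant, and threshold are illustrative assumptions:

```python
import numpy as np
from scipy.stats import entropy

def kl_drift(train_values, test_values, bins=20, threshold=0.1) -> bool:
    """Return True if the KL divergence D(train || test) between the binned
    train and test distributions exceeds `threshold`."""
    edges = np.histogram_bin_edges(np.concatenate([train_values, test_values]), bins=bins)
    train_hist, _ = np.histogram(train_values, bins=edges)
    test_hist, _ = np.histogram(test_values, bins=edges)
    # Add a small constant so no bin has zero probability; KL is undefined
    # when the second distribution has zeros where the first does not.
    p = (train_hist + 1e-8) / (train_hist + 1e-8).sum()
    q = (test_hist + 1e-8) / (test_hist + 1e-8).sum()
    # entropy(p, q) computes the Kullback-Leibler divergence D(p || q).
    return entropy(p, q) > threshold
```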
Categorical Features
Summary Statistics
Mode, Number of unique values, Number of missing values:
This function computes the mode, number of unique values, and number of missing values in the training and test data and compares them to see if there are significant differences. If the mode or number of unique values has changed significantly, or if the number of missing values has increased significantly, this could indicate that there is a drift in the data.
It is important to note that these summary statistics only capture coarse changes in a categorical feature, so subtler shifts in the category distribution may go undetected. You may need to use a different method, or a combination of methods, depending on the characteristics of your data.
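A minimal sketch of such a check for a single categorical feature, assuming pandas Series inputs; the missing-value threshold is illustrative:

```python
import pandas as pd

def categorical_summary_drift(train_series: pd.Series, test_series: pd.Series,
                              missing_threshold: float = 0.1) -> bool:
    """Return True if the mode or the number of unique categories changes,
    or if the share of missing values grows by more than `missing_threshold`."""
    mode_changed = train_series.mode().iloc[0] != test_series.mode().iloc[0]
    nunique_changed = train_series.nunique() != test_series.nunique()
    missing_increase = test_series.isna().mean() - train_series.isna().mean()
    return mode_changed or nunique_changed or missing_increase > missing_threshold
```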
Statistical Tests:
One-way Chi-Squared Test:
The one-way Chi-Squared test (a goodness-of-fit test) is a statistical test that can be used to detect feature drift in categorical data. It compares the observed frequencies of the categories in the test data to the frequencies that would be expected if the test data followed the same category distribution as the training data.
This check computes the Chi-Squared statistic and p-value using the chisquare() function from the scipy.stats module (scipy's one-way, goodness-of-fit test). The Chi-Squared statistic is a measure of the difference between the observed and expected frequencies of the categories in the data, and the p-value is the probability of observing differences at least this large if the test data really followed the training distribution. If the p-value is below a certain threshold, this could indicate that there is drift in the data.
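A minimal sketch of such a check; aligning the category counts and deriving the expected counts from the training proportions are my assumptions (it also assumes every test category appears in the training data, otherwise expected counts of zero occur):

```python
import pandas as pd
from scipy.stats import chisquare

def chi2_one_way_drift(train_series: pd.Series, test_series: pd.Series,
                       alpha: float = 0.05) -> bool:
    """Goodness-of-fit test: do the observed category counts in the test data
    match the counts expected under the training distribution?"""
    train_counts = train_series.value_counts()
    test_counts = test_series.value_counts()
    # Align on the union of categories so the two count vectors line up.
    train_counts, test_counts = train_counts.align(test_counts, fill_value=0)
    expected = train_counts / train_counts.sum() * test_counts.sum()
    statistic, p_value = chisquare(f_obs=test_counts.values, f_exp=expected.values)
    return p_value < alpha
```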
Two-Way Chi-Squared Contingency Test:
The two-way Chi-Squared Contingency Test is a statistical test that can be used to detect feature drift in categorical data. It builds a contingency table of category counts for the training and test data and compares the observed counts to the counts that would be expected if the category distribution were independent of which dataset (training or test) an observation came from.
This function computes the Chi-Squared statistic and p-value of the test using the chi2_contingency() function from the scipy.stats module. The Chi-Squared statistic is a measure of the difference between the observed and expected counts in the contingency table, and the p-value is the probability of observing counts at least this different if the category distribution were independent of the dataset. If the p-value is below a certain threshold, this could indicate that there is a drift in the data.
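A minimal sketch of such a check, assuming pandas Series inputs and the usual 0.05 alpha level:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_contingency_drift(train_series: pd.Series, test_series: pd.Series,
                           alpha: float = 0.05) -> bool:
    """Test whether the category distribution is independent of the dataset
    (train vs. test) using a two-way contingency table of counts."""
    train_counts = train_series.value_counts()
    test_counts = test_series.value_counts()
    train_counts, test_counts = train_counts.align(test_counts, fill_value=0)
    table = np.array([train_counts.values, test_counts.values])
    chi2, p_value, dof, expected = chi2_contingency(table)
    return p_value < alpha
```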
Fisher’s Exact Test:
The Fisher Exact Test is a statistical test that can be used to detect feature drift in categorical data. It is based on the idea of comparing the observed frequencies of the different categories in the training and test data to the expected frequencies that would be observed if the categories were independent.
This function computes the p-value of the Fisher Exact Test using the fisher_exact() function from the scipy.stats module. The p-value is a measure of the probability of observing the observed frequencies if the categories are independent. If the p-value is below a certain threshold, this could indicate that there is drift in the data.
It is important to note that the Fisher Exact Test is preferred over the Chi Squared Contingency Test when the expected frequencies are small, because it is exact rather than approximate; for large frequencies the Chi Squared test is usually adequate and much cheaper to compute. The test also requires the data to be frequency counts rather than proportions or percentages, and scipy's fisher_exact() only supports 2x2 tables (i.e. binary categorical features). If your data is in a different form, you may need to adjust it before running the test.
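A minimal sketch of such a check for a binary categorical feature, assuming pandas Series inputs:

```python
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact

def fisher_drift(train_series: pd.Series, test_series: pd.Series,
                 alpha: float = 0.05) -> bool:
    """Fisher's exact test on a 2x2 table of category counts (train vs. test)."""
    categories = sorted(set(train_series.dropna()) | set(test_series.dropna()))
    if len(categories) != 2:
        raise ValueError("fisher_exact requires a binary categorical feature")
    table = np.array([
        [(train_series == categories[0]).sum(), (train_series == categories[1]).sum()],
        [(test_series == categories[0]).sum(), (test_series == categories[1]).sum()],
    ])
    odds_ratio, p_value = fisher_exact(table)
    return p_value < alpha
```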
Class-MonitorDrift:
Combine all the above and create a class called MonitorDrift:
This class has five methods: feature_drift_fisher(), feature_drift_chi2(), feature_drift_chi2_one_way(), feature_drift_jensen_shannon(), and feature_drift_wasserstein(). Each method takes a single argument, feature, which is the name of the feature for which you want to calculate feature drift.
The feature_drift_fisher() method uses the fisher_exact() function from the scipy.stats module to compute the p-value of the Fisher Exact Test for the specified feature. The feature_drift_chi2() method uses the chi2_contingency() function to compute the Chi Squared statistic and p-value of the two-way Chi Squared test for the specified feature. The feature_drift_chi2_one_way() method is similar, but it uses the one-way Chi Squared test instead.
The feature_drift_jensen_shannon() method uses the jensenshannon() function from the scipy.spatial.distance module to compute the Jensen-Shannon distance between the feature distributions in the training and test data, and the feature_drift_wasserstein() method uses the wasserstein_distance() function from scipy.stats to do the same with the Wasserstein distance.
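A minimal sketch of what such a class could look like, assuming the training and test sets are stored on the object as pandas DataFrames; the binning, the category alignment, and the use of scipy's chisquare() for the one-way test are my assumptions, not the original implementation:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import jensenshannon
from scipy.stats import chi2_contingency, chisquare, fisher_exact, wasserstein_distance


class MonitorDrift:
    def __init__(self, train_df: pd.DataFrame, test_df: pd.DataFrame):
        self.train_df = train_df
        self.test_df = test_df

    def _category_counts(self, feature):
        # Align the train and test category counts on the union of categories.
        train_counts = self.train_df[feature].value_counts()
        test_counts = self.test_df[feature].value_counts()
        return train_counts.align(test_counts, fill_value=0)

    def feature_drift_fisher(self, feature):
        # Fisher's exact test; scipy's implementation handles 2x2 tables,
        # i.e. binary categorical features.
        train_counts, test_counts = self._category_counts(feature)
        table = np.array([train_counts.values, test_counts.values])
        _, p_value = fisher_exact(table)
        return p_value

    def feature_drift_chi2(self, feature):
        # Two-way chi-squared contingency test on the train/test count table.
        train_counts, test_counts = self._category_counts(feature)
        table = np.array([train_counts.values, test_counts.values])
        chi2, p_value, _, _ = chi2_contingency(table)
        return chi2, p_value

    def feature_drift_chi2_one_way(self, feature):
        # One-way (goodness-of-fit) test: observed test counts vs. counts
        # expected under the training distribution.
        train_counts, test_counts = self._category_counts(feature)
        expected = train_counts / train_counts.sum() * test_counts.sum()
        chi2, p_value = chisquare(f_obs=test_counts.values, f_exp=expected.values)
        return chi2, p_value

    def feature_drift_jensen_shannon(self, feature, bins=20):
        # Jensen-Shannon distance between binned numeric distributions.
        combined = pd.concat([self.train_df[feature], self.test_df[feature]])
        edges = np.histogram_bin_edges(combined.dropna(), bins=bins)
        train_hist, _ = np.histogram(self.train_df[feature].dropna(), bins=edges)
        test_hist, _ = np.histogram(self.test_df[feature].dropna(), bins=edges)
        return jensenshannon(train_hist, test_hist)

    def feature_drift_wasserstein(self, feature):
        # Wasserstein (earth mover's) distance between the raw samples.
        return wasserstein_distance(self.train_df[feature].dropna(),
                                    self.test_df[feature].dropna())
```

Usage would then look like MonitorDrift(train_df, test_df).feature_drift_wasserstein("age"), with each method returning a statistic or p-value that you compare against your chosen threshold.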
Label Drift or Target Drift or Output Drift:
Label drift, also known as annotation drift, is a problem that can occur when the labels or categories associated with a dataset change over time. This can happen for a variety of reasons, such as changes in human judgment, the introduction of new categories, or the merging or splitting of existing categories.
- Change in the distribution of the label in the data
- Change in P(Y)
Let's say we have a dataset with feature vectors x and corresponding labels y. The probability distribution of the data is denoted as p(x), and the probability distribution of the labels is denoted as p(y).
In the absence of label drift, we would expect p(x) and p(y) to be stable over time, meaning that the probability of seeing a particular feature vector or label does not change significantly over time. However, when label drift occurs, p(y) changes over time, meaning that the probability of seeing a particular label may change significantly. This can be caused by changes in the definitions of the labels, changes in the way the labels are assigned, or other factors.
Actions Needed to Monitor Label drift:
- Regularly review and update the labels in your dataset.
- Use active learning or self-learning algorithms.
- Monitor the performance of your model.
- Use a drift detection method to identify a label drift.
- Monitor the data collection process.
Methods Available to detect Label Drift:
Here are some of the methods available to detect Label drift.
- The Page-Hinkley test.
- ADWIN (Adaptive Windowing).
- DDM (Drift Detection Method).
- The Two-Way Chi-Squared Contingency Test.
- The One-Way Chi-Squared Contingency Test.
- Fisher's exact test.
Examples of Label Drift:
Here are some examples of label drift.
- Consider a dataset of medical records, where the labels indicate the diagnosis of a patient. If the definitions of the medical conditions change over time (e.g. due to new research or updates to clinical guidelines), then the labels in the dataset may change, causing label drift.
- Customer sentiment analysis based on customer reviews. If the criteria for determining whether a review is positive or negative change, this causes label drift.
Page-Hinkley test:
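The Page-Hinkley test monitors a stream of values and signals a change when their running mean shifts. A minimal sketch using the PageHinkley detector from skmultiflow on a stream of binary labels; the simulated data and the default detector parameters are illustrative assumptions:

```python
import numpy as np
from skmultiflow.drift_detection import PageHinkley

# Simulated stream of binary labels: P(y=1) jumps from 0.1 to 0.7 at index 500.
rng = np.random.default_rng(42)
label_stream = np.concatenate([
    rng.binomial(1, 0.1, 500),
    rng.binomial(1, 0.7, 500),
])

ph = PageHinkley()
for i, label in enumerate(label_stream):
    ph.add_element(float(label))
    if ph.detected_change():
        print(f"Label drift detected at index {i}")
```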
Prediction Drift or Model Drift:
Prediction drift, also known as output drift, occurs when the distribution of a machine-learning model's predictions changes over time, often showing up as a drop in accuracy.
The reasons for prediction drift are
- Changes in the data distribution.
- Changes in the model’s performance.
- Changes in the real-world phenomena being predicted. For example, consider a model that is trained to predict the weather. If the weather patterns change significantly over time (e.g. due to climate change), this could cause the model’s predictions to become less accurate.
- Change in the distribution of the predicted label given by the model
- Change in P(Ŷ | X)
In the context of a classification model, we can represent the predicted class (ŷ) and the true class (y) using probability distributions. Prediction drift occurs if the difference between these probability distributions increases over time.
Methods Available to detect Prediction or Output Drift:
- Monitor model accuracy.
- Use cross-validation.
- Use data splits.
Using the Page-Hinkley method to detect prediction drift:
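As a sketch, one way to apply the Page-Hinkley detector here is to feed it the stream of per-prediction errors (1 when the model is wrong, 0 when it is right); the scikit-learn-style predict() call and the default detector parameters are assumptions:

```python
from skmultiflow.drift_detection import PageHinkley

ph = PageHinkley()

def monitor_prediction(model, x, y_true):
    """Update the drift detector with the model's error on one new observation."""
    y_pred = model.predict([x])[0]
    ph.add_element(float(y_pred != y_true))
    if ph.detected_change():
        print("Prediction drift detected - consider retraining the model")
```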
Some examples of prediction drift:
- Stock price prediction: changes to the economy or the company's performance.
- Fraud detection model: if the patterns of fraudulent activity change over time.
- Customer churn prediction: if the customer's behavior or needs change over time.
- Healthcare model: if the patient's health status or risk factors change over time.
Concept Drift or Task Drift:
- Concept drift refers to a change in the underlying data distribution that is being learned by a machine-learning model.
- Concept drift can occur over time as the data that is being collected changes, or it can be a result of changes in the real-world phenomena being modeled.
- Concept drift can make machine learning models less accurate, as they may no longer be able to accurately recognize patterns and relationships in the data.
- Concept drift is a change in the relationship between the input variables and the labels: the same inputs start mapping to different outputs, even if p(X) and p(Y) individually look stable.
- Concept drift is a change in the distribution of P(Y | X).
- Concept drift can render the current model invalid.
COVID-19 → Concept Drift:
For example, COVID-19 has caused the following:
- Changes in consumer behavior: People have altered their purchasing patterns in response to lockdowns, economic uncertainty, and other factors. This has caused concept drift in industries that rely on consumer data, such as retail, e-commerce, and advertising.
- Changes in economic data: For example, the unemployment rate and consumer spending patterns have changed significantly as a result of the pandemic, leading to concept drift in economic forecasting models.
- Changes in social data: People’s behavior and attitudes have changed in response to lockdowns, social distancing, and other measures. This has caused concept drift in social media and other online platforms, as well as in marketing and public opinion research.
- Changes in education patterns: Schools and universities have closed or moved to online learning. This has caused concept drift in education data and algorithms that are used to predict student outcomes.
- Changes in work patterns: Covid has caused significant changes in work patterns, as many people have shifted to remote work or experienced changes in their work schedules. This has caused concept drift in data and algorithms that are used to predict workforce productivity and employee outcomes.
- Changes in travel patterns: People have altered their travel plans in response to lockdowns and other restrictions. This has affected algorithms that are used to predict travel demand and patterns.
Methods Available to detect Concept Drift:
- Monitor model performance.
- Use data splits.
- Use statistical tests.
- Use drift detection algorithms.
- Use human input.
Below is a sketch of code that monitors model performance continuously.
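In this sketch, load_production_model(), evaluate_model(), retrain_model(), and get_fresh_data() are assumed project-specific helpers, and the threshold, window size, and evaluation interval are illustrative:

```python
import time
import numpy as np

ACCURACY_THRESHOLD = 0.85   # retrain when mean recent accuracy falls below this
WINDOW = 5                  # number of recent evaluations to average (N)
EVAL_INTERVAL_SECONDS = 3600

model = load_production_model()  # assumed helper
accuracy_history = []

while True:
    # Evaluate the current model on fresh test data and record the accuracy.
    X_test, y_test = get_fresh_data()
    accuracy_history.append(evaluate_model(model, X_test, y_test))

    recent = accuracy_history[-WINDOW:]
    if len(recent) == WINDOW and np.mean(recent) < ACCURACY_THRESHOLD:
        # Possible concept drift: retrain on fresh data and reset the history.
        model = retrain_model(X_test, y_test)
        accuracy_history = []

    time.sleep(EVAL_INTERVAL_SECONDS)
```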
The above code regularly evaluates the model on the test set and stores the accuracy in a list. If the mean accuracy over the past N evaluation intervals falls below a certain threshold, it could be an indication of concept drift, and the model is retrained on fresh data. If the mean accuracy is still above the threshold, the loop continues and the model is evaluated again after a certain interval.
Check out the articles below on real-world concept drift.
Create Alert and Notify:
To send a notification when machine learning drift is detected, you can, for example:
- Send an email
- Send a message to a Slack channel
- Send a message to a Microsoft Teams channel
- The code below defines three functions: send_email, send_slack_message, and send_teams_message, which can be used to send a message to an email address, Slack channel, or Microsoft Teams channel, respectively.
- To use these functions, you will need to configure your email server (if sending emails), Slack API client, or Microsoft Teams API client with the appropriate credentials and API keys.
- For example, you could call send_email to send an email to a designated recipient, or send_slack_message to send a message to a Slack channel.
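A minimal sketch of what these helpers could look like; the SMTP host, sender address, Slack token, channel name, and Teams webhook URL are placeholders, and slack_sdk and requests are assumed to be installed (the Teams helper uses an incoming-webhook connector):

```python
import smtplib
from email.message import EmailMessage

import requests
from slack_sdk import WebClient


def send_email(subject: str, body: str, recipient: str) -> None:
    # Assumes an SMTP server reachable at localhost:25; adjust for your setup.
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "drift-monitor@example.com"
    msg["To"] = recipient
    msg.set_content(body)
    with smtplib.SMTP("localhost", 25) as server:
        server.send_message(msg)


def send_slack_message(text: str, channel: str = "#ml-alerts") -> None:
    # Requires a Slack bot token with the chat:write scope.
    client = WebClient(token="xoxb-your-slack-token")
    client.chat_postMessage(channel=channel, text=text)


def send_teams_message(text: str, webhook_url: str) -> None:
    # Posts to a Microsoft Teams incoming-webhook URL.
    requests.post(webhook_url, json={"text": text})


# Example: notify all channels once drift has been detected.
drift_detected = True  # e.g. the output of one of the checks above
if drift_detected:
    message = "Data drift detected in the production feature pipeline."
    send_email("Data drift alert", message, "ml-team@example.com")
    send_slack_message(message)
    send_teams_message(message, "https://example.webhook.office.com/your-webhook-id")
```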
Python Packages for Drift Detection:
Some of the packages available to detect drift are:
- skmultiflow: One of the key features of skmultiflow is its ability to handle concept drift, which is the change in the underlying distribution of the data over time. It includes algorithms for detecting concept drift and adapting the machine learning model in real time to account for the changes in the data. This makes it an ideal tool for building machine learning models that can adapt and continue to perform well as the data changes over time.
- evidently: an open-source tool to analyze data drift. Evidently is an open-source Python library for data scientists and ML engineers. It helps evaluate, test, and monitor the performance of ML models from validation to production.
- TensorFlow Data Validation (TFDV): can analyze training and serving data to compute descriptive statistics, infer a schema, and detect data anomalies. The core API supports each piece of functionality, with convenience methods that build on top and can be called in the context of notebooks.
- Deepchecks: Deepchecks Open Source is a Python library for data scientists and ML engineers. The package includes extensive test suites for machine learning models and data, built in a way that's flexible, extendable, and editable.
- drift_detection: This package contains some developmental tools to detect and compare statistical differences between two structurally similar pandas dataframes. The intended purpose is to detect data drift, where the statistical properties of an input variable change over time. It provides a class DataDriftDetector which takes in two pandas dataframes and provides a few useful methods to compare and analyze the differences between the two datasets.
MLOps companies specializing in drift detection:
Here are some important key players.
- Datadog
- DataRobot
- H2O.ai
- Anodot
- Arize AI
- Superwise.ai
- Whylabs.ai
- ModelOp
- Domino Data Lab
- Algorithmia
- Databricks
- Fiddler
- Seldon
Drift Detection from the Major Cloud Providers:
- Azure: Azure Machine Learning supports data drift detection through dataset monitors.
- GCP: Vertex AI Model Monitoring can detect drift and training-serving skew for deployed models.
- AWS: Amazon SageMaker Model Monitor detects drift in data quality and model quality for models in production.
Check this Twitter thread.
Check this white paper for more info on statistical tests.
Conclusion:
Data drift is a common and potentially serious problem in machine learning. It occurs when the distribution or characteristics of the data that a model was trained on differ from the distribution or characteristics of the data that the model is being applied to. This can lead to reduced model accuracy, biased predictions, and other problems that can have serious consequences in a production environment. It is critical to monitor for data drift and take action to correct it when it occurs. This can help ensure that the model continues to perform accurately and effectively, preventing costly errors or failures in a production environment.
References:
- Databricks, ML in Production: https://github.com/databricks-academy/ml-in-production-english
- Evidently AI: https://www.youtube.com/watch?v=HGIgUH11nVo
- AWS re:Invent 2020: Detect machine learning (ML) model drift in production: https://www.youtube.com/watch?v=J9T0X9Jxl_w
- Databricks: https://www.youtube.com/watch?v=tGckE83S-4s
- Technically Speaking (E15): Machine learning model drift & MLOps pipelines: https://www.youtube.com/watch?v=aW11vOkSScA
- ML Drift: How to Identify Issues Before They Become Problems // Amy Hodler // MLOps Meetup #89: https://www.youtube.com/watch?v=--KcBoInuqw&t=363s
- Deployment and Monitoring: https://fullstackdeeplearning.com/spring2021/lecture-11/
- Choosing the Right Monitoring Tools for a Full-Stack Solution: https://devops.com/choosing-the-right-monitoring-tools-for-a-full-stack-solution/
- Aparna Dhinakaran's articles: https://aparnadhinak.medium.com/