Digging into the machine learning cycle: scoping, data, modeling, and deployment
Machine Learning Operations, known as MLOps, is an emerging discipline that comprises a set of practices for maintaining and deploying machine learning models in production so that they are efficient and reliable.
The term combines “machine learning” with DevOps, itself a set of practices for shortening the systems development life cycle while continuously delivering high-quality software.
MLOps covers the entire ML life cycle, which can be divided into four stages: scoping, data, modeling, and deployment.
Many of the concepts explained in this article come from the Coursera course “Introduction to Machine Learning in Production”, which I highly recommend.
Scoping is the process in which we plan the project and make decisions about what to accomplish and how to accomplish it. At this stage, key metrics such as accuracy or latency must be defined, as well as resources and schedule.
In general, this stage covers the following steps: (1) identify a business problem, (2) think about potential AI solutions, (3) assess the feasibility and value of each potential solution, and (4) determine the timeline and budget of the project.
Likewise, at this stage, it is necessary to carry out an initial evaluation of the technical feasibility of the project for the established schedule and budget. To do so, we can bring together the technical and commercial teams to agree on the defined conditions. If the team is unsure about the specifications, it’s probably a good idea to do a proof of concept (POC) to test the feasibility.
Once the project has been scoped, the next stage is to collect the right data, label it appropriately, and set a baseline to be used in the modeling stage.
Data problems can be categorized along two axes:
- Small data (≤10K examples) vs big data (>10K examples)
- Structured data vs unstructured data
For small data, clean labels are critical; for big data, the major challenge is data processing, that is, how the data is collected, filtered, sorted, processed, analyzed, stored, and then presented in a readable format. For unstructured data, humans can label examples and data augmentation can generate new ones, whereas for structured data it is generally difficult to obtain new data. A good practice when selecting the team for the project is to choose people who have worked on the same type of data.
As for the labels, the recommendation is to discuss with clients at the scoping stage how they will be obtained. Ideally, clients provide the labels. If that’s not the case, it is still valuable to involve them in the labeling process, since they are usually the ones with the domain knowledge.
If there are no labels, you need to decide how much effort to put into labeling. For example, if you plan to spend 3 days on the training stage and 2 days on the error analysis stage, then you should probably spend ~1-2 days labeling the data. Otherwise, the labeling cost grows out of proportion to the rest of the project. If there is a lot of data to label, it is better to label one part first, train the model and assess the error, and then label more, repeating this process iteratively.
Lastly, it is good practice to spend some time manually reviewing the labels (1) to get to know the data and the labels and (2) to ensure that most of the samples were labeled correctly.
2.2. Baseline definition
At this stage, it is always a good idea to establish a baseline. When the ground truth label has been externally defined, human-level performance (HLP) gives an estimate for Bayes error/irreducible error.
A low HLP may indicate ambiguous labeling instructions, which shows up when several labelers assign different labels to the same sample. Improving label consistency will not only raise HLP but also improve model performance, which is ultimately the goal of the project.
Besides HLP, a baseline can also come from similar solutions reported in the literature or from an older model.
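As a rough illustration of how label consistency could be measured, the sketch below computes the fraction of labeler pairs that agree and flags samples without a unanimous label. The function names, the defect labels, and the unanimity threshold are assumptions made for this example; they do not come from the course.

```python
from collections import Counter
from itertools import combinations


def pairwise_agreement(labels_per_sample):
    """Fraction of labeler pairs that agree, pooled over all samples."""
    agree = total = 0
    for labels in labels_per_sample:
        for a, b in combinations(labels, 2):
            agree += (a == b)
            total += 1
    return agree / total if total else 0.0


def ambiguous_samples(labels_per_sample, min_majority=1.0):
    """Indices of samples whose most common label has a share below min_majority."""
    flagged = []
    for i, labels in enumerate(labels_per_sample):
        top_count = Counter(labels).most_common(1)[0][1]
        if top_count / len(labels) < min_majority:
            flagged.append(i)
    return flagged


# Hypothetical annotations: each inner list holds the labels given
# to one sample by three different labelers.
annotations = [
    ["scratch", "scratch", "scratch"],
    ["scratch", "dent", "scratch"],
    ["dent", "dent", "dent"],
    ["dent", "scratch", "none"],
]

agreement = pairwise_agreement(annotations)
flagged = ambiguous_samples(annotations)
print(f"pairwise agreement: {agreement:.2f}, samples to review: {flagged}")
```

Samples flagged this way are good candidates for a discussion with the labelers about how to tighten the labeling instructions.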
The third stage is modeling, which is an iterative process between a training phase, an error analysis, and performance auditing.
This stage covers the training of the model to obtain the highest accuracy possible. There are two possible approaches:
- Model-centric view: Focus on building the best possible model for the available data. This approach holds the data fixed and iteratively improves the code/model.
- Data-centric view: Focus on using tools to improve data quality. This approach holds the code fixed and iteratively improves the data.
In general, it is better to use the second approach for real applications. In other words, it is better to focus your training on good data rather than big data, since a reasonable algorithm with good data will often outperform a great algorithm with not-so-good data.
To get started with the modeling, first search the literature to see what is possible and look for open-source implementations if they are available. It is also good practice to try to overfit a small training dataset before training on a large one. Lastly, implement a good tracking methodology (code used, hyperparameters, results, …) to keep a clear view of the experiments you launched and how you arrived at your most accurate models.
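A tracking methodology does not have to start with a dedicated tool. The sketch below is a minimal, assumed setup that appends one JSON record per experiment to a file and retrieves the best run for a given metric. The file name, field names, and example values are hypothetical; a team may well prefer an existing experiment tracker.

```python
import json
import tempfile
import time
from pathlib import Path


def log_run(log_path, code_version, hyperparams, metrics):
    """Append one experiment record as a JSON line."""
    record = {
        "timestamp": time.time(),
        "code_version": code_version,
        "hyperparams": hyperparams,
        "metrics": metrics,
    }
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def load_runs(log_path):
    """Read back every logged experiment."""
    with Path(log_path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def best_run(log_path, metric):
    """Return the run with the highest value for the given metric."""
    return max(load_runs(log_path), key=lambda run: run["metrics"][metric])


# Usage with made-up values; a temporary directory keeps the example tidy.
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "runs.jsonl"
    log_run(log_file, "commit-a1", {"lr": 0.1, "epochs": 5}, {"val_accuracy": 0.81})
    log_run(log_file, "commit-a1", {"lr": 0.01, "epochs": 5}, {"val_accuracy": 0.88})
    best = best_run(log_file, "val_accuracy")

print(best["hyperparams"], best["metrics"])
```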
3.2. Error analysis
Once you have trained the model, compare its performance to the established baseline to decide whether to put more effort into training with the current data or to move on to the deployment stage.
The error analysis will tell you if there is room for improvement in each category. From this analysis, you can determine if it is necessary to collect more data, improve the accuracy of the labels or use data augmentation to improve the accuracy of classes that did not reach the baseline.
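One simple way to run this comparison is to compute the accuracy per category and rank the categories by their gap to the baseline. The sketch below assumes a classification task and a per-class baseline (for instance, HLP); the class names and baseline values are invented for illustration.

```python
from collections import defaultdict


def per_class_accuracy(y_true, y_pred):
    """Accuracy computed separately for each ground-truth class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += (truth == pred)
    return {cls: correct[cls] / total[cls] for cls in total}


def gaps_to_baseline(accuracy, baseline):
    """Baseline minus model accuracy per class, largest gap first.

    A positive gap means the model is still below the baseline for that class.
    """
    gaps = {cls: baseline[cls] - accuracy.get(cls, 0.0) for cls in baseline}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


# Hypothetical labels and predictions.
y_true = ["cat"] * 4 + ["dog"] * 4 + ["bird"] * 2
y_pred = ["cat", "cat", "cat", "dog", "dog", "dog", "cat", "cat", "bird", "bird"]

# Hypothetical per-class baseline, e.g. human-level performance.
baseline = {"cat": 0.90, "dog": 0.90, "bird": 0.95}

accuracy = per_class_accuracy(y_true, y_pred)
ranked = gaps_to_baseline(accuracy, baseline)
for cls, gap in ranked:
    print(f"{cls}: accuracy={accuracy[cls]:.2f}, gap to baseline={gap:+.2f}")
```

The classes at the top of the ranking are where collecting more data, cleaning labels, or applying data augmentation is most likely to pay off.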
3.3. Performance auditing
Once you have concluded the error analysis, check the model for accuracy, fairness, and bias. Error analysis, user feedback, and benchmarking of competitors can provide inspiration for adding new features.
In addition, think about the current results and brainstorm ways the system could go wrong in the deployment stage. Also, establish metrics to assess performance in the deployment stage to ensure there are no biases in some of the data while the model is in production. At this stage, stay as aligned as possible with the business or product owner to include as much domain knowledge as possible.
The deployment stage is the final step in the machine learning life cycle and delivers the final product to the customer. Within this stage, the two main challenges are:
- Concept drift (the relationship between the inputs and the target variable changes over time) and data drift (the statistical properties of the input data change over time).
- Software engineering issues such as real-time vs batch, cloud vs browser, compute resources (CPU/GPU/memory), the maximum allowed latency, or security and privacy.
Like the modeling stage, deployment is an iterative process that cycles between model deployment, monitoring, and system maintenance.
4.1. Model deployment
The common deployment cases are:
- There is a new product/capability
- There is a new system that replaces an old ML system
- There is a system that will automate/assist with a manual task
Depending on the deployment case, there are different types of deployment:
- Shadow mode: New data runs through the newly deployed model, but its predictions are not returned to the customers.
- Canary deployment: The application or service is released incrementally to a subset of users.
- Blue-green deployment: Two identical environments, a blue one and a green one, run different models, and traffic is switched from one to the other. This is generally used when an old model is already running in production.
In general, it is good practice (1) to have an easy way to roll back to the old system, and (2) to deploy the system with gradual ramp-up and monitoring.
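To make gradual ramp-up and rollback concrete, here is a minimal sketch of canary routing. It assumes each request carries a user ID and that the rollout percentage lives in some configuration value; the function names and the "old"/"new" labels are hypothetical. Hashing the user ID keeps each user on the same model between requests, and setting the percentage back to 0 acts as the rollback.

```python
import hashlib


def bucket(user_id, buckets=100):
    """Map a user ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def choose_model(user_id, canary_percent):
    """Route a user to the new model if their bucket falls inside the canary share."""
    return "new" if bucket(user_id) < canary_percent else "old"


# Hypothetical ramp-up: 5% of users first, then 25%.
users = [f"user-{i}" for i in range(200)]
canary_5 = {u for u in users if choose_model(u, 5) == "new"}
canary_25 = {u for u in users if choose_model(u, 25) == "new"}

# Rollback: a canary share of 0 sends everyone back to the old model.
rolled_back = {choose_model(u, 0) for u in users}

print(len(canary_5), len(canary_25), rolled_back)
```

Because the bucket of a user never changes, anyone included at 5% is still included at 25%, so increasing the share only adds users to the new model.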
Lastly, when deploying the model with any of the mentioned techniques, this model does not necessarily need to replace a human. In fact, there are different degrees of automation, as displayed in the next figure.
4.2. Monitoring
This entails monitoring your ML models for changes such as data drift, concept drift, and model degradation, and ensuring that the model is maintaining an acceptable level of performance. It is important not only to monitor the output metrics that will be displayed to clients but also input metrics (number of missing values, distribution of the features, …) and software metrics (memory, latency, throughput, server load, …). To build the monitoring system, you can brainstorm with the team what things might go wrong and a few statistics/metrics/KPIs that will spot the problem.
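As an example of input metrics, the sketch below computes the missing-value rate of a feature and a Population Stability Index (PSI) between a reference sample (for instance, the training data) and current production data. PSI is one common drift statistic, used here as an assumption rather than something the course prescribes. The 0.25 alert level is a frequently quoted rule of thumb, not a universal threshold, and the data is synthetic.

```python
import math


def missing_rate(values):
    """Share of missing (None) entries in a feature column."""
    return sum(v is None for v in values) / len(values)


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two samples of a numeric feature.

    Bins are equal-width over the reference range; current values outside
    that range fall into the first or last bin. Inputs must not contain None.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant features

    def histogram(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or a division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = histogram(reference), histogram(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Synthetic feature values: production data shifted toward higher values.
reference = [i / 100 for i in range(100)]
current = [0.5 + i / 200 for i in range(100)]

drift_score = psi(reference, current)
ALERT_THRESHOLD = 0.25  # rule-of-thumb value; tune it for your own data
print(f"PSI={drift_score:.2f}, alert={drift_score > ALERT_THRESHOLD}")
print(f"missing rate={missing_rate([1.0, None, 3.0, None]):.2f}")
```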
4.3. System maintenance
The monitoring system lets us spot new issues related to concept drift or data drift that need to be resolved. Based on these insights, we can retrain the model on the new data, either manually or automatically, to recover accuracy. In practice, consumer user data tends to drift slowly, whereas data in business-to-business (B2B) applications can shift faster.
Nowadays, two distinct groups are emerging around machine learning projects. On the one hand, there are those working in research, trying to bring new and innovative models to the ML community. On the other hand, there are those working on real-world machine learning projects, who mainly take models from the literature and put them into production. For the latter, I consider that knowing the full machine learning life cycle and mastering at least one of its stages is necessary to become a successful professional.