A first step to MLOps
I have been interested in MLOps for a while now. I first learned about it from machine learning engineers; as a PhD student at the time, I was not aware of its existence. However, my curiosity was piqued, and I began learning about it. Looking back, I regret not discovering it sooner, as it would have helped me optimize my machine learning workflow.
In this article, I will attempt to provide a beginner-friendly introduction to MLOps and explain the key concepts in a simple way. As someone who also found it challenging to understand at first, I understand the need for a simpler introduction to this topic. My hope is that after reading this article, a beginner will feel more comfortable reading more advanced documentation on MLOps.
Table of contents:
· 1. Motivation towards MLOps
· 2. Definition
· 3. MLOps lifecycle
· 4. MLOps workflow
∘ 4.1. Business Problem
∘ 4.2. Data Engineering
∘ 4.3. ML model Engineering
∘ 4.4. Code Engineering
· 5. Conclusion
1. Motivation towards MLOps
Due to the successes of machine learning techniques in various research fields, many companies have sought to incorporate them into their software systems to improve efficiency and solve real-world problems. However, implementing machine learning in production environments can be a challenging and time-consuming process for many companies. Additionally, once deployed, the model must be managed and maintained, and its performance must be monitored to ensure it is functioning properly. These tasks can be especially difficult in large software systems.
On the other hand, software engineers use the DevOps (Development and Operations) paradigm, a set of practices and tools that facilitate collaboration and communication between development and operations teams, to develop and manage their systems. This helps maintain development speed and quality. MLOps aims to adapt these DevOps principles for machine learning systems. With this background in mind, what is MLOps exactly?
2. Definition
To define MLOps, let’s begin by examining various definitions:
Definition 1:
“MLOps (Machine Learning Operations) is a paradigm, including aspects like best practices, sets of concepts, as well as a development culture when it comes to the end-to-end conceptualization, implementation, monitoring, deployment, and scalability of machine learning products.” [1]
Definition 2:
“The extension of the DevOps methodology to include Machine Learning and Data Science assets as first-class citizens within the DevOps ecology” [2]
Definition 3:
“We can use the definition of Machine Learning Engineering (MLE), where MLE is the use of scientific principles, tools, and techniques of machine learning and traditional software engineering to design and build complex computing systems. MLE encompasses all stages from data collection, to model building, to making the model available for use by the product or the consumers.” (by A. Burkov) [3]
Based on the previous definitions, we can understand MLOps as a set of techniques and practices used to design, build, and deploy machine learning models in an efficient, optimized, and organized manner. These techniques and practices are often discussed within the context of the MLOps lifecycle.
3. MLOps lifecycle
The MLOps lifecycle consists of the steps and techniques involved in the MLOps paradigm, from designing and developing a machine learning model to deploying it in a production environment and monitoring and maintaining it over time. It is typically divided into three main stages:
- The first stage is the design process, which involves defining the business problem, the model’s requirements, and its intended use-case. This often involves creating an AI/ML canvas.
- The second stage is the model development process that includes data and model engineering.
- The third stage is the operations process that covers model deployment and maintenance.
It is important to maintain the performance of the model over time after it has been deployed, so these stages are typically carried out in a cyclic manner. This ensures that the model is performing well and still meeting the needs defined in the first stage. Now that we have discussed the stages of the MLOps lifecycle, let’s examine the MLOps workflow, which outlines the specific tasks and activities that are performed at each stage of the process.
4. MLOps workflow
The MLOps workflow outlines the steps to follow in order to develop, deploy, and maintain machine learning models. In an ideal world, following the workflow would be sufficient: first, the business problem is understood, then the model is chosen, trained, and deployed. However, this is not always the case in the real world. At any point, it may be necessary to return to a previous step. In addition, after deploying the model, it must be maintained and monitored, which is why it is important to understand both the MLOps lifecycle and the MLOps workflow.
4.1. Business Problem
The first step in the MLOps workflow is understanding the business problem, which involves defining the model’s input and output, as well as the process and its various subtasks. To structure this process, you can use the AI (Artificial Intelligence) canvas or the ML (Machine Learning) canvas, which can be thought of as templates for organizing the MLOps workflow. The AI canvas generally provides a high-level structure for ML/AI implementation, while the ML canvas provides a high-level description and specifics of the system. You can read more about these canvases here.
Let’s take an example! Suppose that, in order to improve its products, a dairy company wants to gather feedback from its consumers. To do this, sentiment analysis needs to be performed on the comments consumers make about the products on social media platforms. Machine learning techniques can be used to train a model to classify the sentiment of these comments as positive, negative, or neutral. This will allow the company to better understand its customers’ experiences with its products and identify areas for improvement. This business problem description can be transformed into an AI canvas and/or an ML canvas for a clearer representation:
- Prediction/ prediction task: The AI system will analyze text input and predict the sentiment of the text (positive, negative, or neutral).
- Judgment: The system will use natural language processing techniques to understand the meaning and sentiment of the text.
- Action/ decisions: Based on the predicted sentiment, the system may take different actions, such as flagging negative reviews for further review or prioritizing positive social media posts for promotion.
- Outcome: The desired outcome is for the system to accurately classify the sentiment of the text input, leading to improved customer satisfaction, better social media engagement, or other benefits depending on the specific use case.
- Training: The system will be trained on a dataset of labeled text data, containing both the input text and the corresponding sentiment label.
- Input/ Data sources: The system will accept text input from a variety of sources, such as social media posts or customer reviews.
- Output / Making predictions: The system will analyze text input and predict the sentiment of the text (positive, negative, or neutral).
- Feedback: The system may incorporate feedback from users or stakeholders to improve its performance over time, for example by adjusting the parameters of the natural language processing algorithms or adding new data to the training dataset.
- Offline evaluation: The system will be evaluated using standard evaluation metrics such as precision, recall, and F1 score to ensure that it is accurately classifying the sentiment of the text input.
- Live monitoring: The system will be continuously monitored and updated as needed to ensure that it continues to perform accurately over time.
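To make the offline-evaluation item of the canvas concrete, here is a minimal sketch of how per-class precision, recall, and F1 score could be computed for the sentiment classifier. The labels and predictions below are made-up toy values, not output from a real model.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, and F1 score for a single sentiment class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
    predicted = sum(1 for p in y_pred if p == label)  # how often we said `label`
    actual = sum(1 for t in y_true if t == label)     # how often it truly was `label`
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy ground-truth labels vs. hypothetical model predictions.
y_true = ["positive", "negative", "neutral", "positive", "negative"]
y_pred = ["positive", "negative", "positive", "positive", "neutral"]
precision, recall, f1 = per_class_metrics(y_true, y_pred, "positive")
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.67 1.0 0.8
```

In practice, a library such as scikit-learn provides these metrics out of the box; the point here is only to show what the canvas's "offline evaluation" entry measures.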
4.2. Data Engineering
After understanding the business problem at hand, the next step in the MLOps workflow is the data engineering process. This includes data ingestion, exploration and validation, data cleaning, data labeling, and data splitting.
- Data ingestion involves using a set of techniques to gather the data, create backups, protect private information, create a metadata catalog, and sample a test set to avoid data snooping bias.
- To explore and validate the dataset, a set of statistics and visualization techniques are used.
- Gathered data often has noise, contains outliers, and has missing values. These issues can affect the next process, so the data cleaning step is applied to address them.
- Data labeling is necessary when the chosen model is based on supervised learning. This step can be done manually, automatically, or semi-automatically.
- Data splitting is the final step in this process and involves dividing the data into training, validation, and test sets.
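The data-splitting step above can be sketched as follows. The 80/10/10 ratios and the fixed seed are illustrative choices, not a prescription; fixing the seed simply makes the split reproducible across runs.

```python
import random

def split_dataset(records, train=0.8, val=0.1, seed=42):
    """Split records into train / validation / test partitions."""
    # Shuffle a copy so the original ordering is left untouched.
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

data = list(range(100))  # stand-in for 100 labeled comments
train_set, val_set, test_set = split_dataset(data)
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```

Note that for the test set sampled during data ingestion (to avoid data snooping bias), the split must be done before any exploration of the data.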
4.3. ML model Engineering
The third step in the MLOps workflow is machine learning engineering, which includes model training, model evaluation, model testing, and model packaging.
- Training models involves feature engineering, code review and versioning, and hyperparameter tuning. You may wonder why feature engineering is included in this step rather than the previous one. The reason is that many types and architectures of models are tested in this step, so the feature engineering is often not the same for all the models. It’s worth noting that several models are trained and tested before selecting the most appropriate model in this step.
- Model evaluation involves validating the model to ensure that it meets the business objectives described in the business problem step.
- In the model testing step, the model acceptance test is performed using the initial test set.
- Once the model is validated and tested, the final step is to export the model in a specific format so it can be served to the business application.
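As a rough illustration of the packaging step, the sketch below serializes a stub "model" with Python's built-in pickle and restores it, as a serving process would. The keyword lists are hypothetical; a real project would export an actual trained model, often via joblib or a portable format such as ONNX, but the round-trip idea is the same.

```python
import io
import pickle

class KeywordSentimentModel:
    """Stub standing in for a trained sentiment classifier (hypothetical)."""
    POSITIVE = {"great", "love", "tasty"}
    NEGATIVE = {"bad", "sour", "awful"}

    def predict(self, text):
        words = set(text.lower().split())
        if words & self.POSITIVE:
            return "positive"
        if words & self.NEGATIVE:
            return "negative"
        return "neutral"

# "Package" the validated model into a serializable artifact...
buffer = io.BytesIO()
pickle.dump(KeywordSentimentModel(), buffer)

# ...and later restore it on the serving side.
buffer.seek(0)
restored = pickle.load(buffer)
print(restored.predict("I love this yogurt"))  # positive
```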
4.4. Code Engineering
In this step, the model is ready to be deployed to production. Model deployment consists of three steps: model serving, performance monitoring, and performance logging.
- To serve a model, the serving pattern and deployment strategy must be considered. The serving pattern refers to how the model is integrated into the software, such as integrating it as a service, as a dependency, using precomputed serving, on-demand serving, or hybrid serving. The deployment strategy refers to the method used to wrap the model, such as deploying it as a Docker container or as serverless functions.
- Monitoring the model involves observing the overall behavior of the model, such as the deviation of its predictions from previous model performance.
- Performance logging involves saving the results of the model’s predictions in a log record.
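The serving and logging steps above can be sketched together as follows, assuming a stub classifier in place of the deployed model. The JSON record format is a hypothetical choice; the point is that every prediction is written to a log record that monitoring can later inspect.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model-serving")

def classify(text):
    # Stand-in for the deployed sentiment model.
    return "positive" if "love" in text.lower() else "neutral"

def serve(text):
    """On-demand serving wrapper: predict, then log the prediction record."""
    record = {
        "ts": time.time(),
        "input": text,
        "prediction": classify(text),
    }
    logger.info(json.dumps(record))
    return record

result = serve("I love this cheese")
```

A monitoring job can then aggregate these records over time, for example to detect a drift in the distribution of predicted sentiments compared with the model's behavior at deployment.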
5. Conclusion
In this article, we provided a brief introduction to MLOps. We discussed the need for MLOps, presented various definitions, explained the MLOps lifecycle, and described the MLOps workflow. If you would like to learn more about MLOps, I recommend ml-ops.org for additional information.
This is the first article on MLOps and certainly not the last! I will be writing more tutorials on MLOps and its various technologies, with examples, so stay tuned. If you have any questions or suggestions, feel free to leave me a comment below.
[1] Kreuzberger, D., Kühl, N., & Hirschl, S. Machine Learning Operations (MLOps): Overview, Definition, and Architecture. arXiv preprint arXiv:2205.02302, 2022.
[3] https://ml-ops.org/content/motivation#mlops-definition
All images and figures in this article whose source is not mentioned in the caption are by the author.