
Ensuring Trustworthy ML Systems With Data Validation and Real-Time Monitoring | by Paul Iusztin | Jun, 2023

June 4, 2023


Theoretical Concepts & Tools

Data Validation: Data validation refers to the process of ensuring data quality and integrity. What do I mean by that?

As you automatically gather data from different sources (in our case, an API), you need a way to continually validate that the data you just extracted follows a set of rules that your system expects.

For example, you expect that the energy consumption values are:

  • of type float,
  • not null,
  • ≥0.

While you were developing the ML pipeline, the API returned only values that respected these rules, or what data people call a "data contract."

But once you leave your system running in production for a month, a year, or two, you never know what might change in data sources you don't control.

Thus, you need a way to constantly check these characteristics before ingesting the data into the Feature Store.

Note: To see how to extend this concept to unstructured data, such as images, you can check my Master Data Integrity to Clean Your Computer Vision Datasets article.

Great Expectations (aka GE): GE is a popular tool that lets you easily validate data and report the results. Hopsworks has GE support: you can attach a GE validation suite to Hopsworks and choose how the system behaves when new data is inserted and the validation step fails; read more about GE + Hopsworks [2].

Screenshot of GE data validation runs inside Hopsworks [Image by the Author].
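
As a rough sketch, here is how the three rules above could be encoded as a GE expectation suite. The column name energy_consumption is an illustrative assumption, and the commented-out Hopsworks call is indicative only; the exact hsfs API depends on your version.

```python
from great_expectations.core import ExpectationConfiguration, ExpectationSuite

# Encode the "data contract" from above: energy consumption values
# must be floats, not null, and >= 0.
suite = ExpectationSuite(expectation_suite_name="energy_consumption_suite")

suite.add_expectation(ExpectationConfiguration(
    expectation_type="expect_column_values_to_be_of_type",
    kwargs={"column": "energy_consumption", "type_": "float64"},
))
suite.add_expectation(ExpectationConfiguration(
    expectation_type="expect_column_values_to_not_be_null",
    kwargs={"column": "energy_consumption"},
))
suite.add_expectation(ExpectationConfiguration(
    expectation_type="expect_column_values_to_be_between",
    kwargs={"column": "energy_consumption", "min_value": 0},
))

# Attaching the suite to a Hopsworks feature group (illustrative only;
# the feature group object and policy value depend on your hsfs version):
# feature_group.save_expectation_suite(
#     suite, validation_ingestion_policy="STRICT"
# )
```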

Ground Truth Types: While your model runs in production, you can access your ground truth in 3 different scenarios:

  1. real-time: an ideal scenario where you can easily access your target. For example, when you recommend an ad, the consumer either clicks it or doesn't.
  2. delayed: you will eventually access the ground truth, but unfortunately, it will arrive too late to react adequately.
  3. none: you can't automatically collect any GT. In these cases, you usually have to hire human annotators if you need any actuals.
Ground truth/targets/actuals types [Image by the Author].

In our case, we are somewhere between #1 and #2: the GT isn't available precisely in real time, but only with a delay of 1 hour.

Whether a 1-hour delay is acceptable depends a lot on the business context, but let's say that, in our case, it is. That means we are in luck: we have access to the GT in real time(ish).

This means we can use metrics such as MAPE (mean absolute percentage error) to monitor the model's performance in real time(ish), as sketched below.
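
Here is a minimal sketch of computing MAPE once the (roughly one-hour-delayed) observations arrive; the values below are toy data for illustration.

```python
import numpy as np

def mape(y_true, y_pred) -> float:
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# Toy example: hourly energy consumption observations vs. forecasts.
observed = [120.0, 98.5, 110.2, 105.0]
predicted = [118.0, 101.0, 108.0, 109.5]
print(f"MAPE: {mape(observed, predicted):.2f}%")  # ~2.62%
```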

In scenarios #2 or #3, we would have needed to use data and concept drift as proxy metrics to compute performance signals in time.

Screenshot of the observations and predictions overlaid over time. As you can see, the GT isn't available for the last 24 hours of forecasts [Image by the Author].

ML Monitoring: ML monitoring is the process of ensuring that your production system works well over time. It also gives you a mechanism to proactively adapt your system, such as retraining your model in time or adapting it to changes in the environment.

In our case, we will continually compute the MAPE metric. Thus, if the error suddenly spikes, you can raise an alarm to inform you or automatically trigger a hyperparameter-tuning step to adapt the model configuration to the new environment.

Screenshot of the mean MAPE metric across all the time series, computed over time [Image by the Author].
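
A minimal sketch of the alerting logic described above; the threshold value and the retraining hook are assumptions for illustration, not part of the original pipeline.

```python
MAPE_ALERT_THRESHOLD = 15.0  # illustrative; choose based on business context

def check_model_performance(current_mape: float) -> None:
    """React when the continually computed MAPE spikes above the threshold."""
    if current_mape > MAPE_ALERT_THRESHOLD:
        # In a real system, this could notify an on-call engineer or
        # trigger a hyperparameter-tuning / retraining pipeline.
        print(f"ALERT: MAPE {current_mape:.2f}% exceeded "
              f"{MAPE_ALERT_THRESHOLD:.2f}% threshold")

check_model_performance(18.3)  # would fire an alert
```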


