Can we manage IT Operations with Machine Learning (ML) and AI, predicting major disruptive incidents in time and resolving them before a service outage happens? Or would it be too complex to run IT Operations with algorithms? The answer is a cautious yes: we can run IT Operations with algorithms and make predictions. Here is why. AI and ML techniques are catching up in this particular sector just as increasing IT complexity creates the need, and the pressure, to manage business applications and IT infrastructure more efficiently and intelligently. The answer to that need is Artificial Intelligence together with Machine Learning, in other words Artificial Intelligence for IT Operations (AIOps).
In this first post of our blog series on AIOps, we look at what AIOps is, how it supports seamless enterprise IT services to keep your lights on, and how AI, Machine Learning and Data Science principles are used to do that. This article is a first, zoomed-out, bird’s-eye view of AIOps. In the following blog posts, we will elaborate on the technical details behind this approach.
Main pillars of AIOps
We call work TOIL when it is tied to running a production service and tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and linear in how it scales as the service grows [1]. TOIL drains your IT team’s energy and time when they have to stay on top of massive volumes of data, alerts and tasks. Enterprises can have up to 10 million events per day [2]. The growing disconnect between these large amounts of data, the processes and the impact of monitoring on the business therefore calls for a new approach to managing and monitoring your IT infrastructure and services. This is where AIOps comes into play.
Before we delve further into AIOps, it is important to state that good observability of the state of the IT infrastructure and systems is a must. Your AI and Machine Learning models are only as useful as the data you can feed into them, and some fundamental IT infrastructure might be needed for that. We will discuss this in further detail in the next blog post, on domain-centric and domain-agnostic AIOps platforms. The main message here is: AIOps needs good observability and visibility into every aspect of the IT infrastructure and all the applications connected to it, in order to capture and monitor the complex interactions between IT services and applications across the entire IT landscape. If a component is not monitored or observed and it stops working properly, finding the root cause becomes a daunting challenge, since you do not have eyes and ears in that part of the landscape. That being said, let’s look at how AIOps works.
IT Operations Management maturity model
We can define AIOps as the use of Artificial Intelligence, Machine Learning and automation in IT Operations to transform the way IT Operations are managed by minimizing manual intervention by human operators. The goal is not to take the human out of the loop. On the contrary, AIOps is there to help human operators manage the ever-increasing complexity in IT Operations. This IT complexity has three dimensions:
- Volume: exponential growth of data generation by the IT infrastructure and business applications.
- Variety: different types of data, like metrics, logs, events, traces, documents.
- Speed: the rate of data generation is increasing rapidly.
Staying ahead of major incidents and overwhelming alerts requires predictive, automated statistical tools that can tackle the challenges these three dimensions of IT complexity cause. Human operators alone cannot deliver proper solutions to that.
AIOps solutions are designed to do that. Across different platforms and vendors, AIOps solutions can be characterized by four principles:
- Advanced data processing and analytics: ingestion of big data for real-time analysis of streams of data and historical analysis of stored data for training AI and ML models.
- Topological data analysis: mapping and discovering all the IT assets and applications across the IT landscape.
- Correlating events and other relevant data: mapping time and IT network topology to cluster related events, and discovering patterns and predicting events or incidents by continuously learning how the data behaves. Correlation is important for automating effective, efficient root cause analysis of IT service issues and incidents.
- Automated remediation: the IT landscape is monitored continuously with AI and ML. When anomalous behavior occurs and an IT issue arises, AIOps recommends a course of action to the human operator or, if enabled, triggers automated remediation to resolve the issue right away.
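To make the correlation principle more concrete, here is a minimal sketch in Python. It is not how any particular AIOps platform works. The topology, the event records, the 60-second window and the function names are all hypothetical and chosen only for illustration. The idea is to treat two events as related when they occur close together in time on the same component or on directly connected components, and then to merge related events into clusters.

```python
from collections import defaultdict

# Hypothetical topology: which components are directly connected.
TOPOLOGY = {
    "db-1": {"app-1", "app-2"},
    "app-1": {"db-1", "lb-1"},
    "app-2": {"db-1", "lb-1"},
    "lb-1": {"app-1", "app-2"},
    "mail-1": set(),
}


def related(a, b, window_s=60):
    """Two events are related if they are close in time and sit on
    the same component or on directly connected components."""
    close_in_time = abs(a["ts"] - b["ts"]) <= window_s
    same_or_adjacent = (
        a["node"] == b["node"] or b["node"] in TOPOLOGY.get(a["node"], set())
    )
    return close_in_time and same_or_adjacent


def correlate(events, window_s=60):
    """Group events into clusters using union-find over `related`."""
    parent = list(range(len(events)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(events)):
        for j in range(i + 1, len(events)):
            if related(events[i], events[j], window_s):
                parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for i, event in enumerate(events):
        clusters[find(i)].append(event["id"])
    return sorted(sorted(ids) for ids in clusters.values())


# Hypothetical events: ts is seconds since the start of the observation.
EVENTS = [
    {"id": "e1", "ts": 0, "node": "db-1"},
    {"id": "e2", "ts": 20, "node": "app-1"},
    {"id": "e3", "ts": 45, "node": "lb-1"},
    {"id": "e4", "ts": 30, "node": "mail-1"},
    {"id": "e5", "ts": 500, "node": "db-1"},
]

clusters = correlate(EVENTS)
print(clusters)  # [['e1', 'e2', 'e3'], ['e4'], ['e5']]
```

In this toy example the database, application and load balancer events end up in one cluster because each is linked to the next in both time and topology, which is the kind of grouping that gives root cause analysis a starting point. The pairwise comparison is fine for a handful of events but would need a smarter approach at the event volumes mentioned earlier.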
An example workflow of AIOps with Anomaly detection system.
The underlying dynamic of AIOps is that Machine Learning algorithms can detect and predict anomalies that could cause IT service issues, such as a service outage or a major disruptive incident. Furthermore, based on these predictions and advanced data processing, much more accurate, dynamic thresholds, rules, statistical baselines, events and alerts can be created. In a later series of blog posts, we will delve much deeper into how these advanced, complicated processes are realized.
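As a simple illustration of a dynamic threshold, the sketch below flags a metric value as anomalous when it falls outside the mean plus or minus k standard deviations of the preceding window of observations. This is a basic statistical baseline, not a Machine Learning model and not the method of any specific AIOps product. The function name, the window size, the factor k and the CPU readings are assumptions made for the example.

```python
from statistics import mean, stdev


def dynamic_threshold_anomalies(values, window=10, k=3.0):
    """Return the indices of values that fall outside
    mean +/- k * stdev of the `window` values before them."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        baseline = mean(history)
        spread = stdev(history)
        upper = baseline + k * spread
        lower = baseline - k * spread
        if values[i] > upper or values[i] < lower:
            anomalies.append(i)
    return anomalies


# Hypothetical CPU utilisation (%) sampled once a minute, with one spike.
cpu = [41, 43, 40, 42, 44, 41, 43, 42, 40, 43, 42, 41, 95, 43, 42]

anomaly_indices = dynamic_threshold_anomalies(cpu)
print(anomaly_indices)  # [12]
```

Unlike a fixed threshold such as "alert above 90%", the bounds here move with the recent behavior of the metric. One limitation of this simple version is that a spike inflates the baseline for the following windows, so a production system would typically use more robust statistics or a learned model.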
In this blog post, we gave a brief overview of how we can manage IT Operations with AI and Machine Learning algorithms. The upcoming posts in this series will delve much deeper into each concept we discussed here and provide a more technical overview. So stay tuned and join our AIOps journey!
Akif Baser is a multidisciplinary engineer with a passion for research and development in Artificial Intelligence and Econometrics.