Service Operations with Predictive AIOps

Business applications are the lifeline of every organization. Over the decades, I have seen organizations carve out their niche by adopting a plethora of technologies. For close to three decades, I have worked on technologies that help organizations minimize planned and unplanned operational outages while delivering predictable application performance. We will walk through how the problem statement and solution approaches have evolved over time.

In the mid-1990s, I was with a team developing a Network Management suite on the Windows platform. Those were the days when computer networks were evolving and standards were converging towards IP networks. The problem definition then was to ensure uptime of network devices such as hubs, switches, and routers, and to ensure there were no bad packets or retransmissions on the 10Base-T networks. I was tasked with developing an application to simplify setting thresholds for various network parameters on the network devices. When any threshold was breached, the network device would generate an SNMP alert to help the operations team address the issue.

The challenge then was choosing the network metrics that mattered the most, setting an appropriate threshold value for each, and ensuring the SNMP alerts were in fact delivered to the management platform. If the threshold value was set too low, there would be a flood of alerts that brought down the efficiency of the operations team. If the threshold value was set too high, we would see the issue only after the computer network was already impacted. In addition, the computer networks of that time could not handle too many alerts at once; in a sense, the networks themselves experienced alert fatigue. This was manageable since the computer networks were not very complex and the number of metrics monitored was not huge.
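The threshold trade-off described above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the `breach_alerts` helper, the sample values, and the two threshold settings are hypothetical and are not taken from any product mentioned in this article. The function raises one alert each time the metric crosses above the threshold, roughly the way a device might send a single SNMP alert per breach.

```python
from typing import List


def breach_alerts(samples: List[float], threshold: float) -> List[int]:
    """Return the sample indices at which the metric crosses above the threshold."""
    alerts = []
    above = False
    for index, value in enumerate(samples):
        if value > threshold and not above:
            alerts.append(index)  # alert only on the crossing, not on every sample
        above = value > threshold
    return alerts


# Hypothetical retransmission rates (percent), sampled once a minute.
RETRANSMIT_PCT = [0.5, 1.2, 0.8, 2.5, 0.9, 1.4, 6.0, 9.5, 12.0, 3.0]

low_threshold_alerts = breach_alerts(RETRANSMIT_PCT, threshold=1.0)
high_threshold_alerts = breach_alerts(RETRANSMIT_PCT, threshold=10.0)

print("threshold 1.0  ->", low_threshold_alerts)
print("threshold 10.0 ->", high_threshold_alerts)
```

With the low setting, the team is alerted three times, twice for blips that clear on their own. With the high setting, the single alert arrives at minute 8, two samples after the rate had already started climbing steeply.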

In the first two decades of this millennium, I worked on mission-critical and fault-tolerant technologies, with a focus on predicting and preparing for issues that would impede application service uptime. The scope of uptime included application availability and delivery of predictable performance while handling peak workloads. Orchestrating service-level failovers, including takeover approaches, was among the techniques we implemented to improve service-level uptime and minimize alert fatigue for the operations team. We built products to automate identification of the key metrics to monitor, orchestrate processing of relevant alerts, and automate changes to ensure high levels of application service availability.

The operations team would be informed of critical events as they occurred and of the corrective actions taken. This data would be used to plan and implement any changes essential to prevent potential outage conditions from occurring again.

Fast forward to today's environment: the problem statement has not changed much, while the complexity of the technology environment has increased exponentially. We still monitor a set of critical metrics and choose the course of action to take when something goes wrong. Compared to the 90s, when there were not that many critical applications, today even email platforms carry the mission-critical tag, as they form the communication backbone of every organization. We monitored just the computer network then, while today we must monitor everything from the computer network all the way to an individual application service.

We depend on multiple tools, and tiers of tools, to monitor business infrastructure, applications, and application services. Yet we are still plagued by alert fatigue that impacts our operations teams, and application outages persist. It is hard to manually identify all the essential metrics to monitor, and hard to remediate quickly once an issue is detected.

The need of the hour is to automate identification of the right triggers (a trigger being the combination of a metric and its tolerable threshold value) and to quickly narrow down to the corresponding remediation for the trigger that fired. With the commoditization of Big Data and Machine Learning, we can now ingest and process significant volumes of operational logs to detect anomalies, which helps us build “a virtual team of Operations SMEs on steroids”.

Using the operational logs, we can automatically construct operational baselines and detect breaches, or even potential outages, by observing trends. This simplifies the 'set threshold' operation and helps us put the monitoring of our critical technology ecosystem on autopilot. Big data and ML solutions hosted on cloud platforms help us crowdsource newer metrics to monitor.
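As a rough illustration of building a baseline from operational data instead of hand-setting a threshold, the sketch below flags a sample as anomalous when it deviates strongly from the mean of the samples just before it. This is a simple rolling z-score, chosen here only for illustration; it is not a description of how any particular product computes baselines. The function name `detect_anomalies`, the window size, the z-score limit, and the latency values are all assumptions.

```python
from statistics import mean, stdev
from typing import List


def detect_anomalies(values: List[float], window: int = 5, z_limit: float = 3.0) -> List[int]:
    """Flag indices whose value deviates from the rolling baseline by more than z_limit sigmas."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]  # the most recent `window` samples
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: any change at all is a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > z_limit:
            anomalies.append(i)
    return anomalies


# Hypothetical response-time samples in milliseconds, as might be parsed from logs.
LATENCY_MS = [100, 102, 98, 101, 99, 100, 103, 97, 180, 101, 100]

anomalous_indices = detect_anomalies(LATENCY_MS)
print(anomalous_indices)
```

No threshold for "too slow" was configured anywhere: the spike to 180 ms at index 8 stands out only because it is far outside what the preceding samples established as normal. A design note on the trade-off: a short window adapts quickly but lets a single outlier inflate the baseline for the next few samples, which is one reason production systems typically use longer histories and seasonality-aware models.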

Coming to the specific technologies that address today's challenges, ServiceNow IT Operations Management (ITOM) helps build a centralized record, the CMDB, to manage the IT infrastructure. By providing a centralized record of applications and infrastructure, and of how they are related, it helps to quickly diagnose and remediate outages, minimize the risk of changes, optimize infrastructure spend, lower operational costs, and avoid software license compliance penalties. ITOM Discovery helps build this database and keep it up to date. Machine learning algorithms identify everything from infrastructure elements and virtual machines all the way to applications to build the CMDB.

ServiceNow ITOM Service Mapping automates the service mapping process, creating a complete, up-to-date, and accurate record of digital services in the CMDB. It works with Discovery to build the application service details on top of the infrastructure elements in the CMDB. With this, we have the complete inventory of network devices, servers, storage, applications, and application services available in a single database, organized in a way that helps the operations team identify resource relationships.

With the complete infrastructure inventory in the CMDB, we can now leverage ServiceNow Predictive AIOps to simplify managing today's overwhelmingly complex and ever-evolving technology environment. AIOps combines big data and machine learning to automate IT operations processes, including event correlation, anomaly detection, and causality determination. This approach ingests event data from the monitoring sources, then correlates events and identifies root causes with AI/ML capabilities.
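To make the idea of event correlation concrete, here is a minimal sketch that groups events when they trace back to the same underlying infrastructure element and arrive close together in time. It is a simplified, rule-based stand-in, not ServiceNow's algorithm or API: the `DEPENDS_ON` map (a toy substitute for CMDB relationships), the CI names, the event tuples, and the 300-second window are all hypothetical.

```python
from typing import Dict, List, Tuple

Event = Tuple[int, str, str]  # (timestamp in seconds, CI name, message)

# Hypothetical dependency map: each configuration item (CI) points to the CI it depends on.
DEPENDS_ON: Dict[str, str] = {
    "web-01": "db-01",
    "web-02": "db-01",
    "db-01": "switch-a",
    "mail-01": "switch-b",
}


def root_of(ci: str, depends_on: Dict[str, str]) -> str:
    """Follow the dependency chain to the CI at the bottom (guarding against cycles)."""
    seen = set()
    while ci in depends_on and ci not in seen:
        seen.add(ci)
        ci = depends_on[ci]
    return ci


def correlate(events: List[Event], depends_on: Dict[str, str], window: int = 300) -> List[dict]:
    """Group events that share a root CI and occur within `window` seconds of the group's last event."""
    groups: List[dict] = []
    for ts, ci, msg in sorted(events):
        root = root_of(ci, depends_on)
        for group in groups:
            if group["root"] == root and ts - group["last_ts"] <= window:
                group["events"].append((ts, ci, msg))
                group["last_ts"] = ts
                break
        else:
            # No open group matched: start a new one.
            groups.append({"root": root, "last_ts": ts, "events": [(ts, ci, msg)]})
    return groups


EVENTS: List[Event] = [
    (0, "switch-a", "port errors"),
    (30, "db-01", "connection timeouts"),
    (45, "web-01", "HTTP 500"),
    (60, "web-02", "HTTP 500"),
    (90, "mail-01", "queue backlog"),
    (1000, "web-01", "HTTP 500"),
]

groups = correlate(EVENTS, DEPENDS_ON)
for group in groups:
    print(group["root"], len(group["events"]))
```

Six raw events collapse into three groups. The first holds four events, all pointing at `switch-a` as the candidate root cause, so the operations team sees one incident instead of four separate alerts. The late `web-01` event falls outside the time window and is treated as a new occurrence.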

ServiceNow Predictive AIOps helps simplify the complexities associated with identification of metrics and their threshold values while minimizing alert fatigue. This significantly reduces time to identify as well as remediate critical issues while improving the operations team's efficacy.