IT Complexity And The Future Of AIOps Reboots

Karthik Sj, General Manager, AI at LogicMonitor. Built and scaled multiple 0-to-1 AI products across public, PE-backed and VC-backed companies.


The world came to a grinding halt on July 19, 2024, when 8.5 million Windows computers and devices received a faulty CrowdStrike update. The outage disrupted operations across the globe, affecting banks, hospitals, airlines and other critical businesses.

Such global IT outages reveal the vulnerabilities organizations face with modern-day IT infrastructure. In the rapidly evolving landscape of IT, organizations grapple with several significant challenges.

Challenges In Modern IT Infrastructure
Fragmented observability and monitoring tools: The first challenge IT organizations face is fragmented observability and monitoring tooling. Because each tool specializes in telemetry from one part of the IT stack, operators often need to piece together events from multiple tools and correlate them to separate the signal from the noise.

Cross-domain data management: The next challenge is managing data across domains, including IT operations (metrics, events, logs and traces, or MELT, plus incidents) and developer operations (changes, CMDB, knowledge bases, war room transcripts). Effective root cause analysis (RCA) often depends on data hidden within incidents or proprietary knowledge bases, which are not accessible to AI systems.

Complexity outpacing insights: The third challenge is that while IT and data complexity have grown significantly, the insights drawn from that data have not kept pace. In fact, companies have reported that metrics such as mean time to resolve (MTTR) have increased across the board, meaning it takes longer to detect, diagnose and troubleshoot issues. For example, during the CrowdStrike outage, it took a significant amount of time to get an RCA.
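To make the MTTR metric concrete, here is a minimal sketch of how it is commonly computed: the average of resolution time minus open time across incidents. The incident timestamps below are hypothetical illustration data, not figures from any real outage, and the function name is an assumption.

```python
from datetime import datetime, timedelta

# Hypothetical incident records as (opened, resolved) timestamps.
incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 11, 0)),   # 2 hours
    (datetime(2024, 1, 3, 14, 0), datetime(2024, 1, 3, 15, 0)),  # 1 hour
    (datetime(2024, 1, 8, 8, 0), datetime(2024, 1, 8, 14, 0)),   # 6 hours
]


def mean_time_to_resolve(records):
    """Return the average (resolved - opened) duration across incidents."""
    if not records:
        raise ValueError("need at least one incident")
    total = sum((resolved - opened for opened, resolved in records), timedelta())
    return total / len(records)


print(mean_time_to_resolve(incidents))  # 3:00:00
```

Tracking this number over time, rather than as a single snapshot, is what shows whether resolution is getting slower as complexity grows.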

These challenges underscore the need for a more integrated, AI-driven approach to IT operations management that can handle the increasing complexity and scale of modern IT environments.

Limitations With Traditional AIOps
While traditional AIOps solutions have existed in the market, they have largely failed to deliver on the promise of what AIOps was meant to achieve as a category.

Limitations include manual correlation and rule-based approaches, which do not scale at the same rate as IT complexity. Correlation itself is a table-stakes capability; where vendors differ is in how they deliver value from it, and traditional vendors are still relying on machine learning rules to get the job done.

Additionally, many of today's AIOps platforms lack the foundational observability pipelines needed to handle cross-domain, cross-modality data, which are crucial for enabling generative AI (GenAI) capabilities.

These limitations highlight the need for a new generation of AIOps solutions that leverage advanced AI techniques, particularly GenAI, to provide more adaptive, scalable and intelligent IT operations management.

Emergence Of GenAIxOps: Opportunities And Challenges
While GenAI presents exciting possibilities for IT operations, it’s crucial to approach this technology with a strategic mindset. The emerging field of “Generative AI cross-domain Operations” or “GenAIxOps” promises to redefine IT management, offering powerful new tools to address long-standing challenges in the industry.

GenAIxOps represents a paradigm shift in IT operations management, leveraging large language models (LLMs) and agentic architectures to provide more intelligent, context-aware and proactive IT management capabilities. This shift has the potential to revolutionize how IT teams operate, diagnose and troubleshoot complex issues.

There is an opportunity to optimize the workflow from individual alerts emitted by systems and infrastructure to a unified, correlated alert and incident. The resulting incident can then be summarized, root-caused and remediated with GenAI.
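As a rough illustration of the first half of that workflow, the sketch below groups individual alerts into incidents when they hit the same resource within a short time window. This is a deliberately simple heuristic, not any vendor's algorithm; the `Alert` and `Incident` types, the five-minute window and the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    source: str       # monitoring tool that emitted the alert
    resource: str     # affected host or service
    timestamp: float  # seconds since an arbitrary reference point
    message: str


@dataclass
class Incident:
    resource: str
    alerts: list = field(default_factory=list)


def correlate(alerts, window_seconds=300):
    """Group alerts on the same resource that arrive within
    window_seconds of the previous alert for that resource."""
    incidents = []
    latest = {}  # resource -> most recent Incident for that resource
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = latest.get(alert.resource)
        if (current is not None
                and alert.timestamp - current.alerts[-1].timestamp <= window_seconds):
            current.alerts.append(alert)
        else:
            current = Incident(resource=alert.resource, alerts=[alert])
            incidents.append(current)
            latest[alert.resource] = current
    return incidents


# Hypothetical alerts from three different tools.
sample = [
    Alert("metrics-tool", "db-01", 0, "CPU above 95%"),
    Alert("log-tool", "web-02", 30, "HTTP 500 spike"),
    Alert("log-tool", "db-01", 60, "slow query warnings"),
    Alert("apm-tool", "db-01", 120, "connection pool exhausted"),
    Alert("metrics-tool", "db-01", 1000, "disk usage above 90%"),
]

for incident in correlate(sample):
    print(incident.resource, len(incident.alerts))
```

In this sample, five alerts collapse into three incidents. Each incident, with its grouped alerts attached, is the unit that would then be handed to a GenAI step for summarization or root cause suggestions.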

GenAIxOps goes beyond traditional AIOps by not only analyzing and correlating data but also generating actionable insights, predicting potential issues and even suggesting remediation steps. This proactive approach can significantly reduce the likelihood of major outages like the CrowdStrike incident.

While the potential of GenAIxOps is significant, organizations should be mindful of a few things during implementation. Ensuring high-quality, representative data across all IT domains is crucial for effective GenAIxOps deployments. By weighing these considerations and developing strategies to address them, organizations can better position themselves to realize the transformative potential of GenAIxOps.

Trends And Use-Cases In GenAIxOps
With the emergence and maturity of GenAI in the enterprise, there are several opportunities for leaders to integrate it into the IT Operations workflow. Here are some of the use-cases and opportunities for AI that I’ve seen being explored.

AI-Generated Summary

Business leaders can use GenAI to turn complex system-generated alerts and esoteric technical jargon into human-readable summaries. Now, all receiving team members, including CIOs, can get caught up on major incidents.

AI-Suggested Root Cause

GenAI can suggest highly localized, accurate and context-specific root causes by combining extensive observability telemetry data with ITSM data, such as incidents, changes and information from emails, collaboration platforms and call transcripts. This enables rapid, AI-powered root cause analysis (RCA), reducing the time required from weeks to seconds.
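One simple way to picture the idea of combining telemetry with ITSM data is a pre-filter that finds recent changes on the affected resource before anything is passed to a model. The sketch below is only that filter, under stated assumptions: the change records, field names and one-hour lookback are hypothetical, and a real system would weigh far more signals.

```python
from datetime import datetime, timedelta

# Hypothetical change records, as might be exported from an ITSM tool.
changes = [
    {"id": "CHG-1", "resource": "db-01", "applied": datetime(2024, 1, 10, 3, 50)},
    {"id": "CHG-2", "resource": "web-02", "applied": datetime(2024, 1, 10, 3, 55)},
    {"id": "CHG-3", "resource": "db-01", "applied": datetime(2024, 1, 9, 9, 0)},
]


def candidate_changes(resource, incident_start, change_log,
                      lookback=timedelta(hours=1)):
    """Return changes applied to `resource` within `lookback` before the
    incident started, most recent first."""
    hits = [
        change for change in change_log
        if change["resource"] == resource
        and timedelta(0) <= incident_start - change["applied"] <= lookback
    ]
    return sorted(hits, key=lambda change: change["applied"], reverse=True)


suspects = candidate_changes("db-01", datetime(2024, 1, 10, 4, 0), changes)
print([change["id"] for change in suspects])  # ['CHG-1']
```

The short list of suspect changes, together with the incident's alerts, is the kind of context that could be supplied to an LLM when asking for a suggested root cause.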

AI Remediation

Large language models (LLMs) are great at generating text, and it is now possible to generate step-by-step playbooks or runbooks on the fly once the issue has been identified and diagnosed.

AI Assistant

An AI assistant enables all users, especially non-technical users, to query observability data through a conversational interface. Questions can range from how-tos and simple or complex queries to advanced troubleshooting use-cases.

The benefits of applying these AI use-cases are massive. Customers are experiencing reduced IT costs for their ITOps teams, starting with improved alert compression and fewer ITSM incidents. Additionally, GenAI increases productivity by suggesting root causes and remediations, which reduces mean time to resolve, thereby unlocking operational efficiencies for IT Operations teams.
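Alert compression can be expressed as a simple ratio. The sketch below assumes one common definition, the share of raw alerts that did not become separate incidents after correlation; the function name and the sample numbers are hypothetical.

```python
def alert_compression(raw_alerts: int, incidents: int) -> float:
    """Fraction of raw alerts eliminated by correlating them into incidents."""
    if raw_alerts <= 0:
        raise ValueError("raw_alerts must be positive")
    if not 0 <= incidents <= raw_alerts:
        raise ValueError("incidents must be between 0 and raw_alerts")
    return 1 - incidents / raw_alerts


# Hypothetical example: 1,000 raw alerts correlated into 50 incidents.
print(f"{alert_compression(1000, 50):.0%}")  # 95%
```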

The ROI of GenAIxOps extends beyond cost savings, encompassing improved system reliability and increased agility in responding to IT challenges. Organizations implementing GenAIxOps can expect a significant reduction in major incidents and faster MTTR.

Summary
With the evolution of IT architectures, the explosion of data growth and the injection of powerful technologies like GenAI, companies need to adapt to new workflows. Categories like GenAIxOps can help create these workflows, unlocking massive operational efficiencies for organizations. We are seeing that GenAI is past the peak of its hype cycle, and customers should embrace a phased approach to redeploying existing and traditional AIOps tools to pave the way for modern hybrid observability.
Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.

{Categories} _Category: Takes{/Categories}
{URL}https://www.forbes.com/councils/forbestechcouncil/2024/08/23/it-complexity-and-the-future-of-aiops-reboots/{/URL}
{Author}Karthik Sj, CommunityVoice{/Author}
{Image}https://imageio.forbes.com/specials-images/imageserve/65ce4afb61927bb2bcdf683f/0x0.jpg?format=jpg&height=600&width=1200&fit=bounds{/Image}
{Keywords}Innovation,/innovation,Innovation,/innovation,technology,standard{/Keywords}
{Source}POV{/Source}
{Thumb}{/Thumb}
