The State of AI in Observability Today

A Brief Round-Up

The recent advances in AI and Machine Learning open up enormous possibilities for the observability sector. Observability backends ingest vast amounts of telemetry and therefore have an ocean of raw data to mine for rich analytics and diagnostics. In addition, LLMs themselves have become first-class IT citizens in many organisations and are therefore also a subject for observability. Indeed, OpenTelemetry have now released a set of semantic conventions for working with Generative AI.

Observability is one of the most dynamic and competitive sectors in the IT industry, and it is no surprise that vendors have begun to incorporate AI features into their platforms. AI is a broad term that is used in different ways; here it covers both machine learning and generative AI. In this article we will briefly survey a number of leading platforms and products in the observability market and look at some of the ways they have incorporated AI capabilities.

IBM Instana - Intelligent Remediation

IBM has a long and illustrious track record in the fields of AI and Machine Learning - including the 1997 victory of Deep Blue over chess grandmaster Garry Kasparov and IBM Watson’s winning turn on the US TV quiz Jeopardy! in 2011. Watson has now been superseded by watsonx, and this is the engine which powers the Intelligent Remediation feature in IBM’s Instana observability platform. Intelligent Remediation is a preview technology which continuously monitors a system for faults and anomalies. As well as drawing upon system telemetry, it uses expert knowledge for causal analysis and then suggests remediations. The remediations can be implemented using pre-built actions selected from a catalogue. Beyond the Remediation feature, Instana also has AI-driven capabilities for summarising, diagnostics and making recommendations.

Logz.io - Anomaly Detection

Logz.io is a popular full-stack observability platform built on top of open source technologies such as OpenSearch, Prometheus and Jaeger. Although their platform has been equipped with AI tooling for some time, the company are careful not to overstate the current capabilities of AI. They recognise that AI can assist in areas such as reducing noise and summarising incidents, but they make no claims about causal analysis or remediation. You can learn more about the company’s AI posture from this illuminating webinar.

The Logz.io platform ships with an Observability IQ Assistant, which harnesses AI to support natural language querying and chat-based analytics on your telemetry data. The most powerful AI feature in the platform, though, is probably the Anomaly Detection tooling that is integrated into the App 360 module. One problem with anomaly detection is that it is not business-aware and, if not applied carefully, may simply create more alert fatigue. To combat this, Logz.io Anomaly Detection allows users to target critical services and take a more SLO-driven approach.
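
To make the idea of targeted, SLO-driven anomaly detection more concrete, here is a minimal sketch in Python. It is not Logz.io's implementation: the `ServiceWindow` type, the `critical` flag, the z-score threshold and the SLO figures are all assumptions made for illustration. A latency sample is flagged with a simple z-score test, but an alert is only raised when the service is marked as critical and is also outside its error-rate SLO.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class ServiceWindow:
    name: str
    critical: bool             # hypothetical flag: is the service business-critical?
    latencies_ms: list[float]  # recent baseline samples
    latest_ms: float           # newest sample to evaluate
    error_rate: float          # observed error ratio over the window
    slo_error_rate: float      # maximum error ratio the SLO allows


def is_anomalous(window: ServiceWindow, z_threshold: float = 3.0) -> bool:
    """Plain z-score test of the newest sample against the baseline."""
    if len(window.latencies_ms) < 2:
        return False  # not enough data to estimate a spread
    sigma = stdev(window.latencies_ms)
    if sigma == 0:
        return window.latest_ms != window.latencies_ms[0]
    z = (window.latest_ms - mean(window.latencies_ms)) / sigma
    return abs(z) > z_threshold


def should_alert(window: ServiceWindow) -> bool:
    """Alert only for critical services that are anomalous AND outside SLO."""
    return (
        window.critical
        and is_anomalous(window)
        and window.error_rate > window.slo_error_rate
    )
```

With this gating, a latency spike on a non-critical batch job is still detectable but does not page anyone, which is one simple way of keeping anomaly detection from adding to alert fatigue.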

Elastic - Supercharging Search

The ELK Stack has been at the forefront of the log aggregation and analytics space for many years. Despite controversies over licensing, Elastic remains hugely popular and influential. Highly powerful search capabilities are at the core of its product offering, and it is no surprise that this is a domain where Elastic seeks to differentiate itself from other platforms in terms of its AI tooling. It seems, though, that Elastic’s ambitions extend far beyond log searching and that it is positioning itself as a first-choice platform for advanced corporate data analytics.

The centrepiece of this vision is the Search AI Lake, which incorporates RAG, search and security functions and is built on a cloud-native architecture. The company claims that this enables search over vast volumes of data at high speed and low cost. A quick glance at Elastic's latest financial report highlights the strategic importance of AI to the company: almost every item listed in the Product Innovations and Updates section is AI-related. Search is obviously an area with great potential for AI, and other vendors such as AWS have also incorporated AI into their search functionality. At the moment, this technology is still at the experimental stage, but when it matures, natural language search over telemetry data will be a huge win for making observability systems accessible and of value across the enterprise.

New Relic - Observability for LLMs

The AI revolution poses a two-fold challenge for observability vendors. As well as harnessing AI to create more powerful systems, they also need to extend their functional scope to provide insights into the LLM functionality that customers are building into their systems. New Relic were the first major vendor to add LLM monitoring to their stack - although Datadog and Elastic have now followed suit. The New Relic AI monitoring product will check for “bias, toxicity, and hallucinations” as well as identifying processing bottlenecks and scanning for potential vulnerabilities. As well as the usual APM signals, the tool also captures AI-specific metrics such as response quality and token counts. Sending data to LLMs can obviously represent a potential security issue, so the system also includes safeguards for protecting sensitive data.
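
As a rough illustration of what capturing token counts and protecting sensitive data can look like, the following Python sketch wraps an LLM call, masks email addresses before the prompt leaves the process, and records a handful of attributes. It is not New Relic's implementation. The attribute names follow the OpenTelemetry Generative AI semantic conventions as we understand them at the time of writing (they are still evolving), while `observe_llm_call`, `fake_llm`, the `duration_s` and `prompt` keys, the email regex and the whitespace-based token count are all assumptions made for the example.

```python
import re
import time
from typing import Callable

# Deliberately simple pattern; real PII detection needs far more than this.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Mask email addresses before the text leaves the process."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


def observe_llm_call(model: str, prompt: str, call: Callable[[str], str]) -> dict:
    """Run `call` on the redacted prompt and return telemetry attributes."""
    safe_prompt = redact(prompt)
    start = time.perf_counter()
    completion = call(safe_prompt)
    duration_s = time.perf_counter() - start
    return {
        "gen_ai.request.model": model,
        # Whitespace splitting is a crude stand-in for a real tokenizer.
        "gen_ai.usage.input_tokens": len(safe_prompt.split()),
        "gen_ai.usage.output_tokens": len(completion.split()),
        "duration_s": duration_s,  # illustrative key, not a convention name
        "prompt": safe_prompt,     # only the redacted prompt is recorded
    }


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model client."""
    return "Ticket summarised for the support team"
```

In a real system the returned dictionary would be attached to a span or emitted as metrics rather than handled as a plain `dict`.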

Grafana - AI-Powered Flame Graph Reader

Grafana's initial application of AI to their stack concentrated on reducing toil and providing 'delight' for the user. This entailed functionality such as generating incident summaries or providing automated suggestions for names and titles of panels and other objects. Recently though, they have started to ramp up their AI features. One of the most notable of these is AI-powered insights for continuous profiling. Flame graphs are a great tool. They can, however, be visually very dense, and it can take some time to unpack all the data and identify root causes and bottlenecks. The Grafana Cloud Profiles tool now supports an AI-powered flame graph reader to speed up and simplify diagnostics and analysis.
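
To give a feel for why a flame graph is a good candidate for automated reading, here is a small Python sketch. It says nothing about how Grafana's feature works internally. It assumes the profile is available in the widely used "folded" stack format (semicolon-separated frames followed by a sample count), and the function names are hypothetical. It boils a dense profile down to the few frames with the most self time, which is the kind of compact digest that could then be handed to a person or a model.

```python
from collections import Counter


def self_samples(folded_lines: list[str]) -> Counter:
    """Count samples attributed to the leaf (top-of-stack) frame of each line."""
    totals: Counter = Counter()
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        # "main;handler;parse 30" -> stack "main;handler;parse", count "30"
        stack, _, count = line.rpartition(" ")
        leaf = stack.split(";")[-1]
        totals[leaf] += int(count)
    return totals


def hottest_frames(folded_lines: list[str], top_n: int = 3) -> list[tuple[str, float]]:
    """Return the top frames with their share of all samples, as percentages."""
    totals = self_samples(folded_lines)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return [
        (frame, round(100 * samples / grand_total, 1))
        for frame, samples in totals.most_common(top_n)
    ]
```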

Causely - Causal AI

Whilst most of the systems in this review harness AI to complement their existing stack, Causely is built on AI from the ground up. As the name suggests, it uses Causal AI - built on expert systems knowledge - to carry out root cause analysis as well as predictive diagnostics. This contrasts with most other systems, whose root cause analysis is actually powered by correlation and inference - which are less reliable approaches. Causely is not a full-stack system. Instead, it plugs into your existing stack. If you are interested in digging deeper into Causely and causal AI then take a look at our recent feature article.

Open Source

It is not only the large vendors who are harnessing AI capabilities to build new products and features. K8sGPT is an open source project that aims to ease the burden on Kubernetes admins by tapping into AI backends for assistance with diagnostics. Like much AI-based tooling, it works as a co-pilot rather than an autonomous operator. The tool is built on a set of Analysers, which map to Kubernetes resources such as pods, nodes and services and continually scan your cluster, looking for errors. It then sends a digest of the error context to the backend AI (it doesn’t have to be OpenAI) and presents the potential fixes to the user.
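
The analyser-plus-backend pattern described here can be sketched in a few lines of Python. This is purely illustrative and is not K8sGPT's code or API: the `Finding` type, the analyser functions, the shape of the cluster snapshot and the `EchoBackend` stand-in are all hypothetical. Each analyser inspects one kind of resource, the findings are condensed into a text digest, and the digest is passed to whichever backend has been plugged in.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class Finding:
    kind: str
    name: str
    error: str


class Backend(Protocol):
    """Anything that can turn an error digest into suggested fixes."""
    def suggest(self, digest: str) -> str: ...


def analyse_pods(snapshot: dict) -> list[Finding]:
    """Flag pods whose status is not 'Running'."""
    return [
        Finding("Pod", pod["name"], f"status is {pod['status']}")
        for pod in snapshot.get("pods", [])
        if pod["status"] != "Running"
    ]


def analyse_services(snapshot: dict) -> list[Finding]:
    """Flag services that have no ready endpoints."""
    return [
        Finding("Service", svc["name"], "no ready endpoints")
        for svc in snapshot.get("services", [])
        if svc["endpoints"] == 0
    ]


# One analyser per resource kind; more can be appended to this list.
ANALYSERS: list[Callable[[dict], list[Finding]]] = [analyse_pods, analyse_services]


def scan(snapshot: dict) -> list[Finding]:
    """Run every registered analyser over the cluster snapshot."""
    findings: list[Finding] = []
    for analyser in ANALYSERS:
        findings.extend(analyser(snapshot))
    return findings


def explain(findings: list[Finding], backend: Backend) -> str:
    """Build a digest of the findings and ask the backend for suggestions."""
    digest = "\n".join(f"{f.kind}/{f.name}: {f.error}" for f in findings)
    return backend.suggest(digest)


class EchoBackend:
    """Stand-in for an AI backend; a real one would call a model API."""
    def suggest(self, digest: str) -> str:
        return f"Review these issues:\n{digest}"
```

Because the backend is only required to offer a `suggest` method, swapping one AI provider for another does not touch the analysers, and the output remains a suggestion for the operator rather than an action taken on the cluster.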

Langtrace AI is an open source tool offering observability for LLM apps. It can be self-hosted, but there is also a SaaS version of the product. It provides full OpenTelemetry tracing support as well as metrics around costs, accuracy and latency. It offers support for the Pinecone and ChromaDB vector databases and integrates with OpenAI and Anthropic LLMs. There is also an integration for viewing your traces in SigNoz. There is an ambitious list of new features on the project’s backlog and it is likely to evolve quickly.

Conclusion

It is still early days in terms of the incorporation of AI capabilities in observability systems. However, there are a few trends we can see emerging and AI features seem to be coalescing around:

  • reducing toil
  • causal analysis/anomaly detection
  • natural language search
  • LLM observability

Some vendors, such as Cleric, have made some very bold claims about creating an AI Site Reliability Engineer and others have spoken about "closing the loop" but, in reality, this is a long way off. The changes we are seeing are incremental rather than transformational. Innovations such as natural language search will make tasks such as querying easier and more accessible, but these are assistive technologies. As the author of a recent article on the Incident.io blog put it, AI is best envisaged as an exoskeleton rather than a robot.
