The State of AI in Observability Today

A Brief Round-Up

The recent advances in AI and Machine Learning open up enormous possibilities for the observability sector. Observability backends ingest vast amounts of telemetry and therefore have an ocean of raw data to mine for rich analytics and diagnostics. In addition, LLMs themselves have become first-class IT citizens in many organisations and are therefore also a subject for observability. Indeed, OpenTelemetry has now released a set of semantic conventions for working with Generative AI.

Observability is one of the most dynamic and competitive sectors in the IT industry and it is no surprise that vendors have begun to incorporate AI features into their platforms. AI is a broad term and is used in different ways - it generally covers both machine learning and generative AI. In this article we will briefly survey a number of leading platforms and products in the observability market and look at some of the ways they have incorporated AI capabilities into their products.

IBM Instana - Intelligent Remediation

IBM has a long and illustrious track record in the fields of AI and Machine Learning - including the 1997 victory of Deep Blue over chess grandmaster Garry Kasparov and IBM Watson’s winning turn on the US TV quiz Jeopardy! in 2011. Watson has now been superseded by watsonx, and this is the engine which powers the Intelligent Remediation feature in IBM’s Instana observability platform. Intelligent Remediation is a preview technology which continuously monitors a system for faults and anomalies. As well as drawing upon system telemetry, it also uses expert knowledge for causal analysis and then suggests remediations. The remediations can be implemented using pre-built actions selected from a catalogue. As well as the Remediation feature, Instana also has AI-driven capabilities for summarisation, diagnostics and recommendations.

Logz.io - Anomaly Detection

Logz.io is a popular full-stack observability platform built on top of open source technologies such as OpenSearch, Prometheus and Jaeger. Whilst their platform has been equipped with AI tooling for some time, the company are careful not to overplay the current capabilities of AI. Whilst they recognise that AI can assist in areas such as reducing noise and summarising incidents, they do not make any claims in terms of causal analysis or remediation. You can learn more about the company’s AI posture from this really illuminating webinar.

The Logz.io platform ships with an Observability IQ Assistant, which harnesses AI to support natural language querying and chat-based analytics on your telemetry data. The most powerful AI feature in the Logz.io platform, though, is probably the Anomaly Detection tooling that is integrated into the App 360 module. One problem with anomaly detection is that it is not business-aware and, if not applied carefully, it may end up creating yet more alert fatigue. To combat this, Logz.io Anomaly Detection allows users to target critical services and take a more SLO-driven approach.
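To make the idea of targeted anomaly detection more concrete, here is a minimal Python sketch. It is not Logz.io's implementation; it simply assumes a rolling z-score detector and a hypothetical list of critical services, so that spikes on non-critical services never become alerts. The function names, service names and thresholds are all illustrative.

```python
from statistics import mean, stdev


def find_anomalies(series, window=5, threshold=3.0):
    """Return indices whose value sits more than `threshold` standard
    deviations away from the mean of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            continue  # flat history: no meaningful z-score, so skip
        if abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


def alerts_for_critical_services(latencies_by_service, critical_services):
    """Run detection only for services flagged as critical (hypothetical
    flag) so that noise from other services does not add to alert fatigue."""
    return {
        name: find_anomalies(latencies_by_service[name])
        for name in critical_services
        if name in latencies_by_service
    }


# Illustrative latency samples in milliseconds (made-up data).
latencies = {
    "checkout": [100, 102, 98, 101, 99, 100, 400, 101],
    "batch-report": [50, 52, 48, 51, 49, 50, 900, 51],
}
alerts = alerts_for_critical_services(latencies, ["checkout"])
```

Both services contain a spike, but only the critical one is evaluated. A production system would tie the threshold to an SLO rather than a fixed z-score.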

Elastic - Supercharging Search

The ELK Stack has been at the forefront of the log aggregation and analytics space for many years. Despite controversies over licensing, Elastic is still a hugely popular and influential product. Highly powerful search capabilities are at the core of its product offering and it is no surprise that this is a domain where Elastic seeks to differentiate itself from other platforms in terms of its AI tooling. It seems though, that Elastic’s ambitions extend far beyond log searching and it is positioning itself as a first-choice platform for advanced corporate data analytics.

The centrepiece of this vision is the Search AI Lake, which incorporates RAG, search and security functions and is built on a cloud-native architecture. The company claims that this enables search over vast volumes of data at high speed and low cost. A quick glance at Elastic's latest financial report really highlights the strategic importance of AI to the company. Pretty much every item listed in the Product Innovations and Updates section is AI-related. Search is obviously an area with great potential for AI and other vendors such as AWS have also incorporated AI into their search functionality. At the moment, this technology is still at the experimental stage, but when it matures natural language search over telemetry data will be a huge win for making observability systems accessible and of value across the enterprise.

New Relic - Observability for LLMs

The AI revolution poses a two-fold challenge for observability vendors. As well as harnessing AI to create more powerful systems, they also need to extend their functional scope to provide insights into the LLM functionality that customers are building into their systems. New Relic were the first major vendor to add LLM monitoring to their stack - although Datadog and Elastic have now followed suit. The New Relic AI monitoring product will check for “bias, toxicity, and hallucinations” as well as identifying processing bottlenecks and scanning for potential vulnerabilities. As well as the usual APM signals, the tool also captures AI-specific metrics such as response quality and token counts. Sending data to LLMs can obviously represent a potential security issue, so the system also includes safeguards for protecting sensitive data.
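As a rough illustration of what capturing AI-specific signals involves, the Python sketch below wraps a model call and records latency and token counts. It is not New Relic's agent. The `fake_llm` client is a hypothetical stand-in, the whitespace token count is a crude assumption (real tokenisers differ), and the attribute names are modelled on the OpenTelemetry GenAI semantic conventions, so they should be checked against the current specification.

```python
import time


def fake_llm(prompt):
    """Hypothetical stand-in for a real model client."""
    return "Observability is the ability to understand a system from its outputs."


def observed_call(llm, prompt, model_name):
    """Call the model and return (response, telemetry record)."""
    start = time.perf_counter()
    response = llm(prompt)
    elapsed = time.perf_counter() - start
    record = {
        # Attribute names assumed from the OpenTelemetry GenAI conventions.
        "gen_ai.request.model": model_name,
        # Whitespace splitting is only an approximation of token counting.
        "gen_ai.usage.input_tokens": len(prompt.split()),
        "gen_ai.usage.output_tokens": len(response.split()),
        "duration_seconds": elapsed,
    }
    return response, record


response, record = observed_call(fake_llm, "What is observability?", "example-model")
```

In a real deployment the record would be attached to a trace span or emitted as metrics, and the prompt would pass through a redaction step before leaving the process.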

Grafana

Grafana's initial application of AI to their stack concentrated on reducing toil and providing 'delight' for the user. This entailed functionality such as generating incident summaries or providing automated suggestions for names and titles of panels and other objects. Recently, though, they have started to ramp up their AI features. One of the most notable of these is AI-powered insights for continuous profiling. Flame graphs are a great tool. They can, however, be visually very dense, and it can take some time to unpack all the data and identify root causes and bottlenecks. The Grafana Cloud Profiles tool now supports an AI-powered flame graph reader to speed up and simplify diagnostics and analysis.

Causely

Whilst most of the systems in this review harness AI to complement their existing stack, Causely is built on AI from the ground up. As the name suggests, it uses Causal AI - built on expert systems knowledge - to carry out root cause analysis as well as predictive diagnostics. This contrasts with most other systems whose root cause analysis is actually powered by correlation and inference - which are less reliable approaches. Causely is not a full-stack system; instead, it plugs in to your existing stack. If you are interested in digging deeper into Causely and causal AI then take a look at our recent feature article.

Open Source

It is not only the large vendors who are harnessing AI capabilities to build new products and features. K8sGPT is an open source project aiming to ease the burden for K8S admins by tapping into AI backends for assistance with diagnostics. Like much AI-based tooling, it works as a co-pilot rather than an autonomous operator. The tool is built on a set of Analysers which map to K8S resources such as pods, nodes, services etc and continually scan your cluster, looking for errors. It then sends a digest of the error context to the backend AI (it doesn’t have to be OpenAI) and presents the potential fixes to the user.
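The scan, digest and suggest flow described above can be sketched in a few lines of Python. This is not K8sGPT's code or API. The analyser functions, the dictionary shapes used for cluster resources and the `backend` callable are all hypothetical, chosen only to show how per-resource analysers feed a digest to a pluggable AI backend.

```python
def analyse_pods(pods):
    """Flag pods that are not in the Running phase."""
    return [
        f"Pod {p['name']} is {p['phase']}: {p.get('reason', 'unknown reason')}"
        for p in pods
        if p["phase"] != "Running"
    ]


def analyse_services(services):
    """Flag services with no ready endpoints behind them."""
    return [
        f"Service {s['name']} has no ready endpoints"
        for s in services
        if s["ready_endpoints"] == 0
    ]


# One analyser per resource kind, mirroring the description in the text.
ANALYSERS = {"pods": analyse_pods, "services": analyse_services}


def build_digest(cluster):
    """Run every analyser over a snapshot of the cluster."""
    findings = []
    for kind, analyser in ANALYSERS.items():
        findings.extend(analyser(cluster.get(kind, [])))
    return findings


def suggest_fixes(cluster, backend):
    """Send each finding to the AI backend and collect its suggestions.
    `backend` is any callable taking a finding and returning text."""
    return {finding: backend(finding) for finding in build_digest(cluster)}
```

Because the backend is just a callable, swapping one AI provider for another does not touch the analysers, and the suggestions are returned to the user rather than applied automatically.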

Langtrace AI is an open source tool offering observability for LLM Apps. It can be self-hosted, but there is also a SaaS version of the product. It provides full OpenTelemetry tracing support and also provides metrics around costs, accuracy and latency. It offers support for the Pinecone and ChromaDB vector databases and integrates with OpenAI and Anthropic LLMs. There is also an integration for viewing your traces in SigNoz. There is an ambitious list of new features on the project’s backlog and it is likely to evolve quickly.

Conclusion

It is still early days in terms of the incorporation of AI capabilities in observability systems. However, there are a few trends we can see emerging and AI features seem to be coalescing around:

reducing toil
causal analysis/anomaly detection
natural language search
LLM observability

Some vendors, such as Cleric, have made some very bold claims about creating an AI Site Reliability Engineer and others have spoken about "closing the loop" but, in reality, this is a long way off. The changes we are seeing are incremental rather than transformational. Innovations such as natural language search will make tasks such as querying easier and more accessible, but these are assistive technologies. As the author of a recent article on the Incident.io blog put it, AI is best envisaged as an exoskeleton rather than a robot.


From the web

Articles we like from observability web sites and blogs

It’s eBPF for Windows!
Scorpio Software blog Mar 21, 2025
It’s an announcement that might have seemed unthinkable not long ago, but the porting of the revolutionary eBPF technology to Windows is now a reality. The ability to bring safe programmability to the kernel has resulted in enormous gains in fields such as security, networking and observability for Linux hosts, so applying the same principle to the Windows ecosystem is obviously an attractive proposition. It is not, though, without its own difficulties. There were a lot of hurdles to overcome and, inevitably, given the differences in OS architecture, this is not a full-fidelity replica of the Linux implementation.

This possibly foundational article by Pavel Yosifovich guides you through the steps involved in boldly going where few have gone before and creating your first eBPF program for Windows. One paragraph in the article begins with the sentence “this is where things get a bit hairy” - for some that will likely be a challenge rather than a deterrent. This may not be cooking up nuclear fusion in your bedroom, but it does feel pretty radical.
Inside The C++ Black Box
Elastic blog Mar 10, 2025
As well as rolling out their Open AI observability solution, Elastic have also been very active within the OpenTelemetry project. C++ has a reputation for being something of a fearsome foe for observability practitioners. In this article on the Elastic blog, Haidar Braimaanie dons his protective gear and attempts to tame the beast with a soothing dose of OpenTelemetry instrumentation.

Unlike languages built on frameworks such as .NET, C++ does not have a standardized runtime environment that supports dynamic instrumentation across all platforms and compilers. C++ also uses a variety of build systems such as Makefiles and CMake, so implementing instrumentation can be difficult and error-prone. In the article, Haidar looks at adding OpenTelemetry support to a C++ application running on Ubuntu 22.04. He also includes sample code for instrumenting the project with database spans and then observing the application in APM.

After reading this article you may want to give the C++ developer in your life a hug.
Brendan Gregg - His Latest Flame
Brendan Gregg Blog Dec 19, 2024
Even if you are not familiar with the name of Brendan Gregg, you are almost certainly familiar with the fruits of his labours. Brendan is the creator of the Flame Graph - one of the most important and iconic visualisations in the observability toolkit. We featured the Flame Graph in our recent tribute to the work of UX designers in the observability arena - but you should also visit Brendan’s web site.

Brendan’s latest innovation is the AI Flame Chart. This is an evolution of the original flame graph and its ambitious aim is to help reduce the vast financial and environmental costs entailed in the use of LLMs. This means that whereas the original flame graph was focused on CPU cycles, the latest generation sets its sights on reducing GPU load. The article discusses the considerable complexities involved in mapping GPU programs back to their corresponding CPU stacks. The names of some of the instruction sets look intimidating to the uninitiated but the basic concept of the graph is quite simple - the wider the bar, the more resource it consumes.
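The "wider bar, more resource" idea can be shown with a short Python sketch. It assumes input in the semicolon-separated "folded stack" form commonly used to feed flame graph tools (a stack followed by a sample count). The function name and the sample data are illustrative and are not taken from the article.

```python
from collections import Counter


def frame_widths(folded_stacks):
    """Return the relative width (0.0 to 1.0) of every frame in the graph.

    Each input line looks like "main;train;matmul 60": a call stack from
    root to leaf, then the number of samples observed in that stack.
    A frame's width is the share of all samples that passed through it.
    """
    totals = Counter()
    grand_total = 0
    for line in folded_stacks:
        stack, count_text = line.rsplit(" ", 1)
        count = int(count_text)
        grand_total += count
        frames = stack.split(";")
        # Credit the samples to every ancestor on the path, not just the leaf.
        for depth in range(1, len(frames) + 1):
            totals[";".join(frames[:depth])] += count
    return {path: samples / grand_total for path, samples in totals.items()}


# Made-up sample counts for illustration.
widths = frame_widths([
    "main;load 20",
    "main;train;matmul 60",
    "main;train;softmax 20",
])
```

Sorting the result by width points straight at the frames that consume the most resource, whether the samples represent CPU cycles or GPU load.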
System Initiative - IaC Reinvented!
System Initiative Blog Dec 19, 2024
If you have ever had to grapple with a 3,000 line Helm chart to deploy your observability infrastructure, you may be forgiven for thinking that there must be a better way to do this. Whilst YAML has a certain formal elegance, its syntax struggles to express the architectures and relationships embedded in highly complex systems.

Whilst Pulumi have tackled this problem by enabling the use of high level programming languages for IaC, System Initiative are taking a fundamentally more radical approach. Their goal is nothing other than completely reinventing IaC from the ground up. The blog article for the launch of the product is an incredibly ambitious statement of intent. The terms ‘game changer’ and ‘paradigm shift’ tend to be thrown around somewhat casually, but this might be a case where their usage is appropriate.

So, what are they proposing? Well, System Initiative is IaC without the code. It is a kind of digital canvas where you manipulate digital twins of your systems. Is the future here or is this the Platform Engineering equivalent of science fiction? Read the article and decide for yourself!
How Zomato Souped Up Their Metrics With VM
Zomato Blog Sep 14, 2024
Zomato is a restaurant aggregator and food delivery service that generates vast volumes of metrics. As their company grew, they adopted a Prometheus/Thanos-based architecture - running some 144 Prometheus servers. As metrics volumes continued to skyrocket, even this architecture started to creak and the Zomato SRE team began the search for an alternative solution.

In this article on the Zomato blog, the team discuss why they opted to migrate to Victoria Metrics as well as discussing a number of features of the system which enable them to achieve better performance, lower costs and greater scalability.

The technical challenges were pretty daunting - the project involved migrating over 800 dashboards, 300 microservices and 2.2 billion active time series. We would commend this article not just for its technical insights but also for taking a warts-and-all approach in documenting some of the technical limitations of the VM solution.
Obirdability - Fowl Play With Grafana!!
Grafana Blog Jul 29, 2024
Grafana dashboards have been put to all sorts of uses over the years - for everything from space missions to monitoring milk production. In this fun but highly informative article Ivana Huckova and Sven Grossman walk us through building an observability system for bird song. Whilst this might sound slightly quirky, the techniques could be applied to all manner of applications which need to record and analyse audio inputs.

The article is a great showcase for a number of Grafana capabilities - including installing Alloy on a Raspberry Pi and adding context to Dashboard data by dynamically querying sources such as Wikipedia and the Open-Meteo weather information service.
Internal Observability at Uber
Uber Blog Jun 10, 2024
Stories about Uber architecture always seem to be interesting, not least because they always involve technology at huge scale - such as this trillion record migration from DynamoDB. This article, however, is actually interesting on a number of levels. As well as being of technical interest, it also provides some fascinating insight into internal team topologies and management processes - which are also fundamentally important aspects of managing observability at scale. Whilst most organisations will only operate at a fraction of Uber’s scale, every organisation is seeking to minimise costs and improve service to users, and the article provides a number of insights which would be of interest to most observability practitioners.
Observability Principles for ML Models
Datadog Blog May 16, 2024
A survey carried out by McKinsey in 2021 found that 57% of respondents were already using Machine Learning to support at least one business function. ML is no longer a niche concern but is becoming a core component of development and CI/CD practices. As this post from the Datadog blog notes, the efficacy of ML models will inevitably degrade over time, so monitoring their performance and reliability is critical. The article really drives home the point that ML is a domain with its own specific behaviours, and effective monitoring requires building out new processes, metrics and even infrastructure to cover issues such as Data Drift, Prediction Drift and Concept Drift. Whilst the article does use some specialist terms, it is a highly readable and practical guide to the subject of ML monitoring.
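As a small illustration of what a Data Drift check might look like, the Python sketch below computes a Population Stability Index (PSI) between a training-time sample of a feature and a live sample. This is one common technique and is not taken from the Datadog article. The bin count, the sample values and the 0.2 alert threshold (a widely quoted rule of thumb) are assumptions.

```python
import math


def population_stability_index(baseline, live, bins=4):
    """Compare two samples of one feature; larger values mean more drift."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so live values outside the baseline range land in the
            # first or last bin instead of raising an IndexError.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    base_p, live_p = proportions(baseline), proportions(live)
    return sum((l - b) * math.log(l / b) for b, l in zip(base_p, live_p))


# Made-up feature values for illustration.
training_sample = [1, 2, 3, 4, 5, 6, 7, 8]
production_sample = [6, 7, 7, 8, 8, 8, 9, 9]
psi = population_stability_index(training_sample, production_sample)
drift_detected = psi > 0.2  # rule-of-thumb threshold, tune per feature
```

Prediction Drift can be checked the same way by feeding in model outputs instead of input features. Concept Drift is harder, since it needs ground-truth labels to compare against.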
Observing Observe with Observe
Observe Apr 13, 2024
It sounds like it could be a sub-plot in the film Inception, but this is a really interesting article from the Observe blog on how they use an instance of their Observe system to monitor their Observe cloud platform. Observe not only have to support fast reads for complex user queries, they also have to support ingesting one petabyte of telemetry per day. Kafka and Snowflake form two of the pillars of the backend architecture. This three-part series offers a fascinating insight into Observe’s own internal observability strategy as well as being a great exemplar of the "eat your own dog food" principle. This is an article which is of great value to anybody with an interest in large-scale observability architectures.
The $1m Line of Code
InfoQ Apr 5, 2024
Most of us have experienced the anguish of bill shock at some point: being hit with a huge bill for mobile roaming charges on return from a holiday, or getting a penalty notice for an inadvertent motoring infringement that happened weeks back. Those are just small pinpricks though, compared to the 50,000 volts of financial burn felt by companies mentioned in this transcript of a scintillating talk by Erik Peterson, CEO of CloudZero. He argues, persuasively, that engineering decisions are buying decisions. In the case mentioned in the headline, a decision to turn on one section of debug code led to vast volumes of logs being emitted and racking up over $1m in costs.
An Engineer's Personal Retrospective
CEP Mar 9, 2024
This is a really engaging blog post by Infrastructure Engineer Jack Lindamood, where he reviews nearly every infrastructure decision he made over four years working at a start-up. Each choice is graded with a Regret, Endorse or an occasional Unsure. Whilst not explicitly observability-related, it will nonetheless resonate with any engineer forced to make technological choices (which is probably all of us). The article contains much distilled wisdom and some strong opinions, as well as general observations on the challenges and trade-offs faced by infrastructure engineers.
Finding relationships in your data with embeddings
Medium Feb 8, 2024
The RAG pattern has really gained traction over the past year as it allows enterprises to leverage the power of LLMs to gain insights into their own data. This is a fascinating (and occasionally technical) article which details how Incident.io used vector embeddings to mine through their data and discover related incidents. The article explains the techniques involved with great clarity and provides really helpful advice on creating embeddings to find hidden patterns in your own data.
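The core of the technique, ranking items by how close their embeddings are, can be sketched in plain Python. This is not the code from the article. The incident IDs and the tiny three-dimensional vectors are made up, and in practice the vectors would come from an embedding model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def related_incidents(query_id, embeddings, top_n=2):
    """Rank all other incidents by similarity to the query incident."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in embeddings.items()
        if other_id != query_id
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical embeddings: INC-1 and INC-2 describe similar failures.
incident_embeddings = {
    "INC-1": [0.9, 0.1, 0.0],
    "INC-2": [0.8, 0.2, 0.1],
    "INC-3": [0.0, 0.1, 0.9],
}
closest = related_incidents("INC-1", incident_embeddings, top_n=1)
```

At larger volumes the linear scan would typically be replaced by a vector database or an approximate nearest-neighbour index.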
How Chick-fil-A Run 2,800 Edge Clusters
Medium Dec 29, 2023
When you think of large corporations pushing the technology envelope, Chick-fil-A might not be the first name to come to mind. However, the highly distributed nature of their infrastructure presents massive observability challenges, which they have met with some very impressive engineering. The scale of their task is daunting - 2,800 Edge Kubernetes clusters, tens of thousands of IoT devices and billions of MQTT messages each month. This is a really fascinating article on managing IoT observability at scale.
Production-Ready Observability Platform for AI Systems
Medium Nov 3, 2023
In this blog article, Bijit Ghosh of Deutsche Bank discusses best practices for observability across the full AI system lifecycle. He composes a custom system which knits together a range of technologies including structlog, Flask, Prometheus and Kibana as well as AI-specific tools such as MLflow and CausalML. It’s a comprehensive article which exhibits a clear understanding of both observability and AI technologies.
Infrastructure Monitoring with the TIG Stack
CNCF Blog Sept 21, 2023
A great example of managing the complexities of Observability engineering. Jay Taylor from InfluxDB builds out a solution using the Telegraf, InfluxDB, Grafana stack.
Deploying a Kubernetes monitoring stack
rtfm July 23, 2023
An in-depth look at monitoring K8S with the increasingly popular VictoriaMetrics platform. This follows an end-to-end process from crafting your own Helm chart to configuring alert rules.
"You're overpaying for OpenTelemetry's verbosity"
rtfm Oct 10, 2023
This has really raised a few eyebrows. A forensic analysis by Nikolay Sivko of coroot on how just a few OpenTel meta tags can potentially explode your ingestion fees.