Observability 360 Editor
Making sense of the AI revolution is not easy – not even for those who are leading it. When you are caught up in the middle of the whirlwind, gaining any sense of perspective seems impossible. It is hard to figure out what observability might look like once the storm blows itself out – assuming it ever does. Alas, the human mind is wired to seek out patterns and to try to impose some kind of conceptual order on the chaos that surrounds us. So, as foolish as it may be, here are some thoughts as we pass through the wormhole.
One of the biggest stories of the moment is the spectacular rise of the AI SRE. A category that was unknown not long ago is now a booming sector with dozens of vendors. These tools are not shiny toys or gimmicks. They are industrial-strength applications taking over complex tasks such as triage, root cause analysis and even remediation. They are proof that AI has climbed out of the trough of disillusionment and is handling demanding production workloads.
For a while I asked myself whether, by modelling the AI SRE around a human role, we were shunting AI tools into silos that reflect the limits of human capabilities. Instead, something quite momentous – and probably unexpected – has happened. Driven by intense competition, the relentless and increasingly rapid cycles of innovation in the sector have thrown up revolutionary insights that question some of the most fundamental assumptions of observability practice.
Most observability platforms have been founded on analysing the signals emitted by an application at run time. Whether it takes the form of logs, metrics and traces or of wide events, this is the visible mass of telemetry. In a ground-breaking LinkedIn post, Kyle Forster and his team recently revealed that, in their analysis, these classic signals made up only 30% of the information needed to understand system incidents. They also identified the nature of the dark matter that constitutes the other 70% of actionable data. It is not large cosmic structures; instead it is dust clouds of informal chat and local knowledge, along with the continual background noise of configuration change.
Ultimately, this is a truth that has probably been staring us in the face for some time. The majority of system outages are no longer caused by misbehaving application code. Your system is far more likely to be brought down by a DNS change than by an Invalid Use of Null. A case in point is this post mortem on the outages GitHub has suffered in recent months. One of the culprits was a reconfiguration of caching strategy – a change that set in motion an unforeseeable chain reaction and eventually brought parts of the system to a standstill.
This doesn’t mean that we should throw the MELT paradigm out of the window. It does mean, though, that the next stage of observability evolution involves building up richer context. Several studies have shown that a vast amount of the telemetry we produce is, essentially, redundant. As OllyGarden have demonstrated, much of this telemetry is not just redundant – it is actually bad. What we need is quality over quantity and more intelligent correlation with a broader range of background signals. OtterMon are one vendor pushing this notion to its limits. They claim that by sampling just one percent of telemetry flows they can build up a profile of system normality and then use that as the baseline for anomaly detection.
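OtterMon have not, as far as I know, published their internals, so the sketch below is only a toy illustration of the general idea rather than their method: head-sample a small fraction of the telemetry stream, summarise it into a statistical profile of "normal", then score new events against that baseline. The field names, sampling rate and z-score threshold are all invented for illustration.

```python
import random
import statistics

def sample_stream(events, rate=0.01, seed=7):
    """Head-sample roughly `rate` of the incoming telemetry events."""
    rng = random.Random(seed)
    return [e for e in events if rng.random() < rate]

def build_baseline(sampled_events, field="latency_ms"):
    """Summarise 'normal' behaviour from the sampled slice."""
    values = [e[field] for e in sampled_events]
    return {"mean": statistics.fmean(values), "stdev": statistics.pstdev(values)}

def is_anomalous(event, baseline, field="latency_ms", z_threshold=3.0):
    """Flag events that sit far outside the sampled baseline."""
    if baseline["stdev"] == 0:
        return False
    z_score = abs(event[field] - baseline["mean"]) / baseline["stdev"]
    return z_score > z_threshold

# Toy usage: 100,000 synthetic spans, then score one suspiciously slow request.
events = [{"latency_ms": random.gauss(120, 15)} for _ in range(100_000)]
baseline = build_baseline(sample_stream(events))
print(baseline)
print(is_anomalous({"latency_ms": 900}, baseline))
```

The interesting design question is not the statistics, which are trivial here, but whether a one-percent sample really captures enough of a system's behaviour to anchor anomaly detection – that is the claim the vendor is staking out.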
As well as the discovery of this dark matter, there are other emerging trends that could redraw the shape of observability. One of these is the erosion of existing functional boundaries, and cybersecurity is perhaps the most promising candidate for unification. Today, organizations pay to ingest the same petabytes of network logs into two different buckets – one for performance (Observability) and one for threats (SIEM) – largely because two different departments have historically looked at the same data through different lenses: one for health and one for threats. Companies such as Splunk and Dynatrace have already developed powerful SIEM capabilities, and it seems inevitable that AI will lead to greater convergence in this area.
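To make the "same data, two lenses" point concrete, here is a deliberately simplified sketch – not any vendor's pipeline – in which a single ingestion pass evaluates each log record against both a health rule set and a threat rule set, rather than landing the same bytes in two separate systems. Every field name and rule here is invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class LogRecord:
    source_ip: str
    path: str
    status: int
    latency_ms: float
    failed_logins: int = 0

def health_lens(rec: LogRecord) -> list[str]:
    """Performance and reliability findings: the 'observability' bucket."""
    findings = []
    if rec.status >= 500:
        findings.append("server error")
    if rec.latency_ms > 2000:
        findings.append("slow response")
    return findings

def threat_lens(rec: LogRecord) -> list[str]:
    """Security findings over the very same record: the 'SIEM' bucket."""
    findings = []
    if rec.failed_logins >= 5:
        findings.append("possible credential stuffing")
    if "/admin" in rec.path and rec.status in (401, 403):
        findings.append("blocked admin probe")
    return findings

def ingest_once(records):
    """A single ingestion pass that fans each record out to both lenses."""
    for rec in records:
        yield rec.source_ip, {"health": health_lens(rec), "threats": threat_lens(rec)}

# Toy usage
records = [
    LogRecord("10.0.0.5", "/admin/login", 403, 95.0, failed_logins=8),
    LogRecord("10.0.0.9", "/checkout", 503, 2400.0),
]
for ip, findings in ingest_once(records):
    print(ip, findings)
```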
One thing we can predict with a good deal of confidence is that there will be more software – a lot more software. Unfortunately, as software development becomes faster and more democratised, that code is increasingly likely to be unreliable. As this article notes, yes, we have more code, but we also have more Sev Ones.
In this LinkedIn post, Evgeny Popatov even argues that 'bad' code is almost the default. Anthropic are shipping code that is changing the world, and doing it at warp speed. Instead of checkpoints and handoffs, the control system is one of carefully defined guardrails and continuous observability. However, this is not observability that functions as a brain in a jar once code has been shipped. It is an agile, proactive observability – an entangled observability that provides verification at every step.
Perhaps, as Boris Tane has argued, the SDLC is being stripped down and compressed. As processes accelerate, roles themselves become more fluid. This is the new normal addressed by tools such as Hud, an app that dispenses with the traditional three pillars and instead uses a code sensor to build up application context, so that vibe coders can resolve issues and deploy changes in development-driven micro-cycles. This model may not work for everybody, but it is the new way of working for teams that want to move at AI velocity.
The advent of tools such as OpenClaw suggests that we are facing a proliferation of software written outside the bounds of the traditional process, potentially presenting ever greater risks. Observability will not just shift left; it will also have to become more proactively intelligent and improve its capabilities for prevention and prediction.
The leftward shift itself is not a new phenomenon. It has been happening for some time and is a trend that developed independently of the rise of AI. A number of vendors have already incorporated CI/CD observability into their portfolios, and the OpenTelemetry project has a CI/CD working group. Datadog also extended their vision of end-to-end observability with their recent acquisition of QA specialist Propolis.
Arguably, though, AI is accelerating the shift – bringing observability into every step of the SDLC. Kerno, which started out as a reliability platform, recently pivoted towards AI-driven integration testing. Antithesis are developing a powerful testing platform aimed at ramping up reliability by harnessing AI to find potential failures in any given execution path. PlayerZero, meanwhile, are attempting to predict which PRs might result in a production failure.
As we look more closely, we can see a profound transformation in which development, AI and observability coalesce. Observability is not a stage that succeeds development; instead it is a reflex that triggers continuously in the inner loop. It becomes ambient. This is a point made eloquently in this LinkedIn post by Sesh Nall, Head of Observability at Datadog. His team harnessed agents to construct gigantic software operations. Unfortunately, agents lose coherence at this scale, so guardrails are needed to remediate memory rot and drift. These guardrails are, in effect, observability interventions that correct deviation from the specified goal.
One of the most spectacular predictions accompanying the march of AI is that it will sound the death knell for traditional SaaS. When Satya Nadella added his voice to the chorus, the refrain almost took on the air of a fait accompli. If you can vibe-code a CRM or a billing engine over a weekend, why pay a monthly subscription to ServiceNow or Stripe? We have already seen headline-grabbing stories about stock values crashing as AI upstarts roll out updates that eat the lunch of established SaaS vendors.
I think, though, that the SaaS edifice has at least two lines of defence. Yes, within a few minutes a coding assistant can generate an API to ingest OpenTelemetry data. But writing the code is only 5% of the challenge. Turning that code into a product means building infrastructure that can handle petabytes of streaming data with 99.99% reliability – not to mention the small matter of security, failover storage and low-latency indexing.
Your AI will be able to specify a generic architecture, but whether it can really reason its way to the best architecture for your specific use case, or innovate to meet new challenges and stay a step ahead of the competition, is another question entirely. An app may accumulate stars on GitHub, but making a business out of it requires GTM, sales and support. Reliability, scalability and cultural capital are still pretty big moats.
As the cost of generating code hits the floor, the intelligence embedded in the trained model becomes the scarce commodity and the new source of intellectual property. The thousands of iterations that go into building context and understanding into a model can’t be replicated in an afternoon of vibe-coding. Training an enterprise-grade model requires vast reservoirs of real-world historical data and insights that can only be gained from endless cycles of training. Companies with ten years of logs have a massive advantage over a startup with an empty database.
There is a thesis that, once we achieve AGI, the curve of intelligence will go vertical, leaving humanity behind like galaxies at the edge of the observable universe receding at faster-than-light speed.
The counter-argument is that AI operates in two distinct loops. There is an inner loop of seemingly exponential technological progression, a realm where LLMs make cognitive leaps that even their designers no longer understand. Then there is the outer loop, the human layer of defining boundaries and assessing risks.
Despite the speed of AI-automated "inner loops," the ultimate velocity of progress will, hopefully, still be controlled by the human outer loop. For the past couple of years constraints have been stripped away and we have seen an arms race towards ever more powerful frontier models. However, it seems as though the power of models such as Mythos has had a sobering effect. Even the de-regulation purists are baulking at the thought of bad actors exploiting technology this powerful.
This is my take – and arguably an optimistic one – that governance will prevail over a catastrophic free-for-all. Some of the most eminent voices in the field have predicted much darker outcomes. Which of these two scenarios will materialise, we don’t know. The future, like our AI, is non-deterministic.