2025 has been yet another incredibly dynamic and eventful year in the observability space. From a technological point of view, the big story was obviously AI, which moved from hype to practical reality, with vendors across the board shipping meaningful AI-driven innovations.
Functionally, we have seen vendors expanding the scope of their products - with features such as RUM, LLM observability and telemetry pipelines becoming almost standard. With the emergence of visionaries such as OllyGarden, as well as the growing buzz around developer-driven observability, we are also seeing a shift-left movement that recognises the importance of instrumentation quality.
The marketplace itself continues to boom. ClickHouse made their much-anticipated move into the market, Dash0 saw spectacular growth and an incredible number of vendors have entered the market in specialisms such as pipelines, SRE tooling and LLM observability - to name but a few.
Once again, we have invited four of the leading figures in the observability space to share their thoughts, experiences and insights from the past 12 months.
Diana Todea is a Developer Experience Engineer at VictoriaMetrics, with a strong background in SRE. She is Co-lead for the CNCF Neurodiversity working group and an OpenTelemetry Community Award winner.
Over the past year, observability has continued its transition from a niche engineering practice into a core capability for modern systems. In my own work, I have seen teams evolve from simply gathering metrics to truly understanding how signals relate to each other. When traces, logs and profiles are used together, they create a much clearer picture of how systems behave in real situations. One challenge that keeps coming up is the rising cost and complexity of data. High cardinality metrics, overwhelming log volumes and scattered toolchains often make it harder to find what actually matters. These pressures are now encouraging teams to adopt more efficient and sustainable approaches to observability, not only in terms of cost but also in terms of resource usage and operational impact.
Across the community, OpenTelemetry has been a major area of focus. The conversation has shifted from raw instrumentation to providing end users with genuinely easy onboarding, simpler setup and more automation. This reflects a broader desire for tools that reduce friction rather than add to it. Lightweight storage, cost-aware retention strategies and efficient querying are also becoming more important. Events throughout the year show a clear appetite for simplicity, transparency and better control over data.
Looking ahead, I expect observability to become closely connected with AI assisted workflows, helping reduce investigation time and making insights more accessible. I also believe sustainable data practices will grow in importance as organizations look to balance detail, cost and environmental responsibility. Observability will remain essential, but its future will be shaped by being smarter, more efficient and more mindful of the resources it consumes.
Bill Mulligan is a Community Pollinator at Isovalent and is the author of the eCHO News newsletter. He is a champion of eBPF and cloud native computing as well as being a contributor to the Cilium project.
If I had to summarize 2025 in a sentence, it would be that observability finally matured from a data business into a systems problem again. Instead of living as a separate vendor-shaped appendage bolted onto the side of your stack, observability is becoming part of the software development lifecycle itself. So, what exactly does that mean in practice?
For one, the "ship everything to a SaaS backend and pray" model is collapsing under its own weight. Teams can't firehose infinite logs and metrics into someone else's cloud, wait 90 seconds, and pretend that's observability. Modern systems demand something closer to real-time introspection and ideally something that can sort, filter, and interpret the noise before it becomes a bill. The teams that get observability aren't the ones collecting the most telemetry. They're the ones producing actual insight at the moment and place where the failure occurs.
To do that, they're wrestling with the holy trinity of cost, cardinality, and control. I'm seeing eBPF-based observability tools deal with this by moving towards more in-kernel filtering, more context-aware sampling, and fewer blind round-trips to centralized backends outside the kernel.
Observability isn't a tool you buy anymore, rather it's a property of the platform you build.
Adriana Villela is a Principal Developer Advocate at Dynatrace as well as being a CNCF Ambassador and OpenTelemetry SIG Maintainer. She is a prolific speaker and writer and also hosts the Geeking Out podcast.
As 2025 comes to a close, one thing that has stood out for me is that, as a whole, the industry is past Observability's honeymoon phase. We've gone from "How do we do it?" to "Are we doing a good enough job?", which means we're now seeing chatter around a couple of key topics.
One is the cost of telemetry. The more telemetry you emit, the more it costs you. Organizations are seeing their Observability bills skyrocket and are realizing that they can't emit All The Data. As a result, they are looking for ways to cut telemetry costs while still keeping their systems observable. What does that entail? Expect to see more on OTel topics like sampling and OTel Arrow as ways to address this.
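The cost-cutting idea behind head sampling can be sketched in plain Python. This is an illustration only, not OTel SDK code, and the function name is invented here: hashing the trace ID (rather than rolling a die) means every service that sees the same trace makes the same keep/drop decision, so the traces you do keep stay complete end to end.

```python
import hashlib

def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Deterministically decide whether to keep a trace.

    Hashing the trace ID gives a consistent decision across every
    service that participates in the trace, unlike random sampling.
    """
    # Map the trace ID to a pseudo-uniform number in [0, 1).
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

# At a 10% sample rate, roughly one in ten traces is retained.
kept = sum(keep_trace(f"trace-{i}", 0.10) for i in range(10_000))
```

Real-world setups usually layer this with tail sampling, which can keep every error trace while discarding most of the happy path.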
Another hot topic is the quality of telemetry. You can emit all the telemetry in the world, but if it's bad, then your systems won't be super observable. Fortunately, the newly-launched Instrumentation Score is looking to address this. Although not an official OTel initiative, many OTel folks are involved. This, combined with leveraging OTel Weaver's schema creation and validation capabilities, will be an unstoppable combination for improving telemetry quality.
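To make "telemetry quality" concrete, here is a toy scorer in the spirit of that idea. The checks and weights below are invented for illustration; the actual Instrumentation Score defines its own rules, so treat this purely as a sketch of the concept.

```python
def score_span(span: dict) -> int:
    """Toy quality score for a span, 0-100 (illustrative rules only)."""
    score = 100
    # Generic names like "HTTP GET" make traces hard to group and read.
    if span.get("name") in (None, "", "HTTP GET", "unknown"):
        score -= 40
    # Spans without a service.name can't be attributed to an owner.
    if "service.name" not in span.get("resource", {}):
        score -= 30
    # A span with no attributes usually carries no debugging context.
    if not span.get("attributes"):
        score -= 30
    return max(score, 0)

good = score_span({
    "name": "GET /checkout",
    "resource": {"service.name": "cart"},
    "attributes": {"http.status_code": 200},
})
bad = score_span({"name": "HTTP GET", "resource": {}, "attributes": {}})
```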
Exciting times! I can't wait to see these topics evolve in 2026!
Juraci Paixão Kröhling is Co-Founder and CEO of OllyGarden, creating tooling to raise telemetry standards. He serves on the OpenTelemetry Governance Committee and is an emeritus maintainer of Jaeger.
This year, my attention was dedicated to the problem of "bad telemetry," as it's the main source of inefficiency in nearly every telemetry pipeline I've encountered. While we hear vendors saying louder and louder that companies are failing at observability because they are not sending enough telemetry, and that the metadata for that telemetry is insufficient, the reality is that most of the telemetry being generated is just junk: single-span traces for static assets or uninteresting health checks, outdated lists of IP addresses as resource attributes for metrics, PII or sensitive data captured by auto-instrumentation libraries. I've seen all of that, multiple times, this year.
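The kinds of junk described above are often cheap to detect before export. A minimal sketch, with heuristic rules I've invented for illustration (real pipelines would express this as collector processor configuration rather than application code):

```python
# Paths whose single-span traces rarely carry diagnostic value.
NOISE_PATHS = ("/healthz", "/livez", "/favicon.ico")
STATIC_SUFFIXES = (".css", ".js", ".png", ".woff2")

def is_junk(span: dict) -> bool:
    """Heuristic filter for low-value spans (illustrative rules only)."""
    path = span.get("attributes", {}).get("url.path", "")
    return path in NOISE_PATHS or path.endswith(STATIC_SUFFIXES)

spans = [
    {"attributes": {"url.path": "/healthz"}},
    {"attributes": {"url.path": "/app.css"}},
    {"attributes": {"url.path": "/api/orders"}},
]
kept = [s for s in spans if not is_junk(s)]
```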
There's one growing niche in observability that is starting to get really affected by this: AI tools, such as AI SRE agents. Bad telemetry confuses AI agents, causing them to take wrong turns and delaying resolutions (hint: it confuses humans too!).
When it comes to OpenTelemetry, this has been the year of Weaver, in my opinion. We had quite a few important developments and donations, not to mention the stabilization work across so many SIGs, but Weaver represents something bigger to me. It's the signal that we are maturing in our telemetry practices, moving from "capture everything" to "here are the tools to apply governance rules to your telemetry generation and collection."
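The governance idea here can be illustrated with a tiny attribute-schema check. The schema and function below are hypothetical; OTel Weaver works from semantic-convention registries and generates far more than this, but the underlying move is the same: declare what telemetry should look like, then validate what is actually emitted.

```python
# A tiny, hypothetical schema: attribute name -> expected Python type.
SCHEMA = {"http.request.method": str, "http.response.status_code": int}

def validate(attrs: dict) -> list:
    """Return violations of the declared schema (illustration only)."""
    problems = []
    for key, expected in SCHEMA.items():
        if key not in attrs:
            problems.append(f"missing attribute: {key}")
        elif not isinstance(attrs[key], expected):
            problems.append(f"wrong type for {key}")
    return problems

ok = validate({"http.request.method": "GET",
               "http.response.status_code": 200})
# A status code emitted as a string is a classic quality bug.
bad = validate({"http.request.method": "GET",
                "http.response.status_code": "200"})
```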
Outside of my bubble, this was definitely the year we started crawling with the first AI products for observability. While last year the mandate at vendors was to understand how AI could be applied to their solutions, 2025 saw the first crop of that work. Most are unimpressive, over-promising and under-delivering, but that's how it is at this stage. If you've tried to generate a RED dashboard backed by OpenTelemetry metrics using an AI assistant, you know what I'm talking about. On a more positive note, most players published their MCP servers, allowing people to explore new workflows for interacting with their telemetry.
It's not too hard to predict that 2026 is going to be the year where we apply the learnings about what works and what doesn't when it comes to AI and observability. We'll see agents helping developers perform good instrumentation without the learning curve. We'll see observability move from "give me answers to questions I didn't think about before" to "tell me something I don't know but should." Or perhaps we'll have agents that just proactively surface useful insights without being asked.
And here's something else that 2025 showed us: we have many reasons to be excited about 2026.
A massive thanks to Adriana, Bill, Diana and Juraci for sharing their thoughts and insights. We look forward to following their respective journeys over the next year.