Many vendors I talk to are not even sure whether their product is an observability product, and still others, whose products almost certainly are observability products, don't want to market them as such because they think the space is becoming saturated. We have now more or less reached a consensus on what observability does not mean - i.e. that it is not just about the three pillars. At the same time, though, there is not much consensus about what it does mean. At the empirical end of the spectrum there are those who define observability as processing telemetry, while at the more abstract end there are positions such as "observability is about being able to ask questions of your system". You can have your observability burger your way! Some of us want to know if our K8S pods are overheating, and some of us want to understand our sales pipeline. But that's actually ok. As we say, it's a big tent.
What happened?? Why did that system go down? If we want to answer those questions, then we obviously need historical data. The problem is that modern computer systems generate huge volumes of telemetry. Since we never know in advance which system might fail, we play it safe and cover all our bases. This gives us a warm feeling, but it also means that we have the engineering problem of ingesting that data and the economic problem of paying for it. To be fair, the engineering problem of mass ingestion has largely been solved. Unfortunately, the problem of querying that mountain of data so that we can quickly make sense of it is still hard.
But it's still not really possible. Ultimately, as an SRE or a DevOps engineer, I would rather not have to figure out what went wrong. I would really like a system that could help me prevent the outage in the first place. Sure, it's great that my system can alert me that a pod went down with an OOM kill, but if I'm paying a six- or seven-figure sum for a state-of-the-art system - can't it actually figure this stuff out pre-emptively? I mean, I have fed it several petabytes of historical data - why can't it be a bit more predictive? At the moment, it turns out that this is still a hard problem.
Root cause analysis is a seductive phrase but, in practice, it is something of a chimera. Unfortunately, the mechanics of cause and effect are not always visible. Even more unfortunately, causes themselves are not always mechanical. In loosely coupled, complex and highly distributed systems, they can sometimes only be inferred. And, as a growing body of theory tells us, failures in complex systems are often not mono-causal. Maybe it is even the case that attempting to pinpoint a single cause for an error condition is a particular prejudice of human inquiry rather than a self-evidently correct approach.
Complexity is pretty much taken for granted in modern IT landscapes. It is part of the wallpaper. And then we pile one layer of complexity on top of another: distributed services, interconnected network topologies, enterprise messaging backbones. They are designed by clever people and they do complex, high-tech stuff. The terabytes of logs, metrics, traces, events etc. generated by these systems don't tell a story by themselves. Turning these huge heaps of data points into meaningful analyses requires deep domain knowledge as well as heavy-duty engineering and some highly skilled UI design. The complexity of observability systems is a function of the complexity they observe.
Very few organisations actually have dedicated observability specialists - at best they may have one or more DevOps or platform engineers who have some familiarity with some aspects of one or two observability products. Most organisations don't have staff with the experience and expertise to instrument their systems optimally, or to filter, forward, consolidate and configure that telemetry in the best way. Traversing a wide and unfamiliar landscape without a map or a compass is not a simple endeavour.
Procuring a unified observability system is a considerable undertaking. Effectively, you cannot really evaluate the system without re-instrumenting some of your existing services and infrastructure. It is not easy to do this without impacting an existing environment or spinning up a new one. You will also have to coordinate across multiple teams and set up a whole variety of testing scenarios. Most organisations simply do not have the time to go through this process over and over again just to compare vendor tools. Often this means that customers don't end up with the best tool for the job.
If you are a developer and you want to know about OOP or the principles of RESTful APIs, the chances are that the canonical texts are not written by a particular vendor. More likely they are the product of collaborations by networks of subject matter experts, or reflect a coalescence of academic traditions.
In the observability realm, much of the narrative-making tends to have a more vendor-led feel. Those vendors, though, are often focused on the concerns of big-ticket clients. The "problems", therefore, are often framed as petabyte-scale ingestion, cardinality explosions or traces with tens of thousands of spans. This results in heroic feats of engineering that captivate audiences at conferences (me included!), but it may not reflect the actual day-to-day concerns of many practitioners.

Organisations that have adopted OpenTelemetry will not have to go through the pain of re-instrumenting their code in order to evaluate or switch to a new vendor - a sketch of what that vendor-neutral instrumentation looks like follows below. Equally, eBPF offers the possibility of zero-code instrumentation for companies running compatible workloads. And, of course, AI tooling holds out the prospect of automation to help mitigate the observability skills gap. Will this lead us to a technological nirvana of systems functioning in perfect harmony? Hopefully not; the hard problems are the best ones.
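To make the OpenTelemetry point concrete, here is a minimal sketch using the OpenTelemetry Python SDK. The service name, span name and collector endpoint are illustrative assumptions; the point is that the instrumentation calls themselves are vendor-neutral, and switching backends only means pointing the OTLP exporter (or an intermediate Collector) at a different endpoint.

```python
# A minimal, illustrative sketch of vendor-neutral instrumentation with the
# OpenTelemetry Python SDK. The endpoint and names below are assumptions for
# illustration; in practice they would come from your own configuration.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# The only backend-specific detail is where the OTLP exporter sends spans.
# Swapping observability vendors means changing this endpoint (or the
# Collector config behind it), not the instrumentation in your services.
exporter = OTLPSpanExporter(endpoint="http://localhost:4317")

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

with tracer.start_as_current_span("process_order"):
    pass  # business logic stays unchanged whichever backend receives the data
```

The same idea applies to metrics and logs: the SDK calls stay put, and only the export pipeline changes.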