Grey matter still matters. Maybe the best way to stem the tide of slop is to keep the human in the gloop.
There is a tendency to think that the major problems of observability are data problems or engineering problems or cost problems or maybe design problems. There is also a temptation to think that we can best serve users by reducing cognitive load - enacting the "Don't Make Me Think" mantra. So we throw AI and processing power and smart engineering and cheap storage into the mix. This can be effective, but in some ways it is attacking the symptom rather than the cause.
Should observability really be about designing engineering solutions that brute-force their way through mountains of redundant telemetry in order to separate out the signals from the noise? Many systems today boast that they are capable of signal correlation - which is obviously a benefit. But maybe it raises the question of how the signals got siloed apart in the first place.
A few recently published articles point the way to a more holistic and preventive strategy for building observability systems. These are approaches that focus on crafting telemetry at the user end rather than processing it at the vendor end. Rather than leaving observability vendors with the problem of processing excessive volumes of incoherent telemetry, maybe we could address the problem at source by getting smarter with instrumentation.
The first articles are part of a series being published by Juraci Paixão Kröhling. Juraci is an OTel contributor, former Grafana engineer and the founder of OllyGarden. If you check out the OllyGarden website or read any of the blog articles, you will get the sense that this is a project with an ecological sensibility. The garden is not just about cultivating good telemetry for the sake of good engineering, but also because all of those wasted CPU cycles burn CO2.
The premise of OllyGarden is that the problem of bad telemetry is one that should be tackled at the root. Having systems that can ingest and filter at scale is great, but what if we just created better quality, leaner telemetry in the first place? This then raises the question - what is bad telemetry? Well, this is where OllyGarden really blossoms - it actually provides tooling for analysing your telemetry and evaluating it against a set of objective criteria. It will then provide detailed feedback on how you can improve your instrumentation. This has the potential to do for observability engineers what tools like ReSharper did for software engineers. It is an assistant that helps you write better instrumentation, and in doing so produce higher quality and more efficient telemetry.
In some ways, observability is a relatively young discipline. Computer programming has been around for several decades and there is a huge body of canonical knowledge on best practice. In observability, however, there is no Knuth. There is no Observability 101 on instrumentation. Yes, there are many, many engineers writing individual blogs and making great recommendations on particular tactics or tweaks, but achieving best practice is often a matter of trial and error undertaken by individuals working in isolation.
The OllyGarden tooling changes all this. It is both a CT scan for your code and a mentor. It will analyse your instrumentation, detect 'code smells' and then suggest cleaner and more efficient alternatives. This addresses two potential sources of bad telemetry.
The first of these sources is badly crafted custom telemetry. This can take many shapes and forms. Many developers may never hand-craft a metric or a trace. For most of us, logging is as far as we go. In some ways, rolling your own telemetry is like rolling your own security. Yes, you can do it, but the potential for doing it badly is pretty huge. In security contexts, the result is that you create more vulnerabilities than you solve. In telemetry, it means that the code you craft to monitor issues can just create other issues. The obvious example is the proverbial cardinality explosion. As the OllyGarden team show, though, there are many other types of aberration that may spike your telemetry volumes and your costs.
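To make the cardinality point concrete, here is a minimal sketch using the OpenTelemetry Python metrics API. The service, metric and attribute names are all illustrative, not prescriptions:

```python
from opentelemetry import metrics

meter = metrics.get_meter("checkout-service")

request_counter = meter.create_counter(
    "http.server.requests",
    description="Count of handled HTTP requests",
)

def record_request_badly(user_id: str, path: str) -> None:
    # BAD: user_id and the raw path are unbounded, so every user and
    # every URL variant mints a new time series - the proverbial
    # cardinality explosion.
    request_counter.add(1, {"user.id": user_id, "http.path": path})

def record_request_well(route_template: str, status_code: int) -> None:
    # BETTER: bounded attributes (a route template, a status class)
    # keep the series count flat; per-user detail belongs on spans
    # or logs, not on metric labels.
    request_counter.add(
        1,
        {
            "http.route": route_template,
            "http.status_class": f"{status_code // 100}xx",
        },
    )
```

The difference between the two functions looks trivial in code, but the first can multiply your storage bill by the size of your user base.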
The other source is seemingly more innocuous but may still be problematic, and that is using instrumentation SDKs. For many of us, there is a temptation to think that auto-instrumentation with OTel and other instrumentation SDKs is a no-brainer and a bit of a free lunch. Unfortunately, this involves an element of wishful thinking. Yes, these packages are created by very smart people, but they inevitably make assumptions and work on averages and generalised use cases which may not be relevant or performant for your application.
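As one small illustration of taking back control from the defaults, here is a hedged sketch in which the OTel Python SDK's always-on sampling default is swapped for a parent-based ratio sampler. The 10% ratio is an arbitrary value for illustration, not a recommendation:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Replace the default sampler with a parent-based ratio sampler:
# root spans are kept roughly 10% of the time, and child spans follow
# their parent's decision so sampled traces stay complete.
provider = TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.1)))
trace.set_tracer_provider(provider)
```

Whether 10%, 50% or 100% is right for you is exactly the kind of application-specific judgment the defaults cannot make on your behalf.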
The problem is that the majority of SMEs do not have engineers with both the time and the skills to make sure that instrumentation is optimised and customised to the requirements of their application. The beauty of OllyGarden is not that it will throttle or divert your telemetry to a cheaper backend. Instead, it will support you in writing better telemetry.
The second major contribution to the Intelligent Observability approach was this article by Austin Parker on the OpenTelemetry blog. Again, this is quite a short piece, and it is not screaming in block capitals with a click-baity title like "Logging - you are doing it wrong!!!". At the same time, though, when you digest the article, the thought you may be left with is "LOGGING - AM I DOING IT WRONG??!!".
How can this be? Surely it is hard to get logging wrong? Surely everybody should be grateful that I am even bothering to manually log my code, especially if I am going above and beyond the call of duty by actually logging key business events rather than just logging my exceptions.
Yes, if you are logging key events and transactions, that is great. If you are using structured logging - even better. However, if you really want to take your logging to the next level, then maybe we can return to our earlier point about the separation of signals. The point here is that many events which are important enough to be logged do not occur discretely or in isolation. Often, they occur within the context of a transaction. In that case, why not attach those logs to that context at their point of creation? In general, this means embedding those logs within a trace.
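A minimal sketch of what this looks like with the OpenTelemetry Python API: instead of emitting an orphaned log line, the business event is recorded on the active span, so it is born inside its transaction context. The span, event and attribute names here are illustrative:

```python
from opentelemetry import trace

tracer = trace.get_tracer("order-service")

def place_order(order_id: str, total: float) -> None:
    with tracer.start_as_current_span("place_order") as span:
        span.set_attribute("order.id", order_id)
        # Instead of logger.info("order %s placed", order_id), attach
        # the event to the span - it now carries its trace context for
        # free and never needs to be correlated after the fact.
        span.add_event("order.placed", {"order.total": total})
```

The log line has not disappeared; it has simply been created where it belongs, inside the transaction it describes.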
Again, this is a practice which, for the present at least, requires an injection of human intelligence. It requires familiarity with the business logic of your code and making intelligent decisions about the nature of your logging. You need to understand which events are important and which attributes you need to gather. Unlike the problem that OllyGarden is addressing, this does not necessarily mean that you need technical knowledge of the most efficient way to construct a histogram, but it does mean that you need to know your business processes and form opinions about what information is important.
This then obviates the need for correlation. The signals are already combined. This is an argument that Honeycomb have been making for many years in their advocacy of the "wide event". This has always been an attractive concept, but given that it has been mostly pushed by Honeycomb, there has been a certain hesitancy around embracing it. Is it genuinely useful, or is it just the preference of a particular vendor that happens to suit their particular data structures?
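For illustration, a wide event can be as simple as a single span that carries the whole request context as attributes, so there is nothing left to correlate afterwards. Every attribute name below is an illustrative assumption, not a schema:

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

# One "wide event": a single span dense with business and operational
# context, rather than a scatter of thin log lines that need stitching
# back together at debugging time.
with tracer.start_as_current_span("checkout") as span:
    span.set_attributes({
        "user.plan": "pro",
        "cart.item_count": 3,
        "cart.total": 42.50,
        "payment.provider": "stripe",
        "feature_flag.new_checkout": True,
        "service.version": "1.4.2",
    })
```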
Well, they are no longer the only advocates of this model: ClickHouse have also thrown their weight behind this approach in their article for the launch of the ClickStack solution. There seems to be an increasing consensus that traces are the golden signal of distributed service instrumentation, and that they are the natural container for many log events that were previously generated as discrete, disconnected entities which subsequently had to be pulled together into a debugging context.
I started off this piece by saying "it's not about AI". I guess I should qualify this with an "as far as I know" or a "not yet". Who knows - given that OllyGarden are developing formal systems for measuring and improving telemetry quality, there is obviously the possibility that before long machines will learn and apply these rules. For the moment, though, the human is very much part of this semantic loop.