The observability space is a dynamic and heterogeneous realm. It is a habitat where commercial giants such as Datadog, New Relic and Splunk co-exist alongside the big open source/freemium stacks such as ELK, Prometheus/Loki/Grafana and TIG (Telegraf/InfluxDB/Grafana). In recent years though, three technological trends have coalesced to reshape the landscape and pave the way for new stacks built on a new generation of powerful open source technologies and standards. The three new elements in the observability periodic table are OpenTelemetry, eBPF and ClickHouse.
Each of these is an independent element with totally different characteristics and functions, but a new breed of observability contenders is assembling them into powerful frameworks.
OpenTelemetry is the second most active project in the CNCF (behind Kubernetes) and is the most significant project in observability today. Its principal benefit is to bring a lingua franca to the pre-existing Tower of Babel of competing protocols and semantics. It has also, however, had two further effects. Firstly, it empowers customers by freeing them from vendor lock-in. Secondly, it is a powerful enabler for new entrants into the market. OpenTelemetry compliance is attractive to consumers, and new entrants can enjoy the advantage of adopting it natively, whereas incumbents have to retrofit or re-engineer existing codebases. OpenTelemetry also currently provides SDKs in 12 languages - thus obviating much of the pain and effort in writing your own agents. In addition to this, with a few Helm commands, you can deploy the OpenTelemetry Collector - providing you with a robust and immensely powerful gateway for ingestion of telemetry.
The second element is eBPF. Opening up the Linux Kernel may seem like an esoteric techie endeavour. Its impact, however, has been seismic. It follows that if applications can safely access the kernel then they can have complete visibility of processes. From this stream of visibility flows the raw material for constructing traces and calculating metrics directly. To be fair, transforming that raw material into telemetry (and doing so without an excessive footprint) is far from trivial. At the same time though, the building blocks for that transformation can also be found in many open source repositories. Being able to tap into those repositories and build zero-instrumentation products endows observability vendors with game-changing capabilities.
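To make the idea of calculating metrics directly from kernel-level visibility a little more concrete, here is a minimal Python sketch of the aggregation step only. It assumes a hypothetical `RequestEvent` record of the kind a user-space agent might assemble from eBPF probe output; the field names and the `red_metrics` helper are illustrative and are not taken from any particular product. The eBPF programs themselves run in the kernel and are not shown.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RequestEvent:
    """Hypothetical record assembled by an agent from eBPF probe output."""
    service: str
    start_ns: int      # request start, in nanoseconds
    end_ns: int        # response end, in nanoseconds
    status_code: int   # HTTP status observed on the wire


def red_metrics(events, window_seconds):
    """Aggregate raw events into per-service Rate, Errors and Duration."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event.service].append(event)

    metrics = {}
    for service, items in grouped.items():
        durations_ms = [(e.end_ns - e.start_ns) / 1e6 for e in items]
        errors = sum(1 for e in items if e.status_code >= 500)
        metrics[service] = {
            "rate_per_s": len(items) / window_seconds,
            "error_ratio": errors / len(items),
            "avg_duration_ms": sum(durations_ms) / len(durations_ms),
        }
    return metrics


# Three made-up events observed in a 10 second window.
events = [
    RequestEvent("checkout", 0, 20_000_000, 200),
    RequestEvent("checkout", 5_000_000, 45_000_000, 500),
    RequestEvent("cart", 1_000_000, 11_000_000, 200),
]
print(red_metrics(events, window_seconds=10))
```

The point of the sketch is that no application code had to emit these numbers: once request boundaries and status codes are visible from outside the process, rate, error and duration metrics are a matter of aggregation.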
The third element is ClickHouse, the alien-tech of our brave new observability architecture. It has touched down like an escapee from Lockheed Skunkworks that can achieve faster-than-light speeds. Meanwhile, its compression algorithms warp the fabric of disk space so that it is bigger on the inside than on the outside. Ok - maybe that is a bit over the top. However, ClickHouse has emerged as the backend database for a remarkable number of disruptors and startups - Groundcover, SigNoz and DeepFlow to name but a few. Each time the rationale is the same. The ClickHouse engine is open source, can ingest at vast scale and query at speed. Its ultra-efficient compression also means cost savings which can be passed on to customers. As it's open source there are also no database license fees.
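One reason columnar stores can compress telemetry so well is that values of the same kind sit next to each other, so simple encodings pay off. ClickHouse offers column codecs such as Delta and DoubleDelta alongside general-purpose compression. The toy Python sketch below is not ClickHouse's implementation; it uses `zlib` as a stand-in compressor and a made-up column of timestamps purely to illustrate why delta encoding helps.

```python
import struct
import zlib


def delta_encode(values):
    """Keep the first value, then store each difference from the previous one."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values with a running sum of the deltas."""
    values = []
    total = 0
    for d in deltas:
        total += d
        values.append(total)
    return values


def compressed_size(values):
    """Pack as 64-bit integers and return the zlib-compressed size in bytes."""
    raw = struct.pack(f"<{len(values)}q", *values)
    return len(zlib.compress(raw))


# Hypothetical column of millisecond timestamps, one sample per second.
timestamps = [1_700_000_000_000 + i * 1000 for i in range(1000)]
deltas = delta_encode(timestamps)

assert delta_decode(deltas) == timestamps  # the encoding is lossless
print("raw:", compressed_size(timestamps), "delta:", compressed_size(deltas))
```

Because regularly sampled timestamps turn into a long run of identical deltas, the delta-encoded column compresses to a fraction of the size of the raw one. Real telemetry is messier, so the gain in practice depends on the data.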
A number of savvy and agile startups such as OneUptime, KloudMate and Groundcover have realized that these three components provide cost-free building blocks for assembling extremely powerful, robust and cost-effective systems - we might call it the OeC stack. All of the heavy lifting of storing, capturing and structuring telemetry has already been done. In the Open Source Hotel, there is such a thing as a free lunch. In fact, with OpenTelemetry, eBPF and ClickHouse you can feast on a free breakfast, lunch and dinner. Obviously, you cannot get these components to work together without sophisticated engineering and domain knowledge - however, the barriers to entry have been drastically lowered.
With the advent of "cloud-native" computing, it is even possible to exclude OpenTelemetry and run on a slimmed down eBPF/ClickHouse stack. Obviously, the term "cloud native" is somewhat nebulous and is used very loosely as a marketing term. Effectively, it tends to mean running microservices in cloud-hosted K8S clusters. If your application follows this pattern then, theoretically, an eBPF-driven engine can generate all your traces and metrics with zero instrumentation. This is the premise of Cloud Native solutions such as Groundcover, DeepFlow and Coroot (although Coroot does also support OpenTelemetry for logs and traces). There are also other products currently in stealth mode that are adopting this architecture.
In the real world, it is unlikely that many systems will have such a conveniently clean and simple topology. Most distributed systems will also make calls to serverless apps, third party APIs and messaging systems. Having requests originate outside of the boundaries of K8S is not necessarily a deal-breaker though. Products such as Grafana Beyla are capable of appending spans to incoming traces, so that the telemetry chain is not broken.
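The mechanics of joining an incoming trace can be sketched in a few lines of Python. The example below is not Beyla's implementation. It assumes that the caller propagates context using the W3C Trace Context `traceparent` header and only handles the four-field layout of that header; the helper names `parse_traceparent` and `continue_trace` are made up for illustration.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, flags), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero trace or span ids are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id, flags


def continue_trace(incoming_header):
    """Create a span context that joins the caller's trace when one is present."""
    parsed = parse_traceparent(incoming_header or "")
    if parsed is None:
        # No usable context: start a new trace with no parent.
        trace_id, parent_id, flags = secrets.token_hex(16), None, "01"
    else:
        trace_id, parent_id, flags = parsed
    span_id = secrets.token_hex(8)
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "span_id": span_id,
        # Header to send on any outgoing call so the chain continues.
        "outgoing_traceparent": f"00-{trace_id}-{span_id}-{flags}",
    }


ctx = continue_trace("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"], ctx["parent_span_id"], ctx["span_id"])
```

The new span keeps the caller's trace id and records the caller's span id as its parent, which is what keeps the chain intact when the request started outside the cluster.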
An increasing number of vendors are questioning the LMT (Logs, Metrics, Traces) paradigm and pulling away into a post-instrumentation era. Companies such as Honeycomb natively support OpenTelemetry, yet their observability thought leaders have been vocal about the fragmented nature of the current paradigm and have advocated for a pattern based on "arbitrarily-wide structured log events". Perhaps occupying conceptually similar ground is the approach taken by vendors such as Observe and Dynatrace, who ingest multiple telemetry types into huge data lakes for combined analysis.
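As a rough illustration of what an "arbitrarily-wide structured log event" might look like in practice, the Python sketch below accumulates context while a request is handled and emits a single JSON record at the end. The `WideEvent` class and every field name are assumptions made for the example, not any vendor's API.

```python
import json
import time


class WideEvent:
    """Accumulates fields for one unit of work and emits them as a single record."""

    def __init__(self, **initial_fields):
        self.fields = dict(initial_fields)
        self._start = time.monotonic()

    def add(self, **fields):
        # Any code path handling the request can attach more context.
        self.fields.update(fields)

    def emit(self):
        # Duration is measured from construction to emission.
        elapsed = time.monotonic() - self._start
        self.fields["duration_ms"] = round(elapsed * 1000, 3)
        return json.dumps(self.fields, sort_keys=True)


event = WideEvent(service="checkout", route="/pay", trace_id="abc123")
event.add(user_tier="gold", cart_items=3)
event.add(db_queries=4, cache_hit=False, status_code=200)
print(event.emit())
```

Instead of a log line, a counter increment and a span each carrying a slice of the story, one record holds everything known about the request, and metrics or trace views can be derived from it at query time.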
These trends lead to a nagging existential question around the whole OpenTelemetry project. It is, as we have said earlier, a magnificent edifice. But what if two of its three main pillars become redundant? Does it suffer the same fate as other meticulously constructed endeavours which are suddenly rendered obsolete by an evolutionary leap? The answer is probably not. Logs are still an essential pillar of observability and providers will still need to ingest them somehow. Even here, though, Groundcover's solution requires zero logging configuration. Its eBPF-powered Alligator agent will ingest logs automatically and forward them to an internal OpenTelemetry Collector. Equally, APM is not the only purpose of metrics collection - the use case of collecting device metrics to monitor and diagnose infrastructure, IoT and so on will not be going away.
In this listing we will look at vendors who have adopted at least two of the three elements of the stack. Obviously, many companies support OpenTelemetry but those listed here rely on it as their principal ingestion mechanism.
Vendor | OTel | eBPF | ClickHouse
---|---|---|---
SigNoz | ✓ | | ✓
OneUptime | ✓ | ✓ | ✓
Groundcover | ✓ | ✓ | ✓
KloudMate | ✓ | ✓ | ✓
Coroot | ✓ | ✓ | ✓
DeepFlow | ✓ | ✓ | ✓
HyperDX | ✓ | | ✓
In reality, the likelihood is that OeC/eC is not going to replace the LMT paradigm or make OpenTelemetry obsolete any time soon. It is more likely a sign that observability is a rapidly evolving space and a growing market. Observability is in a state of flux and is facing ever greater demands - it is being applied across more technologies, more architectures, and more domains. It is also being asked to provide richer insights and lower costs. As the market enlarges there is increasing space for niche players to enter with newer technologies and architectures aimed at solving more specific problems. This is because observability is not a zero-sum game - it is a growing ecosystem where many paradigms and technologies may coexist, integrate and complement one another.