When you look at the tone and content of many articles being written today on observability, it is hard to avoid the feeling that the discipline is at something of a crossroads. The market is expanding and diversifying rapidly. The technology is evolving, and more and more companies are seeking to de-silo their data and harness it across multiple domains to drive cycles of continuous improvement. The explosion of telemetry data, the AI revolution, migration to the cloud, the shift to microservices and many other trends seem to be leading to a re-think of the very nature of observability theory and practice.
Up until recently, many commentators spoke of the Three Pillars of Observability – i.e. Metrics, Logs and Traces. With the inclusion of Events, this has morphed into the MELT paradigm. Now that the OpenTelemetry project has announced the adoption of Profiling as a telemetry signal, we will need to expand the acronym once again.
In practical terms, the observability market which has grown around these concerns could be roughly divided into three categories:
Point Products – i.e. products that specialise in a subset of these concerns, such as log analytics or profiling. Products such as Graylog and Prometheus would be classic examples of this type. The category would also include products such as Sentry or VictoriaMetrics.
Full Stack Products – this would encompass observability platforms capable of ingesting logs, metrics and traces and providing services such as analytics and diagnostics. This would include products such as SigNoz, Coralogix and Chronosphere.
Full Spectrum Products – this category refers to platforms that offer not only the ‘full stack’ of metrics, logs and traces but which also offer a range of additional enterprise services such as SLO Management, SIEM, Alerting, Profiling and, in the case of some vendors, specialist features such as LLM Observability. This would include platforms such as Datadog, New Relic and Splunk.
A number of the full spectrum (and even full stack) vendors dangle the tantalising possibility of the “Single Pane of Glass” (SPOG) – the ability to gain visibility across all of your applications and infrastructure within a single vendor offering. This is a very seductive sales pitch. It offers the prospect of tremendous power without the technical and administrative overheads of managing integrations and dealing with multiple vendors.
Whilst a single pane of glass may seem superficially desirable, it can actually be a source of tension and disenchantment. By definition, it means having one overarching system doing everything. This means that the same product is being used by both development and infrastructure teams. The problem is that these are teams with fundamentally differing needs, and it is not necessarily easy to satisfy both within one monolithic product.
Not surprisingly, the SPOG vendors regularly publish reports where they highlight the ‘problem’ of ‘tool sprawl’ and emphasise the need for ‘consolidated tooling’. Tool sprawl is a rather strange condition though – vendors swear that it is a problem, but most engineers seem to be blithely unaware of it. Engineers managing observability platforms are generally techies for whom mastery of numerous tools goes with the territory, just as a carpenter will have more than one kind of hammer and a chef will have many knives.
A further problem of the SPOG is that it can create a kind of inertia within organisations. Once you have your all-in-one solution, it becomes harder for engineers to persuade managers to adopt additional tooling – not least because there may not be interoperability with the SPOG. This often works to the detriment of developers, since the purchasing decisions behind SPOGs tend to be made by infrastructure engineers – on the basis that the SPOG will be running on their infra and they will be the people installing and maintaining it. Companies such as InfraStack are developing products to meet this disconnect head-on. As they note on their blog, many systems are “primarily built to meet the needs of DevOps, Site Reliability, Traffic, and Infrastructure Experts for production workloads”.
Interestingly, solutions such as Datadog or Dynatrace, which IT managers often regard as their all-in-one solution, do not have to be procured as monolithic products. Instead, their APM and infrastructure monitoring capabilities can be bought as individual modules. This does open the door for a more pluralistic approach.
On the face of it, another benefit of the SPOG is the advantage of having all of your observability data consolidated into a single backend. In reality, not all vendors exploit the potential that this offers. Even though vendors may ingest the full range of telemetry signals, in some cases their architectures may still be siloed. Signals are kept in separate backend datastores and creating correlations between them can be either difficult or impossible.
This is not the case for all vendors, and a number of companies have sought to meet the challenge of ‘Observability 2.0’ by either defining new architectures or seeking to synthesise complementary products. We will look at three examples of this trend. The first two, Observe and Dynatrace, have taken the unified backend/data-lake approach, whilst the third, Cisco, have taken what we might call a ‘synthetic’ approach of weaving application layers together behind a common interface. These are by no means the only companies seeking to pool, correlate or unify telemetry and business data, but we will take their approaches as examples.
Whilst each of these three models provides powerful and integrated analytics, they still have their drawbacks. They still require proprietary query languages, are constrained by the limits of an API, or are difficult for third-party applications to access. So, what is the alternative to the single pane of glass solutions?
Many leading thinkers in the observability space have already argued that observability is a data management problem – and this is actually a very valuable perspective. Data management, though, is not simply a matter of finding ways to reduce volumes, speed up querying or improve compression rates – although those are all valuable goals.
We can also think in terms of higher-level abstractions which open up the possibility of composable observability platforms. I think that such a framework would consist of the following abstractions:
Collection – the instrumentation and pipelines which gather, process and ship telemetry from applications and infrastructure.
Storage – an open, queryable backend in which telemetry (and, potentially, other business data) is persisted.
Analytics – the services which query the stored data to provide visualisation, diagnostics and insights.
Rather than thinking in terms of a single pane of glass or a monolithic system, we could think in terms of these three functional areas and build loosely coupled architectures to support them using a plurality of tools. This would, of course, be dependent on the evolution of a standardised set of abstractions, interfaces and hooks. In this respect, Observability could follow the lead of the OpenBanking initiative – which has liberated financial data and opened the market up to a host of new vendors offering pluggable services.
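To make the idea of loosely coupled, composable tooling a little more concrete, here is a minimal Python sketch of what such standardised abstractions might look like. The interface names and signatures are entirely hypothetical – no such standard exists today – but they illustrate how any collector could feed any store, and any number of analytics services could plug in behind the same contract.

```python
# Hypothetical sketch of the three functional abstractions: collection,
# storage and analytics. Names and signatures are illustrative only.
from typing import Any, Iterable, Protocol


class TelemetryCollector(Protocol):
    """Gathers raw signals (logs, metrics, traces, profiles) from sources."""
    def collect(self) -> Iterable[dict[str, Any]]: ...


class TelemetryStore(Protocol):
    """An openly queryable backend, e.g. a ClickHouse-backed store."""
    def write(self, records: Iterable[dict[str, Any]]) -> None: ...
    def query(self, sql: str) -> list[dict[str, Any]]: ...


class AnalyticsService(Protocol):
    """Any third-party tool that consumes the store through a standard interface."""
    def analyse(self, store: TelemetryStore) -> dict[str, Any]: ...


def run_pipeline(collector: TelemetryCollector,
                 store: TelemetryStore,
                 services: list[AnalyticsService]) -> list[dict[str, Any]]:
    # Loose coupling: the pipeline does not care which concrete collector,
    # store or analytics tools are plugged in, only that they honour the
    # shared interfaces.
    store.write(collector.collect())
    return [service.analyse(store) for service in services]
```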
As we are aware, OpenTelemetry is playing a pivotal role in defining standards and specifications for interoperability in observability frameworks. The OpenTelemetry client SDKs and the OpenTelemetry Collector are viable open source tools that end-users can deploy to build their own pipelines. For users who do not want the overhead of managing the OTel Collector, there is also a plethora of ready-made solutions such as Fluentd and Mezmo.
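For illustration, instrumenting an application with the OpenTelemetry Python SDK and shipping spans to a Collector over OTLP looks roughly like this. The endpoint, service name and attributes are placeholders, and the snippet assumes the opentelemetry-sdk and opentelemetry-exporter-otlp packages are installed.

```python
# Minimal OpenTelemetry SDK sketch: emit one span over OTLP to a local
# Collector listening on the default gRPC port (4317).
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Identify the service emitting the telemetry ("checkout" is a placeholder).
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("example.instrumentation")
with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.id", "12345")  # illustrative attribute

provider.shutdown()  # flush the batch processor before exit
```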
The next two functional domains – storage and analytics – pose more of a challenge. What would an open and accessible backend data store – one which was easily queryable by third party tools – actually look like? One obvious answer is to store telemetry in a backend database such as ClickHouse – which can ingest at vast scale and query at almost unparalleled speed.
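As a purely hypothetical sketch, a spans table in ClickHouse might be defined along the following lines, here using the clickhouse-connect driver. The schema is illustrative only and does not reflect any particular vendor's actual layout.

```python
# Illustrative only: a hypothetical ClickHouse table for span data.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)

client.command("""
    CREATE TABLE IF NOT EXISTS otel_spans (
        trace_id      String,
        span_id       String,
        service_name  LowCardinality(String),
        span_name     String,
        start_time    DateTime64(9),
        duration_ns   UInt64,
        attributes    Map(String, String)
    )
    ENGINE = MergeTree
    ORDER BY (service_name, start_time)
""")
```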
Even though there are a number of open source systems which use ClickHouse as their backend, many customers may not wish to incur the overhead of supporting an enterprise database infrastructure. This does not have to be a deal-breaker though: systems such as Groundcover would appear to offer the best of both worlds – an accessible backend database, but without the maintenance overhead.
The Groundcover model is a radical departure from the standard approach to observability infrastructure. With Groundcover, the storage layer runs in the customer’s own environment. You can choose to manage the infrastructure yourself or, if you do not want to take on this overhead, you can have it remotely managed by Groundcover engineers, who will take care of guaranteeing the health of the system as well as managing patches and upgrades.
At first glance, the architecture in the above diagram may not appear to be as unified as those of Observe or Dynatrace – after all, the telemetry is spread across two separate data stores, with metrics being stored in a VictoriaMetrics store and other signals being stored in ClickHouse. This is probably not particularly problematic though, since direct correlations do not generally involve metrics – they are more likely to involve traces being correlated to logs. A major benefit of this architecture is that ClickHouse is an open source database and the telemetry it stores can be queried with SQL statements. There is no need to use a proprietary API or incur the learning curve of YAQL (Yet Another Query Language).
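Assuming a spans table and a logs table along the lines sketched above, a trace-to-log correlation becomes an ordinary SQL join – no proprietary API required. The table and column names remain hypothetical; real platforms define their own schemas.

```python
# A plain SQL join correlating slow spans with their logs by trace_id.
# Assumes a hypothetical otel_logs table alongside the otel_spans table above.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)

rows = client.query("""
    SELECT s.trace_id, s.service_name, s.duration_ns, l.severity, l.body
    FROM otel_spans AS s
    INNER JOIN otel_logs AS l ON l.trace_id = s.trace_id
    WHERE s.duration_ns > 1e9               -- spans slower than one second
      AND s.start_time > now() - INTERVAL 1 HOUR
    ORDER BY s.duration_ns DESC
    LIMIT 50
""")

for trace_id, service, duration_ns, severity, body in rows.result_rows:
    print(f"{service} {trace_id} {duration_ns / 1e6:.1f} ms [{severity}] {body}")
```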
This also opens up the possibility of building custom data lakes consisting of the Groundcover telemetry store as well as heterogeneous data streams – all stored in ClickHouse and accessible from ClickHouse views. Interestingly, Groundcover is not the only commercial platform offering this hosting model. Kloudmate is another full-stack system which utilises the OeC (OpenTelemetry/eBPF/ClickHouse) stack and which also provides an option for self-hosting of the backend infrastructure. If this kind of model becomes more widespread and reaches critical mass, then it could provide fertile ground for a whole ecosystem of providers leveraging these data stores to provide custom analytics services.
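As a hypothetical example of that data-lake pattern, a ClickHouse view could join telemetry with a separately ingested business table inside the same database. All of the table and column names below are invented for illustration.

```python
# Hypothetical 'data lake' view: joining telemetry with a business table
# (orders) held in the same ClickHouse instance. order_id is assumed to be
# a String so it can be matched against the span attribute map.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)

client.command("""
    CREATE VIEW IF NOT EXISTS order_latency AS
    SELECT
        o.order_id,
        o.customer_tier,
        s.service_name,
        s.duration_ns / 1e6 AS duration_ms
    FROM orders AS o
    INNER JOIN otel_spans AS s
        ON s.attributes['order.id'] = o.order_id
""")
```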
As we have stated previously, many companies are now re-imagining observability not just as a toolset but as a practice, one which transcends traditional monitoring and provides insights and visibility across multiple business domains. Chronosphere, for example, supports integrations with sources such as GitHub, CircleCI and LaunchDarkly – so that telemetry can be correlated with a range of different system change events.
The recent graduation of the CNCF CloudEvents specification means that there is now a widely agreed standard for ingestion and data exchange, and this is a great foundation to build upon. There are numerous other organisational data stores, such as SIEM, operational support and sales analytics, that could be fed into the unified backend and correlated with observability data.
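As a small illustration, the CloudEvents Python SDK can wrap a change event – say, a deployment notification – in the standard envelope before it is forwarded to an observability backend. The event type, source and payload below are invented for the example.

```python
# Wrapping a deployment notification in the CloudEvents envelope using the
# cloudevents Python SDK. The event type, source and payload are illustrative.
from cloudevents.http import CloudEvent, to_structured

attributes = {
    "type": "com.example.cicd.deployment.finished",   # hypothetical event type
    "source": "https://ci.example.com/pipelines/42",  # hypothetical source
}
data = {"service": "checkout", "version": "1.8.3", "status": "succeeded"}

event = CloudEvent(attributes, data)
headers, body = to_structured(event)   # HTTP headers + JSON body, ready to POST

print(headers["content-type"])
print(body.decode())
```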
In conclusion, we appear to be at an inflection point where many organisations (and some vendors) are moving away from conceptualising observability as a set of signals and instead thinking of it as a data-driven organisational practice, one which can ingest telemetry and heterogeneous business data and provide unified and correlated insights across multiple domains. The single pane of glass may work for many companies. For others though, it can become a glass of pain. For those companies, a more productive option may be the single source of truth – a universal data source which removes siloes and unleashes the potential of an open observability ecosystem.