There has been some interesting discussion recently as to whether observability practice is over-reliant on dashboards - to the exclusion of providing genuine practical value to users. This touches on a really interesting point about the design and functionality of observability systems. On the one hand, it is hard to envisage any observability system not making use of dashboards. On the other hand, it is probably correct to point out that the spirit of observability is one which is proactive and on the front foot, whereas the classic dashboard represents a more static and reactive posture.
Whilst dashboarditis is a danger, the creative use of visualisations is not in itself a problem. This is because observability is, in a sense, a visual discipline. To an extent, it has to be. Observability is extremely data-heavy, and we need visual tools for summarising patterns in the billions of data points that our systems collect. The image below, for example, represents the flows across the 1,600 microservices used by Monzo bank.
Creating UIs to handle this volume of data and this level of complexity is a daunting task. In fact, the need to create visualisations and metaphors which capture huge streams of telemetry and translate them into an easily intelligible form is inherent in the nature of observability. The gauges, meters and charts we use have their origins in the industrial roots of the practice.
Observability is sometimes described as, fundamentally, a data engineering discipline. In his 2009 book Information Is Beautiful, David McCandless explored how designers across many fields have captured the underlying patterns in raw data and translated them into information in the form of compelling visualisations.
Creating these visualisations is, itself, a craft requiring enormous skill and an understanding of design, composition, colour and the principles of creating an engaging User Experience. In this article we peel off from some of the usual themes of observability commentary and turn the spotlight on to the work of some of the designers in the field, whose skill and creativity are as critical to the success of an observability system as the work of those who write the code.
One of the great fallacies about design is that it is just a skin or a surface applied to the machinery created by developers. In fact, design is a discipline that considers the system in a much broader sense and good UX design is inseparable from good system design.
Every designer in the observability field understands that the tooling must be able to go beyond merely painting pictures of anomalies and outages. It must also be a tool to assist engineers in identifying root causes and taking remedial action. This, then, is the two-fold challenge for the UI/UX designer of an observability system. Firstly, vast amounts of data must be assembled into easily understandable patterns. Secondly, UX engineers must construct pathways through this data that lead from visualisation to action and productivity.
Naturally, as the dark mode vs light mode debate shows, opinions on 'good' UI are largely subjective and depend on individual user tastes and preferences. What follows therefore is, inevitably, a personal take on design and User Experience in a particular set of tools. There is an incredible amount of sophisticated design in observability tooling and we can really only cover a small sample in this article.
It might be appropriate to start our tour of Observability design with Grafana - as visualisation is embedded not only in the company name but also in their DNA. Grafana dashboards are a ubiquitous sight in observability, and they are renowned for their breadth as well as their power and ease of use. Their dashboards have been used for everything from managing space missions to analysing milk production on a small Swiss farm. The creativity and ingenuity of the Grafana community is celebrated in the annual Grot awards, which recognise excellence across a number of categories.
The popularity of Grafana has not come about by accident. As this article by Nilson Gaspar illustrates, the familiar grace and simplicity of the Grafana UI is the product of a rigorous and painstaking design process. The article describes the process behind the design for the Grafana Frontend Observability product. It is a fascinating insight into the role of the designer and how it involves continuous interaction with engineers, as well as an awareness of business outcomes. Nilson discusses how he drew inspiration not just from common observability motifs but also from sources such as video games and programming IDEs. He also had to familiarise himself with the underlying technologies and delve into both OpenTelemetry and the Grafana Faro Web SDK.
One of the most widely admired visual exploration tools in the whole of the observability market is probably the Honeycomb BubbleUp feature. This is a unique, powerful and intuitive tool for digging into your data to find correlations and resolve anomalies. One of the most striking aspects of the tooling is the simplicity of the UX. To activate the feature, a user just draws an arbitrary box around the data points on a heatmap. The BubbleUp feature then generates a series of histograms comparing the selected dataset with the baseline - i.e. all of the remaining events.
Like all of the best visual tooling, the purpose of BubbleUp is not simply to provide some cool UI. It serves a serious purpose in helping users discover anomalies amidst the enormous volumes of high cardinality data ingested by modern observability systems. It is also a tool which democratises the investigative process - you do not have to be a PromQL guru to delve into your data and find patterns and relationships.
The BubbleUp design was created by Danyel Fisher - a UX and data visualisation specialist. As Martin Thwaites, Principal Developer Advocate at Honeycomb, told us, part of Danyel's brief was to design a feature which would not just deliver a wow factor, but also tap into the rich dimensionality of the Honeycomb datastore. As with most great UX implementations, the feature did not just pop into Danyel's head fully formed. It was the result of "working closely with our target users to understand their needs, iterating on the design, and then tracking the use of the tool over time."
This approach involved gaining a deep understanding of how users went about investigating anomalies. They defined the typical debugging approach as the "core analysis loop", whereby a user would select a base metric for an anomaly and then iteratively compare it with other metrics in the hope of discovering some kind of causal relationship. Given that Honeycomb is a high dimensionality system, this could end up being a lengthy and repetitive process.
The purpose of BubbleUp is to shorten that loop. However, given that Honeycomb can capture so many dimensions, the next task was to identify which were the best candidates for comparison. This involved further research and beta testing with customers, which led to a number of refinements, including changing the colour palette from blue/green to blue/yellow, the evaluation of ranking algorithms and the introduction of donut charts to represent the number of non-empty values in a dataset. Overall, the juxtaposition of the selected dataset with the array of baseline histograms serves as an illustration of Edward Tufte's principle that "Reality is multivariate".
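The idea of contrasting a selection with its baseline can be illustrated with a minimal Python sketch. This is not Honeycomb's implementation: the event fields, the helper names `value_shares` and `compare_selection`, and the use of total variation distance as a ranking score are all assumptions made purely for illustration.

```python
from collections import Counter


def value_shares(events, field):
    """Return the share of events carrying each value of `field`."""
    counts = Counter(e[field] for e in events if field in e)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()} if total else {}


def compare_selection(events, is_selected, fields):
    """Split events into selection and baseline, then score each field
    by how much its value distribution differs between the two groups."""
    selected = [e for e in events if is_selected(e)]
    baseline = [e for e in events if not is_selected(e)]
    report = {}
    for field in fields:
        sel = value_shares(selected, field)
        base = value_shares(baseline, field)
        # Hypothetical score: total variation distance
        # (0 = identical distributions, 1 = completely disjoint).
        score = 0.5 * sum(
            abs(sel.get(v, 0.0) - base.get(v, 0.0)) for v in set(sel) | set(base)
        )
        report[field] = {"selected": sel, "baseline": base, "score": score}
    return report


# Made-up events: the slow requests stand in for the box drawn on the heatmap.
events = [
    {"duration_ms": 900, "region": "eu", "endpoint": "/pay"},
    {"duration_ms": 950, "region": "eu", "endpoint": "/pay"},
    {"duration_ms": 40, "region": "us", "endpoint": "/pay"},
    {"duration_ms": 35, "region": "eu", "endpoint": "/home"},
    {"duration_ms": 50, "region": "us", "endpoint": "/home"},
    {"duration_ms": 45, "region": "us", "endpoint": "/pay"},
]

report = compare_selection(
    events, lambda e: e["duration_ms"] > 500, ["region", "endpoint"]
)
# Fields whose distributions differ most between the groups come first.
ranked = sorted(report, key=lambda f: report[f]["score"], reverse=True)
```

In this toy dataset `region` ranks above `endpoint`, because every slow request comes from one region whereas the baseline is spread across both. A real tool would plot the two distributions side by side as histograms rather than reduce them to a single number.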
As we have mentioned, many of the visual metaphors of observability are inherited from the discipline's industrial origins. However, observability engineers have also had to create new kinds of visualisations to meet the challenges of representing the special behaviours of modern, distributed computer systems. Possibly the best known of these is the Flame Graph, which was devised by Brendan Gregg whilst working on a MySQL issue where he "needed to understand CPU usage quickly and in depth". The primary diagnostic tool for this kind of issue is a CPU profiler, which takes samples of stack traces. Unfortunately, for a large sample this produces virtually unreadable "walls of text".
To make the problem manageable, Gregg created a prototype of a visualization that leveraged the hierarchical nature of stack traces to combine 'common paths'. Since the visualization explained why the CPUs were "hot" (busy), he chose a warm palette. With the warm colours and flame-like shapes, these visualizations became known as flame graphs. The Y-axis of the graph shows the stack, with the root call at the bottom. The X-axis does not represent the passage of time. Instead, it spans the sample population, so the width of each horizontal function box represents how frequently that function appeared in the sampled stacks - not when it ran.
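The merging of common paths can be sketched in a few lines of Python. This is a simplified illustration and not Gregg's FlameGraph code: the function names `build_flame_tree` and `layout`, the dictionary-based tree and the sample stacks are all hypothetical, and sorting sibling frames alphabetically is just one simple ordering that lets identical frames sit together.

```python
def build_flame_tree(samples):
    """Fold stack samples (root call first) into a tree of frame counts."""
    root = {"name": "all", "count": 0, "children": {}}
    for stack in samples:
        root["count"] += 1
        node = root
        for frame in stack:
            # Identical frames under the same parent share one node,
            # which is what merges common paths.
            child = node["children"].setdefault(
                frame, {"name": frame, "count": 0, "children": {}}
            )
            child["count"] += 1
            node = child
    return root


def layout(node, x=0, depth=0, out=None):
    """Assign each frame a rectangle: (name, depth, x offset, width).

    Width is the number of samples containing the frame. The x offset has
    no time meaning; siblings are simply placed in alphabetical order.
    """
    if out is None:
        out = []
    out.append((node["name"], depth, x, node["count"]))
    child_x = x
    for name in sorted(node["children"]):
        child = node["children"][name]
        layout(child, child_x, depth + 1, out)
        child_x += child["count"]
    return out


# Four made-up stack samples, each listed from the root call outwards.
samples = [
    ["main", "parse", "read"],
    ["main", "query", "join"],
    ["main", "query", "join"],
    ["main", "query", "sort"],
]

rects = layout(build_flame_tree(samples))
```

Here `query` ends up three samples wide and `join` two, regardless of the order in which the samples were captured. Drawing each tuple as a box, with depth on the Y-axis, gives the familiar flame shape.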
The final form of the Flame Graph was not arrived at without a substantial amount of trial and error. As Gregg points out in his article, he originally tried using visualisations created by Neelakanth Nadgir and Roch Bourbonnais - which used the X-axis to represent the passage of time. Brendan removed the time element and "reordered samples to maximize frame merging", to create the result we see below.
Brendan has published quite a bit of documentation around flame graphs and their evolution and uses. In this presentation on SlideShare, he even shares how you can roll your own flame graphs using the Linux DTrace tool and his own FlameGraph library.
The market is awash with network monitoring products, and it takes a lot to stand out from the crowd. One product which certainly does make an impression is IBM's SevOne. The quality of SevOne's design is not just a nice-to-have. It is actually a functional necessity. As Product Manager Brendan Schimmel put it, SevOne is designed to monitor the largest and most complex networks on the planet. This means ingesting huge amounts of telemetry. As Brendan notes though, anybody can fill up a hard drive. The real challenge is building systems that enable users to easily assimilate that data and act upon it.
Interestingly, Brendan is a Network Engineer by background. As the design lead for SevOne though, he also displays both a deep understanding of UX and an obsessive attention to detail. At the heart of his design thinking is the “Don’t Make Me Think” philosophy. This translates into a UI where the goal is to make the User Experience as frictionless and productive as possible.
Tools such as SevOne are not merely intended to visualise the state of a network or a cluster or a device. They must also create pathways that users can navigate through as effortlessly as possible. Structuring those flows and funnelling the user to the exact view they need is an art as well as a science and is the result of continuous iteration and re-working. One of the aspects of SevOne that we were most impressed by is its ability to summarise and aggregate whilst also retaining the full context of the underlying data. This means the system has extremely powerful capabilities both for drilling down and for finding rich correlations at each level of analysis. It also has a tremendous array of ease-of-use features such as Chains, report versioning and even Git-style stashing, where you can temporarily save a piece of work and come back to it later.
As we noted recently in our newsletter feature on SevOne, the aesthetics of the product are uniformly gorgeous, so we will just alight on a couple of examples where there is a compelling fusion of creative form with technical function. Classically, patterns for metrics such as CPU usage are expressed as line graphs. This can be a useful visual aid for spotting peaks and troughs and anomalies. The graphic below however is a great example of repackaging that data in a form that immediately highlights hotspots using a calendar format so that temporal patterns are really brought to the forefront.
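The aggregation behind a calendar-style view can be sketched roughly as follows. This is only an assumption about how such a view might be fed, not a description of how SevOne works: the `calendar_grid` helper, the weekday-by-hour bucketing and the sample readings are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def calendar_grid(samples):
    """Average (timestamp, value) samples into a weekday x hour grid."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in samples:
        key = (ts.weekday(), ts.hour)  # weekday(): Monday is 0
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Made-up CPU usage readings (percent) across two weeks.
samples = [
    (datetime(2024, 1, 1, 9, 0), 80.0),    # Monday
    (datetime(2024, 1, 1, 9, 30), 90.0),   # Monday
    (datetime(2024, 1, 2, 9, 0), 20.0),    # Tuesday
    (datetime(2024, 1, 8, 9, 15), 100.0),  # the following Monday
]

grid = calendar_grid(samples)
# The hottest cell is the one a calendar view would colour most strongly.
hotspot = max(grid, key=grid.get)
```

Bucketing by weekday and hour means a recurring Monday-morning spike lands in a single cell, whereas on a line graph the same pattern would be scattered across separate peaks a week apart.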
The Sankey Diagram is a common method of visualising flows of network traffic. Interestingly, its history goes all the way back to the 19th century, when it was devised by the Irish engineer and army captain Matthew Henry Phineas Riall Sankey as a means of depicting the energy efficiency of a steam engine. The image below is a reproduction of the first Sankey diagram from 1898:
The image below is a detail of a Sankey Diagram in SevOne - and as you can see, the state of the art has moved on somewhat. The modern incarnation is virtually unrecognisable from its predecessor.
Sankey Diagrams are used widely in network monitoring tools, but the designers at SevOne have displayed an attention to detail which really elevates the user experience. The font on the labels of the stage markers is sharp. There is a tremendous balance in the muted colours of the waves and the vertical bars. The swerve of the lines is elegant and the detailing is fine - when the strands overlap, the shading changes to create a subtle contrast. When you look at the Sankey diagram as a whole, you get a clear summary of the overall flows. When you zoom in, you still see details with great clarity and without any strain on the eye.
If observability were a country, one of its national sports would probably be Datadog Bashing. Like Apple, they seem to have a polarising effect. Commercially, they are highly successful, yet they are also demonised as a kind of Death Star in the observability galaxy. As anybody who has ever used the product will know though, the User Experience is superlative. As soon as the system loads, you somehow feel as if you are in a comfortable and familiar space and navigating the UI seems fluid and pleasant.
Analysing logs is a bread-and-butter task for any observability specialist. Whilst many systems provide powerful query languages, Datadog also provides a number of intuitive visual tools for easily generating views of log data:
The essence of this screen is not that it is spectacular, but it is an outstanding example of good UX practice. It is hierarchically structured and task-oriented and provides a wide array of functions without overwhelming the user. This is achieved by expert use of colours and contrast and detailed planning in layout and ordering of the controls on the screen. Sometimes the art of UI lies in making the artistry invisible.
As we have seen in the images returned by the Hubble telescope, nature has a way of arranging itself in incredibly beautiful and elaborate patterns. Not everything in nature is a bell curve - spikes, surges, spirals and explosions abound. ADSB Exposed is a website that allows users to create stunning interactive maps from real-time flight data. As the image below shows, distributions of data from our own human activity can create spectacular patterns reminiscent of that amazing Hubble imagery.
The image below evokes the incredible energy and vibrancy of an abstract expressionist work but is, in fact, a representation of flight data taken from an amazing gallery of visualisations featured on the ClickHouse website:
It is easy to see similarities in the above image with the abstract beauty of works by artists such as Gerhard Richter and Danny Giesbers.
The observability marketplace today is extremely competitive. Many vendors compete on criteria such as price or the ability to ingest vast volumes of data. Ultimately, observability is about making sense of how our systems are running. Being able to ingest data inexpensively is great - but it has little value for businesses if the system cannot harness that data for actionable insights and intuitive workflows. Ingesting petabytes of data does not help if we cannot sift through that data and identify issues quickly and easily. The tooling that enables this ease of use is brought to us by skilled professionals who are deeply immersed in the principles of design and communication but also capable of understanding the fundamentals of observability technology and the needs of end users.