Cause and Defect

Root Cause Analysis With Causely

Imagine two factories - one in London, one in Manchester. Each has a bell which rings at exactly midday to mark the end of the morning shift. Empirically, it is true that every time the bell in Manchester sounds, the workers in London knock off for lunch. Obviously, the first is not the cause of the second. For philosophers, though, the question is: how do we actually know this? What is the nature of causality and how can it be observed? In Wittgenstein's Poker, their incredibly entertaining tour of mid-20th-century philosophy, John Eidinow and David Edmonds relate how this seemingly frivolous question occupied the minds of thinkers such as Bertrand Russell and Richard Braithwaite in the Cambridge of the 1940s. At first glance, these arguments might seem arcane and esoteric, yet today they are still occupying researchers in the cutting-edge field of Causal AI.

This is because, whilst causality may seem obvious from a common-sense point of view, the actual mechanics can prove rather elusive. Causality seems to manifest itself only as a poltergeist: it is something that we infer from effects rather than observe directly. This is a problem that was addressed by the Scottish philosopher David Hume in the 18th century. In his book Causal Inference and Discovery in Python, Aleksander Molak summarises Hume's position as follows:

We only observe how the movement or appearance of Object A precedes the movement or appearance of Object B

If we experience such a succession a sufficient number of times, we’ll develop a feeling of expectation

This feeling of expectation is the essence of our concept of causality.

In this reading therefore, cause is not a phenomenon that we can see or experience at first hand. It is a cognitive construct that we apply on the basis of repeated experiences.

The Puzzle of Causality

The question of causation is a fundamental concern in all kinds of observability scenarios - not just in computing but in a wide range of domains such as engineering, transport, health and many others. However, if humans struggle to define a formal system of logic for establishing cause and effect, then how can we model it in computer systems? How, from the vast mass of individual data points ingested into an observability system, can a computer program construct deterministic relationships and state, categorically, that event A and event B have a causal relationship? The question becomes even more vexed in contemporary computer architectures where code is not confined to a single executable but is distributed across decoupled components.

The revolution in AI and the massive growth of the observability market have converged to create a fertile space for companies aiming to provide a solution to this problem. One of the early movers in this space is Causely, a company which recently raised $8.8m in seed funding to "deliver the IT industry's first causal AI platform". Before delving into Causely as a product, it might be worth taking a detour around the problem domain to gain some theoretical context.

The Problem With Correlation

AI is a term which is being used ever more frequently in product marketing. Often, what this boils down to is the ability to make associations and correlations on the basis of machine learning. Whilst this can produce incredibly useful results in many contexts, it does have its limitations. For purposes such as root cause analysis, these limitations become especially problematic.

If you have ever used a tool such as Dall-E, you cannot fail to be impressed at its ability to turn simple text commands into compelling images. The example below, however, shows that whilst Dall-E can generate a well-formed image, it has not "understood" the sense of the request. It has correctly associated words with images, but the assembled result seems to reflect the patterns of the training data rather than an intelligent adaptation to the request.

causal ai image

Correlation is generally regarded as prone to two major weaknesses. The first is that correlations, even though they may be statistically strong, can often be arbitrary. A classic example is that increases in incidents of drowning can be very strongly correlated with increases in the consumption of ice cream. Whilst we could speculate that increased sugar intake from ice cream consumption may have some undesirable metabolic effects and result in swimmers getting into difficulties, this is not a convincing explanation. The common factor underlying both of these trends is an increase in average temperature. As the weather gets hotter, people eat more ice cream and are also more likely to go swimming, so more drownings are likely to occur. Some people argue that such misleading associations are rare in the real world. The example below, however, is just one of a whole menagerie of spurious - and highly entertaining - associations curated on the Tyler Vigen web site.

causal ai image
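The ice cream and drowning example can be reproduced with a small simulation. The sketch below is purely illustrative: the variable names, coefficients and noise levels are invented, and temperature is assumed to be the only thing driving both series. It compares the raw correlation with the correlation that remains once a straight-line temperature effect has been removed from each series.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def residuals(ys, xs):
    """What is left of ys after subtracting a least-squares line fitted on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [y - (mean_y + slope * (x - mean_x)) for x, y in zip(xs, ys)]


# Invented data: temperature drives both series; neither affects the other.
rng = random.Random(0)
temperature = [rng.uniform(10, 35) for _ in range(2000)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
drownings = [0.5 * t + rng.gauss(0, 2) for t in temperature]

raw = pearson(ice_cream, drownings)
adjusted = pearson(residuals(ice_cream, temperature),
                   residuals(drownings, temperature))

print(f"raw correlation: {raw:.2f}")
print(f"correlation after removing temperature: {adjusted:.2f}")
```

In this toy setting the raw correlation is strong, while the adjusted figure is close to zero. Real data is rarely this tidy, since the confounder may be unknown or unmeasured.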

No Sense of Direction

The second major limitation of correlations is that they do not necessarily reveal the direction of a relationship. There is, for example, a correlation between depression and low Vitamin D consumption. What is not clear, however, is whether depression causes low Vitamin D consumption or vice versa. One of the most infamous abuses of this ambiguity over the direction of causality came from the statistician Ronald Fisher, in his notorious paper Cigarettes, Cancer and Statistics. In the 1950s a number of statisticians used Bayesian methods to produce a compelling case for the link between smoking and lung cancer. Fisher, who was both a heavy smoker and a paid consultant to the tobacco industry, launched a vigorous and sustained counter-attack in which he questioned whether it was actually cancer that caused smoking.

causal ai image

Cause Without Correlation

Interestingly, as the diagram below shows, it is also possible to have cause without statistical correlation. There are real-life examples of these kinds of V-shaped relationships between two variables, such as the effect of changes in temperature on the metabolism of animals. In this case, the value of 0 represents an optimal baseline. A deviation from this baseline, in either a positive or a negative direction, will result in the same upward curve of physiological stress. Under a linear measure such as the Pearson correlation coefficient, these two trends cancel each other out so that the correlation is low or zero.

causal ai image
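The cancelling-out effect is easy to check numerically. The following is a minimal sketch that assumes an idealised V-shaped response, with stress equal to the absolute deviation from the baseline. The variable names are illustrative only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Deviation from the optimal baseline, sampled symmetrically around 0.
deviation = [i / 10 for i in range(-50, 51)]
# Idealised V shape: stress is fully determined by the size of the deviation.
stress = [abs(d) for d in deviation]

r_full = pearson(deviation, stress)

# Looking at only one arm of the V reveals the dependence again.
warm_side = [d for d in deviation if d >= 0]
r_warm = pearson(warm_side, [abs(d) for d in warm_side])

print(f"correlation over the full range: {r_full:.3f}")
print(f"correlation on the warm side only: {r_warm:.3f}")
```

Stress here is completely determined by the deviation, yet the linear correlation over the full range is effectively zero. A low correlation therefore cannot be read as evidence that no causal relationship exists.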

Expert Knowledge

There are a number of approaches in the field of Causal AI which can help to overcome the limitations of the correlation approach. One of these is the use of expert knowledge - a term covering various types of knowledge that can help define or disambiguate causal relations between two or more variables. Depending on the context, expert knowledge might refer to many different sources including randomized controlled trials, historic evidence, empirical data or even the laws of physics.

The seminal paper High Speed and Robust Event Correlation by Shaula Yemini et al probably represents a classic example of the expert approach. The paper dates back to 1996 but its analysis seems to be as relevant as ever. Its main thesis revolves around how an apparently tiny fault within one component unfolds a kind of butterfly effect within a larger distributed network system. It highlights how a statistically insignificant fault in a network interface results in network packets being dropped. This, in turn, leads to a throttling of the TCP window size. This has the knock-on effect of slowing down database transactions so that locks start to occur. This then has cascading effects on all services using the database. A major disruption for end users therefore results from a 0.1% loss in the capacity of a T3 network link at several removes in the causality chain. This highlights two fundamental issues:

some problems are not observable where they originate
some problems can only be observed by their symptoms

These findings reinforce the need for root cause analyses to be embedded in frameworks of expert knowledge. As the authors state:

"to determine which events to monitor operational staff must be familiar with the operational parameters of each managed object"

The authors of the paper define a strategy which they refer to as 'coding'. This is not coding in the sense of writing a computer program. Instead, it involves identifying all of the particular symptoms of an exception and then grouping them together to create a unique profile. This kind of fingerprinting is a powerful and performant tool for identifying patterns within streams of telemetry data and mapping them back to known causes.
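As a rough illustration of the fingerprinting idea, the sketch below stores a set of expected symptoms for each known problem and picks the problem whose profile is closest to what has been observed. The problem names, the symptom names and the nearest-match rule are all assumptions made for this example. It is not the paper's algorithm, nor Causely's implementation.

```python
# Hypothetical codebook: each known problem maps to the symptoms it should produce.
CODEBOOK = {
    "faulty_network_interface": {
        "packet_loss",
        "tcp_window_shrink",
        "slow_db_transactions",
        "db_lock_contention",
    },
    "database_disk_full": {"db_write_errors", "slow_db_transactions"},
    "message_broker_out_of_memory": {"queue_backlog", "consumer_timeouts"},
}


def diagnose(observed, codebook=CODEBOOK):
    """Return (problem, distance) for the profile closest to the observed symptoms.

    Distance is the number of symptoms that differ between the two sets,
    so missing and spurious symptoms are both tolerated to a degree.
    """
    problem, profile = min(
        codebook.items(), key=lambda item: len(item[1] ^ observed)
    )
    return problem, len(profile ^ observed)


# One expected symptom (tcp_window_shrink) was never reported.
observed_symptoms = {"packet_loss", "slow_db_transactions", "db_lock_contention"}
best_match, distance = diagnose(observed_symptoms)
print(best_match, distance)
```

Matching on the nearest profile rather than requiring an exact match means a diagnosis can still be made when a symptom is lost or delayed, which matters when telemetry is noisy.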

Causely

Circling back to Causely, it is the application of these principles that underpins its root cause analysis capabilities. At the moment, you will need to book a demo to see the product in action. However, we can get a flavour of its functionality from videos posted on the Causely web site.

The first video we will look at covers how Causely identifies causal relationships from OpenTelemetry traces (although it also has integrations for eBPF and Istio to support service discovery). Although this does not require any instrumentation at the code level, it obviously assumes that you are collecting trace telemetry. In terms of configuration, all you need to do is export your traces to the Causely endpoint (in a real-world case, you would obviously not use the insecure TLS option).

causal ai image

Causely can detect problem conditions from your trace data and assemble a graph of the issue - it can then correlate this with its own models to identify root causes. By taking this approach it can achieve a higher level of determinism whilst needing less raw data and less compute power. What is interesting is that Causely can use its expert knowledge to define a potential or actual problem autonomously. It can then combine this intelligence with its knowledge of the service graph to predict effects on dependent services. In the example below it has identified an Inefficient Locking issue:

causal ai image

We can then drill down and see the components affected by this issue:

causal ai image

Many monitoring systems identify errors on the basis of testing binary states or predefined signals and codes - e.g. pinging an endpoint or capturing output to stderr. On the basis of these videos, Causely appears able to autonomously identify which values in a dataset may represent a problem state - it does not have to be 'trained' to search for predefined thresholds or ranges. If this is the case, then this is an advanced capability absent from most other products on the market.

Tracing Across Distributed Systems

The second video looks at a common scenario - asynchronous failures in a distributed services system. The screenshot below depicts the architecture for a sample application:

causal ai image

We can see that if there is a failure in the RabbitMQ service, it will cascade in multiple directions. At this point, many monitoring systems may start sending discrete streams of error messages for each of the affected services - but without linking them to a root cause. In Causely, these individual error conditions can be viewed in the Symptoms screen:

causal ai image

In the Defects screen, Causely will display a list of error conditions which are at the root of downstream issues:

causal ai image

We can see here that one of our defects is a memory failure in the RabbitMQ instance. If we now click on this we can view the causality chain of affected services:

causal ai image

This gives great visibility of the system-wide impact of a root defect. What is also really interesting here is that Causely does not restrict itself just to displaying outages or run-time errors; it is also able to identify downstream conditions such as latency issues. Another really powerful feature is Causely's ability to use its understanding of system relationships to perform what-if analysis. The screenshot below shows Causely's 'Potential Defects' screen.

causal ai image

What is fascinating here is that these are potential scenarios. These are faults that have not yet happened. For the purposes of preventive maintenance, though, we can now drill down and gain an understanding of the possible impacts:

causal ai image

This is a useful tool for preventive maintenance, SLO management and site reliability planning.
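Conceptually, both the causality chain and the what-if view can be thought of as walks over a service dependency graph. The sketch below is a simplified illustration of that idea. Apart from RabbitMQ, every service name is invented, and the breadth-first traversal is an assumption for the sake of example rather than a description of how Causely works internally.

```python
from collections import deque

# Hypothetical dependency map: each key lists the services that depend on it directly.
DEPENDENTS = {
    "rabbitmq": ["order-service", "notification-service"],
    "postgres": ["order-service"],
    "order-service": ["checkout-api"],
    "notification-service": [],
    "checkout-api": ["web-frontend"],
    "web-frontend": [],
}


def impacted_services(root, dependents=DEPENDENTS):
    """Breadth-first walk returning every service downstream of root.

    Services are listed in the order they are discovered, so nearer
    dependents appear before more distant ones.
    """
    seen = {root}
    order = []
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# What would be affected if the RabbitMQ instance failed?
blast_radius = impacted_services("rabbitmq")
print(blast_radius)
```

Running the same function against a component that has not failed gives a crude form of what-if analysis: it lists the services that would be exposed if that component were to develop a defect.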

Conclusion

Today's enterprise software systems can represent significant challenges when engineering teams need to tackle errors and outages. Applications tend to have more layers of abstraction, more pluralistic architectures and vastly more complex graphs. In these scenarios, failures tend to be not only more costly but also more difficult to locate. A tool that can apply expert knowledge and trace an error back through several system layers to its point of origin can be of immense value.

Causely is not intended to be a replacement for your existing observability stack. As an integration into your existing system however, it has the potential to massively reduce MTTR and enhance reliability.

Acknowledgements

I would like to express my gratitude to Andrew Mallaband for all of his help and advice in preparing this article. I would highly recommend getting in touch with him if you are looking for a guide in navigating this terrain.

References

Cigarettes, Cancer and Statistics by Ronald Fisher

High Speed and Robust Event Correlation by Shaula Alexander Yemini et al

Wittgenstein's Poker by David Edmonds and John Eidinow

Causely raises $8.8M in Seed funding Causely Press Release

Causal Inference and Discovery in Python by Aleksander Molak

Spurious Correlations by Tyler Vigen

Cracking the code of complex tracing data Causely web site video

Causely for asynchronous communication Causely web site video


From the web

Articles we like from observability web sites and blogs

It’s eBPF for Windows!
Scorpio Software blog Mar 21, 2025
It’s an announcement that might have seemed unthinkable not long ago, but the porting of the revolutionary eBPF technology to Windows is now a reality. The ability to bring safe programmability to the kernel has resulted in enormous gains in fields such as security, networking and observability for Linux hosts, so applying the same principle to the Windows ecosystem is obviously an attractive proposition. It is not, though, without its own difficulties. There were a lot of hurdles to overcome and, inevitably, given the differences in OS architecture, this is not a full-fidelity replica of the Linux implementation.

This possibly foundational article by Pavel Yosifovich guides you through the steps involved in boldly going where few have gone before and creating your first eBPF program for Windows. One paragraph in the article begins with the sentence “this is where things get a bit hairy“ - for some that will likely be a challenge rather than a deterrent. This may not be cooking up nuclear fusion in your bedroom, but it does feel pretty radical.
Inside The C++ Black Box
Elastic blog Mar 10, 2025
As well as rolling out their Open AI observability solution, Elastic have also been very active within the OpenTelemetry project. C++ has a reputation for being something of a fearsome foe for observability practitioners. In this article on the Elastic blog, Haidar Braimaanie dons his protective gear and attempts to tame the beast with a soothing dose of OpenTelemetry instrumentation.

Unlike languages built in frameworks such as .NET, C++ does not have a standardized runtime environment that supports dynamic instrumentation across all platforms and compilers. C++ also uses a variety of build systems such as Makefiles and CMake, so that implementing instrumentation can be difficult and error-prone. In the article, Haidar looks at adding OpenTelemetry support to a C++ application running on Ubuntu 22.04. He also includes sample code for instrumenting the project with database spans and then observing the application in APM.

After reading this article you may want to give the C++ developer in your life a hug.
Brendan Gregg - His Latest Flame
Brendan Gregg Blog Dec 19, 2024
Even if you are not familiar with the name of Brendan Gregg, you are almost certainly familiar with the fruits of his labours. Brendan is the creator of the Flame Graph - one of the most important and iconic visualisations in the observability toolkit. We featured the Flame Graph in our recent tribute to the work of UX designers in the observability arena - but you should also visit Brendan’s web site.

Brendan’s latest innovation is the AI Flame Chart. This is an evolution of the original flame graph and its ambitious aim is to help reduce the vast financial and environmental costs entailed in the use of LLMs. This means that whereas the original flame graph was focused on CPU cycles, the latest generation sets its sights on reducing GPU load. The article discusses the considerable complexities involved in mapping GPU programs back to their corresponding CPU stacks. The names of some of the instruction sets look intimidating to the uninitiated but the basic concept of the graph is quite simple - the wider the bar, the more resource it consumes.
System Initiative - IaC Reinvented!
System Initiative Blog Dec 19, 2024
If you have ever had to grapple with a 3,000 line Helm chart to deploy your observability infrastructure, you may be forgiven for thinking that there must be a better way to do this. Whilst YAML has a certain formal elegance, its syntax struggles to express the architectures and relationships embedded in highly complex systems.

Whilst Pulumi have tackled this problem by enabling the use of high level programming languages for IaC, System Initiative are taking a fundamentally more radical approach. Their goal is nothing other than completely reinventing IaC from the ground up. The blog article for the launch of the product is an incredibly ambitious statement of intent. The terms ‘game changer’ and ‘paradigm shift’ tend to be thrown around somewhat casually, but this might be a case where their usage is appropriate.

So, what are they proposing? Well, System Initiative is IaC without the code. It is a kind of digital canvas where you manipulate digital twins of your systems. Is the future here or is this the Platform Engineering equivalent of science fiction? Read the article and decide for yourself!
How Zomato Souped Up Their Metrics With VM
Zomato Blog Sep 14, 2024
Zomato is a restaurant aggregator and food delivery service that generates vast volumes of metrics. As their company grew, they adopted a Prometheus/Thanos-based architecture - running some 144 Prometheus servers. As metrics volumes continued to skyrocket, even this architecture started to creak and the Zomato SRE team began the search for an alternative solution.

In this article on the Zomato blog, the team discuss why they opted to migrate to VictoriaMetrics, as well as a number of features of the system which enable them to achieve better performance, lower costs and greater scalability.

The technical challenges were pretty daunting - the project involved migrating over 800 dashboards, 300 microservices and 2.2 billion active time series. We would commend this article not just for its technical insights but also for taking a warts-and-all approach in documenting some of the technical limitations of the VM solution.
Obirdability - Fowl Play With Grafana!!
Grafana Blog Jul 29, 2024
Grafana dashboards have been put to all sorts of uses over the years - for everything from space missions to monitoring milk production. In this fun but highly informative article Ivana Huckova and Sven Grossman walk us through building an observability system for bird song. Whilst this might sound slightly quirky, the techniques could be applied to all manner of applications which need to record and analyse audio inputs.

The article is a great showcase for a number of Grafana capabilities - including installing Alloy on a Raspberry Pi and adding context to dashboard data by dynamically querying sources such as Wikipedia and the Open-Meteo weather information service.
Internal Observability at Uber
Uber Blog Jun 10, 2024
Stories about Uber architecture always seem to be interesting, not least because they always involve technology at huge scale - such as this trillion record migration from DynamoDB. This article, however, is actually interesting on a number of levels. As well as being of technical interest, it also provides some fascinating insight into internal team topologies and management processes - which are also fundamentally important aspects of managing observability at scale. Whilst most organisations will only operate at a fraction of Uber’s scale, every organisation is seeking to minimise costs and improve service to users, and the article provides a number of insights which would be of interest to most observability practitioners.
Observability Principles for ML Models
Datadog Blog May 16, 2024
A survey carried out by McKinsey in 2021 found that 57% of respondents were already using Machine Learning to support at least one business function. ML is no longer a niche concern but is becoming a core component of development and CI/CD practices. As this post from the Datadog blog notes, the efficacy of ML models will inevitably degrade over time, so monitoring their performance and reliability is critical. The article really drives home the point that ML is a domain with its own specific behaviours, and effective monitoring requires building out new processes, metrics and even infrastructure to cover issues such as Data Drift, Prediction Drift and Concept Drift. Whilst the article does use some specialist terms, it is a highly readable and practical guide to the subject of ML monitoring.
Observing Observe with Observe
Observe Apr 13, 2024
It sounds like it could be a sub-plot in the film Inception, but this is a really interesting article from the Observe blog on how they use an instance of their Observe system to monitor their Observe cloud platform. Observe not only have to support fast reads for complex user queries, they also have to support ingesting one petabyte of telemetry per day. As you can see from the above diagram, Kafka and Snowflake form two of the pillars of the backend architecture. This three-part series offers a fascinating insight into Observe’s own internal observability strategy as well as being a great exemplar of the eat your own dog food principle. This is an article which is of great value to anybody with an interest in large-scale observability architectures.
The $1m Line of Code
InfoQ Apr 5, 2024
Most of us have experienced the anguish of bill shock at some point. Being hit with a huge bill for mobile roaming charges on return from your holiday or getting a penalty notice for an inadvertent motoring infringement that happened weeks back. Those are just small pinpricks though, compared to the 50,000 volts of financial burn felt by companies mentioned in this transcript of a scintillating talk by Erik Peterson, CEO of CloudZero. He argues, persuasively, that engineering decisions are buying decisions. In the case mentioned in the headline, a decision to turn on one section of debug code led to vast volumes of logs being emitted and racking up over $1m in costs.
An Engineer's Personal Retrospective
CEP Mar 9, 2024
This is a really engaging blog post by Infrastructure Engineer Jack Lindamood, where he reviews nearly every infrastructure decision he made over four years working at a start-up. Each choice is graded with a Regret, Endorse or an occasional Unsure. Whilst not explicitly observability-related, it will, however, have resonance for any engineer forced to make technological choices (which is probably all of us). The article contains much distilled wisdom and some strong opinions, as well as general observations on the challenges and trade-offs faced by infrastructure engineers.
Finding relationships in your data with embeddings
Medium Feb 8, 2024
The RAG pattern has really gained traction over the past year as it allows enterprises to leverage the power of LLMs to gain insights into their own data. This is a fascinating (and occasionally technical) article which details how Incident IO used vector embeddings to mine through their data and discover related incidents. The article explains the techniques involved with great clarity and provides really helpful advice on creating embeddings to find hidden patterns in your own data.
How Chick-fil-A Run 2,800 Edge Clusters
Medium Dec 29, 2023
When you think of large corporations pushing the technology envelope, Chick-fil-A might not be the first name to come to mind. However, the highly distributed nature of their infrastructure presents massive observability challenges, which they have met with some very impressive engineering. The scale of their task is daunting - 2,800 Edge Kubernetes clusters, tens of thousands of IoT devices and billions of MQTT messages each month. This is a really fascinating article on managing IoT observability at scale.
Production-Ready Observability Platform for AI Systems
Medium Nov 3, 2023
In this blog article, Bijit Ghosh of Deutsche Bank discusses best practices for observability across the full AI system lifecycle. He composes a custom system which knits together a range of technologies including structlog, Flask, Prometheus and Kibana as well as AI-specific tools such as MLFlow and CausalML. It’s a comprehensive article which exhibits a clear understanding of both observability and AI technologies.
Infrastructure Monitoring with the TIG Stack
CNCF Blog Sept 21, 2023
A great example of managing the complexities of Observability engineering. Jay Taylor from InfluxDB builds out a solution using the Telegraf, InfluxDB, Grafana stack.
Deploying a Kubernetes monitoring stack
rtfm July 23, 2023
An in-depth look at monitoring K8S with the increasingly popular VictoriaMetrics platform. This follows an end-to-end process from crafting your own Helm chart to configuring alert rules.
"You're overpaying for OpenTelemetry's verbosity"
rtfm Oct 10, 2023
This has really raised a few eyebrows. A forensic analysis by Nikolay Sivko of coroot on how just a few OpenTel meta tags can potentially explode your ingestion fees.