The Use of eBPF in Observability Today

Part Two

In the first part of this article we looked at eBPF from a general point of view and discussed some of its overall impacts on observability practice and implementations. In this second part, we will look at some of the specific technical and functional aspects of the use of eBPF in a number of leading products.

Cilium

Many people might regard Cilium as a networking or network security tool, but it is also, in both the literal and the practical sense of the word, an observability tool. For any SRE, DevOps or Cloud Engineer who really wants to understand the state of connectivity and traffic flows in their K8S cluster, it is a godsend. It is not an observability stack and it does not process telemetry. It is, however, a very useful part of any observability toolkit. Once it is installed on a K8S cluster, you can gain deep and granular insights by running some very simple commands in the Cilium CLI. You can easily list all of the endpoints in a K8S network and view details such as status, IP address, MAC address and node.

Cilium networking overview

You can also easily run tests which probe different aspects of network connectivity, including intra-node, inter-node, and external requests. The connectivity tests also validate DNS resolution and HTTP/HTTPS requests, which can be a vital aid when debugging K8S connectivity issues.
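To give a flavour of what a connectivity check involves, here is a minimal Python sketch of a probe that first resolves a hostname and then attempts a TCP connection. It is not Cilium's implementation and it does not use eBPF. The function name `probe` and the shape of the result dictionary are our own assumptions, chosen purely for illustration.

```python
import socket


def probe(host, port, timeout=2.0):
    """Check DNS resolution and TCP reachability for host:port.

    Hypothetical helper for illustration only; not part of any Cilium API.
    """
    result = {"dns_ok": False, "tcp_ok": False, "addresses": []}
    try:
        # Step 1: name resolution (numeric addresses resolve without a DNS query).
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return result
    result["dns_ok"] = True
    result["addresses"] = sorted({info[4][0] for info in infos})

    # Step 2: try each resolved address until one accepts a TCP connection.
    for family, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            result["tcp_ok"] = True
            break
        except OSError:
            continue
    return result
```

Separating the DNS step from the TCP step mirrors the way a connectivity test narrows a fault down: a failure at the first step points at name resolution, while a failure at the second points at routing, policy or the target service itself.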

Cilium also ships with a module called Hubble which, as its name suggests, is the observability component of the solution. Hubble allows users to visualize and monitor network flows in real time, providing detailed information about the source, destination, and type of traffic flowing through the cluster. It also surfaces security-related events, such as policy decisions, allowing users to see which network policies are being enforced and how they affect traffic. This is essential for security auditing and compliance. If you would like to find out more about setting up and running Cilium and Hubble then have a read of our recent article.

Groundcover

Groundcover is a tool which we mentioned recently in our Beyond The Single Pane of Glass article. The Groundcover model is quite a radical departure from the standard approach to observability infrastructure. With Groundcover, the storage layer runs in the customer’s own environment. You can choose to manage the infrastructure yourself or, if you do not want to take on this overhead, you can have it remotely managed by Groundcover engineers, who will take care of monitoring the system as well as managing patches and upgrades.

The diagram below gives us a useful starting point for understanding the implementation of eBPF in the Groundcover system.

Groundcover architecture

Groundcover uses a proprietary eBPF sensor to provide visibility not only for traces and metrics but also for logs. This goes one step further than many other solutions, which use eBPF for capturing metrics and traces and then use a tool such as the oTel collector for logs. Naturally, this approach to log capture assumes that the instrumented services are sending their logs to the standard output channels.

Even though many vendors in the marketplace are leveraging eBPF, expertise can vary widely. Some vendors build on top of existing SDKs whereas others go down the road of building their own sensors. Groundcover are in the latter camp and recently developed Flora, an open source eBPF observability agent which is the engine for Groundcover's eBPF functionality and is built with a focus on performance and scalability.

The engineers behind Groundcover say that their sensor takes an 'opinionated' approach. Rather than just being a passive pipeline, it will make decisions on cardinality and create aggregations which streamline telemetry flows and reduce storage volumes (and hence costs). Processing, correlating and filtering telemetry as it is being streamed is a powerful capability and adds particular value when ingesting from sources such as the K8S metrics server, which can generate large volumes of largely redundant data.
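To illustrate the general idea of reducing cardinality while telemetry is in flight, here is a minimal Python sketch. It is not Groundcover's code. The event shape, the function names and the choice of aggregation keys are all our own assumptions. It collapses raw request events into a small number of buckets by replacing identifier-like path segments with a placeholder and grouping status codes into classes.

```python
import re
from collections import defaultdict

# Matches a path segment that is purely numeric or looks like a UUID.
_ID_SEGMENT = re.compile(r"/(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F-]{27})(?=/|$)")


def normalise_path(path):
    """Replace numeric or UUID-like path segments with the placeholder {id}."""
    return _ID_SEGMENT.sub("/{id}", path)


def aggregate(events):
    """Collapse raw events into per-(service, route, status class) totals.

    Each event is a hypothetical tuple: (service, http_path, status_code, latency_ms).
    """
    buckets = defaultdict(lambda: {"count": 0, "latency_ms_sum": 0.0})
    for service, path, status, latency_ms in events:
        key = (service, normalise_path(path), f"{status // 100}xx")
        buckets[key]["count"] += 1
        buckets[key]["latency_ms_sum"] += latency_ms
    return dict(buckets)
```

With this kind of pre-aggregation, a million requests to `/orders/<some id>` become a handful of series rather than a million distinct label values. The trade-off is that per-request detail is discarded before storage, which is exactly the sort of decision an 'opinionated' sensor makes on the user's behalf.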

Scalability is vital for Groundcover, as it positions itself as an observability platform for solutions that run in environments with thousands of nodes. Groundcover CEO Shahar Azulay is a passionate advocate for the value of eBPF monitoring solutions, seeing them as central to a 'revolution' that will provide cost-effective solutions for log management, metrics and full tracing.

Odigos

Odigos is an open-source project that uses eBPF to generate distributed traces in OpenTelemetry format, with support for over 30 observability backends. The team is heavily committed to OpenTelemetry, and the oTel project's (eBPF-based) auto-instrumentation for Go applications was developed by engineers at Odigos.

Like many eBPF solution providers, Odigos emphasise speed and simplicity of onboarding compared to SDK-based products. They claim that you can be up and running on their platform in a matter of minutes. Onboarding is a simple, two-step process which consists of selecting an application and then a backend (Odigos does not have its own storage - instead you send telemetry to the backend of your choice).

The Odigos model

Odigos supports a number of languages including Java, Python, .NET, Node.js, and, of course, Go. Like other tools, it supports metrics and logs, but the developers behind the product really emphasise its handling of distributed traces as a stand-out feature. Propagating context across services can be fiendishly tricky and Odigos claim to be the only solution on the market capable of "100% tracing parity". This is achieved through various eBPF mechanisms that securely and reliably inject the relevant headers. One such mechanism involves tracking functions and structs across different versions of open-source libraries.
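Tracking structs across library versions matters because an eBPF probe reads memory at fixed offsets, and the layout of a struct can change between releases. The Python sketch below shows one way such a lookup could work. It is not Odigos code. The module name, the field name and every offset value are made up for illustration, and the table structure is our own assumption.

```python
from bisect import bisect_right

# Hypothetical table: for each (module, struct field), a sorted list of
# (first version where the offset applies, byte offset). Values are invented.
OFFSETS = {
    ("example.com/httplib", "Request.Header"): [
        ((1, 0, 0), 48),
        ((1, 4, 0), 56),
        ((2, 0, 0), 64),
    ],
}


def parse_version(text):
    """Turn 'v1.4.0' or '1.4.0' into a comparable tuple such as (1, 4, 0)."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def field_offset(module, field, version):
    """Return the offset recorded for the given version, or None if unknown."""
    entries = OFFSETS.get((module, field))
    if not entries:
        return None
    starts = [start for start, _offset in entries]
    # Pick the last entry whose starting version is <= the requested version.
    index = bisect_right(starts, parse_version(version)) - 1
    return entries[index][1] if index >= 0 else None
```

For example, `field_offset("example.com/httplib", "Request.Header", "v1.2.3")` returns 48 with this table, while a version older than anything recorded returns `None`, which a real agent would treat as a signal not to instrument that process.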

One other interesting feature of Odigos is its technique for language detection. This is obviously a challenge for eBPF applications as they are running in the kernel, rather than being attached to the executable itself. Additionally, the act of detection can itself create latency. Initially, the Odigos Instrumentor would spin up an ephemeral language detection pod. This would inspect code running in an application pod, report the language back to the Instrumentor and then be terminated. As well as creating a performance overhead, this also required elevated privileges and could even result in OutOfPods errors. Languages are now resolved via a virtual device running on each cluster node. This improves performance and scalability by removing the overhead of creating and deleting pods. For Odigos, this is an important consideration when running in environments with thousands of microservices.

Beyla

Beyla is an open-source eBPF monitoring tool unveiled by Grafana at last year's ObservabilityCON event in London. In common with the products we have already covered, it enables automatic instrumentation for applications written in a number of languages including Go, C/C++, Rust, Python, Ruby, Java, Node.js, .NET, and more. At present it will capture metrics and traces, but not logs.

Conceptual overview of Beyla

In our discussions with vendors, a recurrent theme is the complexity involved in capturing traces using eBPF - particularly in terms of propagating trace IDs to upstream services. Grafana Labs takes a two-pronged approach to addressing this problem, combining automatic header injection with black-box context propagation. Automatic header injection involves using eBPF probes to read the incoming trace information and then injecting the trace header into outgoing calls. Whilst this capture-and-inject method is conceptually simple and self-contained, a current limitation is that it only works with the Go programming language. Black-box context propagation, on the other hand, requires a more stateful and orchestrated solution, but it will work with any programming language.
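The capture-and-inject idea can be sketched in a few lines of ordinary Python. This is an illustration of the logic only, not Beyla's eBPF code. We assume here that the trace header follows the W3C `traceparent` layout (version, trace ID, parent span ID, flags) and that header names have already been lower-cased. The function name `outgoing_headers` is hypothetical.

```python
import re
import secrets

# traceparent layout: version-traceid-parentid-flags, all lower-case hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def outgoing_headers(incoming_headers):
    """Build headers for an outgoing call that continue the incoming trace.

    Hypothetical helper: assumes header names are already lower-cased.
    """
    match = _TRACEPARENT.match(incoming_headers.get("traceparent", ""))
    if match:
        # Keep the trace ID and flags; the caller's span ID is replaced below.
        _version, trace_id, _parent_id, flags = match.groups()
    else:
        # No usable context on the incoming request: start a new sampled trace.
        trace_id, flags = secrets.token_hex(16), "01"
    span_id = secrets.token_hex(8)  # identifies this service's outgoing span
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}
```

The key point is that the trace ID is carried through unchanged while a fresh span ID is generated for each hop, which is what allows a backend to stitch the individual spans into a single trace.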

Black-box context propagation works by uniquely identifying each connection request when one service invokes another. When the request completes, it can then propagate the context to the receiving service by retrieving it from a locally stored eBPF map. Unfortunately, one limitation of this approach is that both services must reside on the same node. A solution to this would be to use external storage, and this may be a feature in future versions of Beyla.
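The same-node limitation is easier to see in a small model. In the Python sketch below, a plain dictionary stands in for the per-node eBPF map. This is not Beyla's implementation. The class name, the method names and the choice of the connection's address pair as the key are all our own assumptions.

```python
class NodeContextMap:
    """Stand-in for a per-node eBPF map; a plain dict, not a real kernel map."""

    def __init__(self):
        self._entries = {}

    def record_outgoing(self, src, dst, trace_context):
        """Client side: remember the context for the connection src -> dst."""
        self._entries[(src, dst)] = trace_context

    def lookup_incoming(self, peer, local):
        """Server side: fetch (and remove) the context for the same connection."""
        return self._entries.pop((peer, local), None)


# Two nodes, each with its own independent map.
node_a = NodeContextMap()
node_b = NodeContextMap()

client = ("10.0.0.5", 41234)  # hypothetical (ip, port) of the calling service
server = ("10.0.0.9", 8080)   # hypothetical (ip, port) of the receiving service

node_a.record_outgoing(client, server, "trace-context-1")

same_node = node_a.lookup_incoming(client, server)   # finds the stored context
other_node = node_b.lookup_incoming(client, server)  # nothing stored on node B
```

Because the receiving side can only consult the map on its own node, the lookup succeeds when both services share a node and returns nothing otherwise. Moving the map to shared external storage would remove that restriction, at the cost of a network round trip per lookup.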

Beyla can be run as a stand-alone tool but it can also be run as a component within Alloy - the Grafana distribution of the OpenTelemetry Collector. Being able to bolt an eBPF sensor onto your oTel collector courtesy of some simple YAML config is a compelling proposition and an example of the kinds of power-ups that we can expect from custom distributions of the Collector.

Apache SkyWalking

Apache SkyWalking is a powerful open source observability platform with a large and robust ecosystem. It boasts a plethora of agents, which are responsible for gathering telemetry from clients and forwarding it on to a backend which is responsible for processing and storage. Rover is the SkyWalking eBPF agent. It provides code profiling, but only for C, C++, Golang, and Rust. It also provides profiling for L4 and L7 network traffic and TLS. Although Rover has an extensive set of configuration options it does not, at present, support tracing.

The real strength of Rover lies in its network monitoring and profiling capabilities. The sensor has been constructed to provide deep and granular insight into network traffic and has an awareness of multiple networking protocols and technologies. It also ships with an almost dizzying array of configuration options. It generates a rich set of metrics to assist engineers with tasks such as troubleshooting network issues and monitoring service mesh performance.

Conclusion

In this article we have reviewed only a sample of the current implementations of eBPF in observability tooling. There are numerous other examples, such as the Elastic Profiler (which has now been donated to the OpenTelemetry project) or the use of eBPF in platforms such as Datadog. Our main aim has been to cover the strengths of eBPF, the scope of its usage within observability solutions and some of its current challenges and limitations. Our main takeaways are:

Adoption is growing rapidly and we think that it will soon be standard across the majority of solutions.
There are still many use cases that require code-level (SDK) instrumentation.
Not all eBPF implementations are the same. eBPF is a capability baked into the Linux OS. Different solutions will leverage that capability with varying degrees of performance and scalability.
The scope of eBPF is not limited to providing telemetry about services - it can also provide monitoring for networks and infrastructure.

As we have said above, adoption is growing rapidly and, at the same time, more advanced techniques are being developed by vendors. We look forward to watching how the application of this technology evolves in the observability sector over the coming years. Finally, if you would like to see some "extreme eBPF observability" in action, check out this YouTube video where Red Hat Kernel Engineer Viktor Malik asks the mind-boggling question, "what if I wanted to attach to all 80,000 Linux Kernel functions at once?"

