OpenTelemetry Tracing With Grafana Tempo

Last updated: 2024-01-16

OpenTelemetry is the most important initiative in Observability at the moment and is destined to set the standards for telemetry for years to come. Although it brings with it the promises of standardisation, consistency and flexibility, all of that value does come at a price.

With that power comes a considerable level of complexity and quite a long learning curve - there are no short-cuts to mastering it. It is not a monolith, but when you first encounter the ecosystem it can seem somewhat daunting.

The OpenTelemetry Collector

Architecturally, the centrepiece of OpenTelemetry is the Collector. In some ways, the word 'Collector' does not really do justice to the capabilities or nature of the service. I think it is best conceptualised as a Telemetry Gateway to receive, transform and forward your telemetry.

At this point it is worth noting that adopting OpenTelemetry does not mean that you have to use a Collector. It is entirely possible to meet your observability needs just by using the OpenTelemetry SDK in your source code and emitting your telemetry directly to a backend such as Datadog, Tempo, SigNoz or Honeycomb.
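
If you do go down that route, the standard SDK environment variables are usually all you need. A minimal sketch - the endpoint and header values here are purely illustrative, so check your backend's documentation for the real ones:

export OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp.example-backend.com:4317"
export OTEL_EXPORTER_OTLP_HEADERS="x-api-key=<your key>"
export OTEL_SERVICE_NAME="relay-service"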

Requirements for this exercise

If you want to follow along and run the code referenced here you will need the following tools:

  • .NET 7
  • Visual Studio Community 2022
  • Access to an Azure AKS cluster
  • Helm
  • Kubectl

When to use the Collector

The decision to use the Collector will depend upon a number of factors.

Architectural
Emitting your telemetry to a Collector rather than to a specific backend can give you loose coupling. You can change backends without having to update any application configuration.

Scalability
You can place your Collectors behind a load balancer and then scale up the number of Collector instances as the volume of your telemetry grows.

Filtering
You will almost certainly wish to apply filtering to your telemetry at some stage in your pipeline. Applying filtering at the Collector is a far more robust pattern than applying it within each individual service - see the sketch after this list of factors.

Transformation and Configuration
It is far more manageable to place all of your logic around transformation in a central point such as a Collector rather than duplicating it across a large number of services.
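
To make the filtering point more concrete, dropping health-check spans at the Collector could look something like this. This is a minimal sketch using the filter processor: the '/healthz' route and the processor name are assumptions, and the processor still has to be wired into your traces pipeline:

processors:
  filter/drop-healthchecks:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'

service:
  pipelines:
    traces:
      # receivers and exporters omitted for brevity
      processors: [ filter/drop-healthchecks ]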

An OpenTelemetry Observability Pipeline

The OpenTelemetry Collector is not a complete solution in itself - instead it is one component in an overall pipeline. It is not a tool for visualising or storing your data - you will need other applications for those purposes. An OpenTelemetry-centric implementation will therefore look something like this: instrumented applications emit telemetry to the Collector, which forwards it to one or more storage backends, which are in turn queried by a visualisation tool.

Setting up backends for all of the main signal types would be a very long process. In this article, therefore, we are just going to focus on collecting trace data, and we will use Grafana Tempo as our backend and Grafana for visualisation. Our application will be a .NET 7 web application using the OpenTelemetry .NET SDK. So the structure of our pipeline will be: .NET application → OpenTelemetry Collector → Grafana Tempo → Grafana.

This will be a fully open source architecture with all components running in a Kubernetes cluster. For the examples in this article, we are running our Collector in an Azure AKS cluster using Kubernetes version 1.27.7. The guidance in this section should be applicable to any .NET 7 web application. We have not yet tested this fully against .NET 8.

.NET Configuration

OpenTelemetry Packages

The first thing we need to do is add the relevant OpenTelemetry packages to our project. These are:

  • OpenTelemetry.Extensions.Hosting 1.6.0
  • OpenTelemetry.Instrumentation.AspNetCore 1.6.0-rc.1
  • OpenTelemetry.Instrumentation.Process 0.5.0-beta.3
  • OpenTelemetry.Exporter.Console 1.6.0
  • OpenTelemetry.Exporter.OpenTelemetryProtocol 1.6.0

What the Packages Do

OpenTelemetry.Extensions.Hosting
This integrates OpenTelemetry with the .NET hosting and dependency injection model, so that the OpenTelemetry providers are started and kept alive whilst our application is running.

OpenTelemetry.Instrumentation.AspNetCore
This, as the name suggests, captures telemetry specific to ASP.NET Core, such as incoming HTTP requests.

OpenTelemetry.Instrumentation.Process
This captures more general .NET process-level telemetry, such as CPU and memory usage.

Exporters
We can configure multiple exporters to send our telemetry to different endpoints. We will use the OpenTelemetryProtocol exporter to send telemetry to our Collector.

As you can see, instrumenting our application is not simply a matter of including one monolithic package. Instead, there are a number of smaller packages targeting different technologies and functions.
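
It is also worth noting that referencing an instrumentation package only makes the instrumentation available - you still have to switch it on when configuring the builder. For example, to actually capture the process metrics mentioned above, you would add a call along these lines to the metrics configuration (a sketch - this call is not included in the listing further down):

// a sketch - enables process-level metrics (CPU, memory) alongside the ASP.NET Core metrics
builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()
        .AddProcessInstrumentation()); // from OpenTelemetry.Instrumentation.Process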

Initialisation

After adding our packages, we next need to configure our telemetry options using the application builder. OpenTelemetry is a granular framework and consists of a number of different features which can be switched on or off. We need to specify which telemetry signals we wish to capture as well as specifying where the telemetry should be sent.

In the code below you will notice that we create two different builders. On this line we create a WebApplicationBuilder instance:

     
    var builder = WebApplication.CreateBuilder(args);
      


On these lines, we configure our logging options:
                     
builder.Logging.AddOpenTelemetry(options => 
{ 
    options 
        .SetResourceBuilder( 
            ResourceBuilder.CreateDefault() 
                .AddService(serviceName)) 
        .AddConsoleExporter(); 
});       
                    

Next we create an Action<ResourceBuilder> delegate, which is later passed to ConfigureResource when we register OpenTelemetry with the service collection:

         
    Action<ResourceBuilder> appResourceBuilder =
        resource => resource
            .AddDetector(new ContainerResourceDetector())
            .AddService(serviceName);
      
When we create these builders we need to provide the name of our service. This is essential so that the source of the telemetry can be identified when viewing it in a tool such as Grafana.

Next we configure our options for tracing and metrics:

                     
    builder.Services.AddOpenTelemetry()
        .ConfigureResource(appResourceBuilder)
        .WithTracing(tracing => tracing
            .AddAspNetCoreInstrumentation()
            .AddOtlpExporter(options =>
            {
                options.Endpoint = new Uri($"{oTelCollectorUrl}/v1/traces");
                options.Protocol = OpenTelemetry.Exporter.OtlpExportProtocol.Grpc;
            }))
        .WithMetrics(metrics => metrics
            .AddAspNetCoreInstrumentation()
            .AddOtlpExporter(options =>
            {
                options.Endpoint = new Uri(oTelCollectorUrl);
                options.Protocol = OpenTelemetry.Exporter.OtlpExportProtocol.Grpc;
            })
            .AddConsoleExporter());
      
This involves specifying the Endpoint and the Protocol. gRPC is the default protocol for OpenTelemetry, although you can also use HTTP. As you can see, once our application is deployed it will emit metric and trace signals to the OpenTelemetry Collector and will emit logs to the console. Putting all of this together, the full initialisation code looks like this:

                 
var builder = WebApplication.CreateBuilder(args);
const string serviceName = "relay-service";
string oTelCollectorUrl = builder.Configuration["AppSettings:oTelCollectorUrl"];

builder.Logging.AddOpenTelemetry(options =>
{
    options
        .SetResourceBuilder(
            ResourceBuilder.CreateDefault()
                .AddService(serviceName))
        .AddConsoleExporter();
});

Action<ResourceBuilder> appResourceBuilder =
    resource => resource
        .AddDetector(new ContainerResourceDetector())
        .AddService(serviceName);

builder.Services.AddOpenTelemetry()
    .ConfigureResource(appResourceBuilder)
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri($"{oTelCollectorUrl}/v1/traces");
            options.Protocol = OpenTelemetry.Exporter.OtlpExportProtocol.Grpc;
        }))
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri(oTelCollectorUrl);
            options.Protocol = OpenTelemetry.Exporter.OtlpExportProtocol.Grpc;
        })
        .AddConsoleExporter());

var app = builder.Build(); 
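
As noted above, gRPC is the default, but the same exporter can also send OTLP over HTTP. A sketch of what that would look like - this assumes the Collector's OTLP/HTTP port (4318) is reachable, and note that the signal path /v1/traces must then be part of the endpoint:

builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddOtlpExporter(options =>
        {
            // assumption: the Collector service also exposes the OTLP/HTTP port 4318
            options.Endpoint = new Uri("http://otel-collector-opentelemetry-collector:4318/v1/traces");
            options.Protocol = OpenTelemetry.Exporter.OtlpExportProtocol.HttpProtobuf;
        }));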


OpenTelemetry Logging

The SDK, naturally, emits its own logging and you can use the OTEL_LOG_LEVEL environment variable to define the logging level. For ease of use we are just going to define this in the deployment YAML file for our web application:

 
    resources:
      limits:
        cpu: 250m
        memory: 256Mi
      requests:
        cpu: 100m
        memory: 128Mi
    env:
    - name: OTEL_LOG_LEVEL
      value: "debug"


In addition to this you can also obtain more verbose diagnostics by dropping an OTEL_DIAGNOSTICS.json file into the root directory of your executable. The configuration for this file is very simple:

 
    {
      "LogDirectory": ".",
      "FileSize": 32768,
      "LogLevel": "Warning"
    }


Once the SDK detects the presence of this file it will start emitting detailed diagnostics to a log file whose name follows the pattern 'dotnet<random int>.log'. This can be a really useful aid to debugging but is obviously not advisable for production environments.

Deploying the OpenTelemetry Collector

We are going to run the Collector in an Azure AKS cluster and, as per the OpenTelemetry Collector Helm chart documentation, we will install it as a DaemonSet so that an instance is available on each node.

We are going to deploy the Collector using Helm, so the first thing we need to do is add the OpenTelemetry chart repository to our local Helm repo:

 
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts 


Next, we need to define our own custom values for the Collector. We are enabling:

  • kubernetesAttributes
  • kubeletMetrics
  • logsCollection

Our file will look like this:

 
mode: daemonset 

presets: 
  # enables the k8sattributesprocessor and adds it to the traces, metrics, and logs pipelines 
  kubernetesAttributes: 
    enabled: true 
  # enables the kubeletstatsreceiver and adds it to the metrics pipelines 
  kubeletMetrics: 
    enabled: true 
  # Enables the filelogreceiver and adds it to the logs pipelines 
  logsCollection: 
    enabled: true 
## The chart only includes the loggingexporter by default 
## If you want to send your data somewhere you need to 
## configure an exporter, such as the otlpexporter 
config: 
  exporters: 
    otlp: 
      endpoint: "tempo-distributor.grafana-tempo.svc.cluster.local:4317" 
      tls: 
        insecure: true 
  service: 
    telemetry: 
      logs: 
        level: "debug" 
    pipelines: 
      traces: 
        exporters: [ otlp ] 
#     metrics: 
#       exporters: [ otlp ] 
#     logs: 
#       exporters: [ otlp ] 
service: 
  # Enable the creation of a Service. 
  # By default, it's enabled on mode != daemonset. 
  # However, to enable it on mode = daemonset, its creation must be explicitly enabled 
  enabled: true 
  type: LoadBalancer 
  loadBalancerIP: ""  
  annotations:    
    service.beta.kubernetes.io/azure-load-balancer-internal: "true" 



It might be worth just breaking this down a bit. On these lines we configure a traces exporter. This will export trace telemetry to an instance of Grafana Tempo running on the same cluster:


exporters:
  otlp:
    endpoint: "tempo-distributor.grafana-tempo.svc.cluster.local:4317"
    tls:
      insecure: true


On these lines we are configuring the Collector to emit logging about itself:


service:
  telemetry:
    logs:
      level: "debug"



On these lines we are adding our otlp exporter to our pipeline:


pipelines:
  traces:
    exporters: [ otlp ]



On these lines we are creating an Azure AKS Load Balancer Service, which will only be accessible inside the cluster:


service:
  enabled: true
  type: LoadBalancer
  loadBalancerIP: ""
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
 


By default, a Load Balancer Service is assigned a public IP address. Setting loadBalancerIP to an empty string simply leaves the address assignment to Azure; it is the azure-load-balancer-internal annotation that makes the Load Balancer internal-only. Different cloud providers have different annotations for implementing this option.


We can now install the Collector into our cluster:

 
   helm install otel-collector open-telemetry/opentelemetry-collector --values values.yml 
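
Once the chart has been installed, it is worth confirming that a Collector agent pod is running on each node. Assuming the chart's standard labels, something like this should do it:

kubectl get pods -l app.kubernetes.io/name=opentelemetry-collector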


kubeletstats Receiver

When we installed the Collector with this configuration we found that the kubeletstats receiver reported the following error when trying to scrape metrics from the kubelet:

 
"otel-collector-opentelemetry-collector-agent-q97ql opentelemetry-collector 2023-12-09T14:31:16.104Z 
error scraperhelper/scrapercontroller.go:200 Error scraping metrics
{"kind": "receiver", "name": "kubeletstats", "data_type": "metrics", "error": "Get \"https://aks-nodepool1-16446667-vmss000006:10250/stats/summary\":
tls: failed to verify certificate: x509: certificate signed by unknown authority", "scraper": "kubeletstats"}"

We resolved this by updating the Collector's ConfigMap and setting the insecure_skip_verify option to true for the kubeletstats receiver:

 
    kubeletstats:
      auth_type: serviceAccount
      collection_interval: 20s
      endpoint: ${env:K8S_NODE_NAME}:10250
      insecure_skip_verify: true


This obviously may not be desirable for production environments.
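
An alternative to editing the generated ConfigMap by hand is to set the option in the Helm values file and upgrade the release - a sketch, assuming the chart merges this into the receiver configuration it generates:

config:
  receivers:
    kubeletstats:
      insecure_skip_verify: true

You can then apply it with helm upgrade otel-collector open-telemetry/opentelemetry-collector --values values.yml.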

We have our OpenTelemetry Collector up and running. It is receiving and exporting logs, metrics and traces. Although the Collector is a highly sophisticated service, it is important to remember that it is, essentially, a relay. It has no UI and it does not store telemetry. It collects and forwards telemetry signals. The storage and visualisation of those signals are functions performed by tools further downstream in the overall telemetry pipeline. We will now look at the processing of trace signals in Grafana Tempo.

Traces - Grafana Tempo

If you wish, you can use the Cloud version of Tempo - and this means that you can avoid the overhead of installing and maintaining your own instance. We like to get hands on, so we are going to install our own instance in our AKS cluster. The self-hosted version of Tempo can be deployed in two modes - either as a monolith or as a set of microservices. We are going to use the distributed/microservices option. In functional terms, they are largely the same. The distributed version, however, splits the monolith into a number of smaller services which can then each be scaled independently.

A quick glance at the GitHub repo for the Tempo Helm charts shows that it is not a trivial application and consists of multiple components.

Hardware Requirements
Tempo requires a node with a minimum of four cores and 16GB of RAM (https://grafana.com/docs/helm-charts/tempo-distributed/next/get-started-helm-charts/)

Set Up

In setting up our Tempo instance we will be (roughly) following the instructions in the Grafana Tempo Helm chart documentation (linked in the References below).

The first thing we will do is create a new namespace in our cluster:

 
kubectl create namespace grafana-tempo 


Next we will add the Grafana Helm charts to our local Helm repo:

 
helm repo add grafana https://grafana.github.io/helm-charts 
helm repo update 


The documentation provides the following sample values for our values.yaml file:

 
--- 
storage: 
  trace: 
    backend: s3 
    s3: 
      access_key: 'grafana-tempo' 
      secret_key: 'supersecret' 
      bucket: 'tempo-traces' 
      endpoint: 'tempo-minio:9000' 
      insecure: true 
#MinIO storage configuration 
minio: 
  enabled: true 
  mode: standalone 
  rootUser: grafana-tempo 
  rootPassword: supersecret 
  buckets: 
    # Default Tempo storage bucket 
    - name: tempo-traces 
      policy: none 
      purge: false 
traces: 
  otlp: 
    grpc: 
      enabled: true 
    http: 
      enabled: true 
  zipkin: 
    enabled: false 
  jaeger: 
    thriftHttp: 
      enabled: false 
  opencensus: 
    enabled: false 


Storage

Rather than using MinIO or Amazon S3, we will be using an Azure Storage Account, so our configuration will look like this:

storage: 
  trace: 
    backend: azure 
    azure: 
      container_name: tempo-traces 
      storage_account_name: stgappgeneraluks 
      storage_account_key: ${STORAGE_ACCOUNT_ACCESS_KEY}   

distributor: 
  log_received_spans: 
    enabled: true 
  extraArgs: 
  - "-config.expand-env=true" 
  extraEnv: 
  - name: STORAGE_ACCOUNT_ACCESS_KEY 
    valueFrom: 
      secretKeyRef: 
        name: tempo-traces-stg-key 
        key: tempo-traces-key   

compactor: 
  extraArgs: 
  - "-config.expand-env=true" 
  extraEnv: 
  - name: STORAGE_ACCOUNT_ACCESS_KEY 
    valueFrom: 
      secretKeyRef: 
        name: tempo-traces-stg-key 
        key: tempo-traces-key   

ingester: 
  extraArgs: 
  - "-config.expand-env=true" 
  extraEnv: 
  - name: STORAGE_ACCOUNT_ACCESS_KEY 
    valueFrom: 
      secretKeyRef: 
        name: tempo-traces-stg-key 
        key: tempo-traces-key   

querier: 
  extraArgs: 
  - "-config.expand-env=true" 
  extraEnv: 
  - name: STORAGE_ACCOUNT_ACCESS_KEY 
    valueFrom: 
      secretKeyRef: 
        name: tempo-traces-stg-key 
        key: tempo-traces-key 

queryFrontend: 
  extraArgs: 
  - "-config.expand-env=true" 
  extraEnv: 
  - name: STORAGE_ACCOUNT_ACCESS_KEY 
    valueFrom: 
      secretKeyRef: 
        name: tempo-traces-stg-key 
        key: tempo-traces-key 


As we are running in distributed mode, we need to configure the extra arguments for each of the services that will be connecting to our storage account. This means that we have to apply the configuration to the following services:

  • distributor
  • compactor
  • ingester
  • querier
  • query-frontend

Clearly we don't want to expose our Azure Storage Account key in our Helm values, so we point to a Kubernetes secret that holds the value instead. We need to create that secret in the same namespace as our Tempo instance:

 
kubectl create secret generic tempo-traces-stg-key --from-literal=tempo-traces-key=<your-key> -n grafana-tempo 
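
If you need to look up the key itself, the Azure CLI can retrieve it - the resource group name here is hypothetical:

az storage account keys list --resource-group my-resource-group --account-name stgappgeneraluks --query "[0].value" -o tsv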


We are only collecting OTLP traces, so in our configuration we have enabled OTLP but set Zipkin, Jaeger and OpenCensus to false.

Installation

We are now ready to install our Tempo Helm chart:

 
helm -n grafana-tempo install tempo grafana/tempo-distributed --values D:\Data\Development\git\Grafana\Tempo\custom-values.yaml 


If the command completes successfully you will see a summary of the components that have been installed.

The next thing to do is verify that the pods are running successfully:

 
 kubectl -n grafana-tempo get pods 


You should see all of the Tempo pods in a Running state.

Before looking at our trace data, let us just quickly recap our basic pipeline configuration. Our application is configured to emit telemetry to the OpenTelemetry collector, and sends traces to http://otel-collector-opentelemetry-collector:4317.

The Collector, in turn, is configured to export to the local Tempo distributor service at tempo-distributor.grafana-tempo.svc.cluster.local:4317, as defined in the exporters section of our daemonset values file shown earlier.

Viewing Traces in Grafana

Tempo is essentially a store - if we want to see our traces we will need to use a visualisation tool such as Grafana. We are going to follow the instructions in the Grafana documentation to run a quick and easy installation of Grafana onto our cluster. We are going to call our namespace 'grafana-main' rather than 'my-grafana' as used in the documentation.

Once Grafana is running on our cluster, we will set up a new connection to point to our Tempo instance:

The connection Url we are using is http://tempo-query-frontend.grafana-tempo.svc.cluster.local:3100

If you are running Tempo in microservices mode, tempo-query-frontend is the name of the service to connect to when creating a new connection in Grafana. In our case, this service is listening on port 3100 - although that may not always be the case.

We need to append the segment 'grafana-tempo.svc.cluster.local' to the service name because Tempo is running in a different namespace to Grafana.
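
You can confirm the exact service name and port from the cluster before saving the connection:

kubectl -n grafana-tempo get svc tempo-query-frontend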

Now we can click on Save & Test and open up the Explore view.

We are going to run a very simple query to retrieve any traces with a duration greater than 20ms. We are going to filter by Service Name (this is the name that we specified for our service in our Program.cs file).
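
If you prefer to type the query rather than use the builder, the equivalent TraceQL is short - a sketch, assuming the service name we set in Program.cs:

{ resource.service.name = "relay-service" && duration > 20ms }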

If we click on the Run Query button we will see a table with traces that meet our criteria:

We can then start to drill down and see some rich telemetry on our http requests:

That's it - we now have an end to end pipeline for emitting, collecting, storing and viewing our traces!

Collector Telemetry

As you would probably expect, you can also configure how the Collector actually emits telemetry about itself.

This is managed in the service/telemetry section of the YAML definition of the collector:

 
service: 
  telemetry: 
    logs: 
      level: DEBUG 
      initial_fields: 
        service: my-instance 
    metrics: 
      level: detailed 
      address: 0.0.0.0:8888 



You can find more information on this topic in the Collector configuration documentation. You can configure logs and metrics for the service but not profiling or traces.

Note that it’s possible to scrape the metrics by using a Prometheus receiver within the Collector configuration so that we can consume the Collector’s metrics at the backend. For example:

 
receivers: 
  prometheus: 
    trim_metric_suffixes: true 
    use_start_time_metric: true 
    start_time_metric_regex: .* 
    config: 
      scrape_configs: 
        - job_name: 'otel-collector' 
          scrape_interval: 5s 
          static_configs: 
            - targets: ['127.0.0.1:8888'] 

exporters: 
  otlp: 
    endpoint: my.company.com:4317 
    tls: 
      insecure: true 

service: 
  pipelines: 
    metrics: 
      receivers: [prometheus] 
      exporters: [otlp] 

The OpenTelemetry documentation suggests that some of the key service health indicators to monitor are (the corresponding Collector metric names are shown after this list):

  • the rate of accepted vs. refused data (the health of your receivers)
  • the rate of sent vs failed exports (the health of your exporters)
  • queue length
  • retry count
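
If you are scraping the Collector's own metrics as described above, these indicators map onto internal metrics along the following lines - the names can vary between Collector versions, so treat these as indicative:

otelcol_receiver_accepted_spans / otelcol_receiver_refused_spans
otelcol_exporter_sent_spans / otelcol_exporter_send_failed_spans
otelcol_exporter_queue_size / otelcol_exporter_queue_capacity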

Final Thoughts

We have done quite a lot of work so far, but at present we still only have a minimal setup. In our source code we are not defining any custom metrics. Equally, in our Collector we have not defined any sampling or filters. We also do not yet have backends for collecting logs or metrics. Each of these will require adding further layers of configuration to our Collector as well as provisioning further resources in our AKS cluster.

Even when we have configured all of our signals and set up all of our endpoints there are still further issues to consider. There are many different patterns just for deploying the Collector. It can be run as a Deployment or as a DaemonSet, and it can also be managed by the OpenTelemetry Operator. We can configure dedicated Collectors for each signal type or run them on different clusters. We can even compile custom builds of the Collector that package only the functionality we need, reducing its footprint. As we mentioned at the beginning, OpenTelemetry is a huge framework. Mastering it takes time and managing it is an ongoing process.

References

Grafana Tempo Helm Charts:

https://grafana.com/docs/helm-charts/tempo-distributed/next/get-started-helm-charts/

https://grafana.com/docs/tempo/latest/setup/set-up-test-app/

List of all Tempo config options:
https://github.com/grafana/helm-charts/blob/main/charts/tempo-distributed/README.md

OpenTelemetry SDK Config:
https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/

OpenTelemetry Collector Helm chart
https://opentelemetry.io/docs/kubernetes/helm/collector/

The Tempo manifest:
https://grafana.com/docs/tempo/latest/configuration/manifest/

Other docs
https://grafana.com/blog/2023/12/18/opentelemetry-best-practices-a-users-guide-to-getting-started-with-opentelemetry

https://grafana.com/blog/2023/11/21/do-you-need-an-opentelemetry-collector
