Metrics with the OpenTelemetry Collector

Last updated: 2024-01-30

In this article we will look at using the OpenTelemetry Collector to capture metrics emitted by a .NET application running in an Azure AKS cluster. The metrics will then be exported to Prometheus and viewed in Grafana.

Requirements for this exercise
If you want to follow along and run the code referenced here you will need the following tools:

  • .NET 7
  • Visual Studio Community 2022
  • Access to an Azure AKS cluster
  • Helm
  • kubectl

.NET Configuration

OpenTelemetry Packages

The first thing we need to do is add the relevant OpenTelemetry packages to our project. These are:

  • OpenTelemetry.Extensions.Hosting 1.6.0
  • OpenTelemetry.Instrumentation.AspNetCore 1.6.0-rc.1
  • OpenTelemetry.Instrumentation.Process 0.5.0-beta.3
  • OpenTelemetry.Exporter.Console 1.6.0
  • OpenTelemetry.Exporter.OpenTelemetryProtocol 1.6.0
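
If you prefer the command line to the Visual Studio NuGet UI, the same packages can be added with the dotnet CLI. The versions here simply mirror the list above, so adjust them to whatever is current when you read this:

dotnet add package OpenTelemetry.Extensions.Hosting --version 1.6.0
dotnet add package OpenTelemetry.Instrumentation.AspNetCore --version 1.6.0-rc.1
dotnet add package OpenTelemetry.Instrumentation.Process --version 0.5.0-beta.3
dotnet add package OpenTelemetry.Exporter.Console --version 1.6.0
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol --version 1.6.0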

Initialisation

After adding our packages, we next need to configure our telemetry options using the Application Builder. OpenTelemetry is a granular framework and consists of a number of different features which can be switched on or off. We need to specify which telemetry signals we wish to capture, as well as where the telemetry should be sent.

 
var builder = WebApplication.CreateBuilder(args);
const string serviceName = "relay-service";
string oTelCollectorUrl = builder.Configuration["AppSettings:oTelCollectorUrl"];

// ContainerResourceDetector comes from the OpenTelemetry.ResourceDetectors.Container package
Action<ResourceBuilder> appResourceBuilder =
    resource => resource
        .AddDetector(new ContainerResourceDetector())
        .AddService(serviceName);

builder.Services.AddOpenTelemetry()
    .ConfigureResource(appResourceBuilder)
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri(oTelCollectorUrl);
            options.Protocol = OpenTelemetry.Exporter.OtlpExportProtocol.Grpc;
        })
        .AddConsoleExporter());

var app = builder.Build();

In the above code block, we have configured our Endpoint by passing in our 'oTelCollectorUrl'. Our Collector is running in the same cluster as the service, so the URL will look something like "http://otel-collector-opentelemetry-collector.default.svc.cluster.local:4317". gRPC is the default protocol for OpenTelemetry, although you can also use HTTP. As you can see from the lines below, once our application is deployed it will emit metrics signals to the OpenTelemetry Collector as well as to the Console.


.AddOtlpExporter(options => 
{ 
    options.Endpoint = new Uri(oTelCollectorUrl); 
    options.Protocol = OpenTelemetry.Exporter.OtlpExportProtocol.Grpc; 
}) 
.AddConsoleExporter()); 
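
If you cannot use gRPC - for example because a proxy in your path only speaks HTTP - the exporter can be switched to OTLP over HTTP/protobuf instead. A minimal sketch, assuming the Collector's OTLP HTTP receiver is exposed on its default port of 4318 and noting that, for HttpProtobuf, the .NET SDK expects the endpoint to include the signal path:

.AddOtlpExporter(options =>
{
    // OTLP over HTTP/protobuf; the Collector's HTTP receiver listens on 4318 by default
    options.Endpoint = new Uri("http://otel-collector-opentelemetry-collector.default.svc.cluster.local:4318/v1/metrics");
    options.Protocol = OpenTelemetry.Exporter.OtlpExportProtocol.HttpProtobuf;
})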
    

Understanding the Output

If you run your application locally and look at the Debug output, you will be able to view the actual telemetry being emitted to the Collector. You will see a stream of output with entries such as the following:

 
 (2024-01-06T18:21:06.1819700Z, 2024-01-06T18:22:06.1557673Z] http.request.method: POST http.response.status_code: 200 http.route: Relay/SendMessage network.protocol.version: 1.1 url.scheme: https Histogram 
Value: Sum: 0.2956234 Count: 4 Min: 0.0233219 Max: 0.1224156  
(-Infinity,0]:0 
(0,0.005]:0 
(0.005,0.01]:0 
(0.01,0.025]:1 
(0.025,0.05]:0 
(0.05,0.075]:1 
(0.075,0.1]:1 
(0.1,0.25]:1 
(0.25,0.5]:0 
(0.5,0.75]:0 
(0.75,1]:0 
(1,2.5]:0 
(2.5,5]:0 
(5,7.5]:0 
(7.5,10]:0 
(10,+Infinity]:0 

In case you are wondering what this means, here is a quick breakdown:

  • Time Range:
    (2024-01-06T18:21:06.1819700Z, 2024-01-06T18:22:06.1557673Z] indicates the time range over which these measurements were taken.

  • HTTP Request Metadata:

    • http.request.method: POST specifies that the HTTP request method was POST.
    • http.response.status_code: 200 indicates that the response status code was 200 - i.e. successful.
    • http.route: Relay/SendMessage tells us the route or endpoint that was called is Relay/SendMessage.
    • network.protocol.version: 1.1 indicates the network protocol version used, which in this case is HTTP/1.1.
    • url.scheme: https shows that the URL scheme used was HTTPS.
  • Histogram Data: This part shows the distribution of response times for the requests:

    • Value: Sum: 0.2956234 Count: 4 Min: 0.0233219 Max: 0.1224156 gives a summary of the data:
      • Sum: 0.2956234 is the total of all the recorded times.
      • Count: 4 indicates that there were four measurements.
      • Min: 0.0233219 and Max: 0.1224156 show the minimum and maximum response times, respectively.
    • The following lines represent the histogram bins, showing how many requests fell into each time range (in seconds):
      • (-Infinity,0]:0 means no requests took less than or equal to 0 seconds.
      • (0,0.005]:0 means no requests took between 0 and 0.005 seconds.
      • (0.005,0.01]:0 and so on.
      • (0.01,0.025]:1 means one request took between 0.01 and 0.025 seconds.
      • This pattern continues, showing the distribution of response times in various ranges.

This histogram is useful for understanding the performance characteristics of your HTTP requests, such as how many are fast, moderate, or slow. The bucket boundaries shown here are the defaults defined in the OpenTelemetry SDK.
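
If the default boundaries do not suit your latency profile, the SDK lets you override them with a View. A minimal sketch, targeting the ASP.NET Core request-duration instrument (the instrument name comes from the HTTP semantic conventions linked in the references) with some coarser, purely illustrative boundaries:

builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()
        // Override the default histogram boundaries (in seconds) for this instrument
        .AddView("http.server.request.duration",
            new ExplicitBucketHistogramConfiguration
            {
                Boundaries = new double[] { 0.05, 0.1, 0.25, 0.5, 1, 2.5 }
            })
        .AddOtlpExporter());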

The OpenTelemetry Collector

We are going to run the Collector in an Azure AKS cluster and, as per the OpenTelemetry guidance, we will install it as a DaemonSet so that it is available on each node. We are going to deploy the Collector using Helm, so the first thing we need to do is add the OpenTelemetry Helm charts to our Helm repository:

 
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts  
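
It is also worth refreshing the local chart cache so that Helm picks up the latest version of the chart:

helm repo update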

Next, we need to define our own custom values for the Collector. We are enabling:

  • kubernetesAttributes
  • kubeletMetrics
  • logsCollection

Our values file will look like this:
 
mode: daemonset
 
presets: 
  # enables the k8sattributesprocessor and adds it to the traces, metrics, and logs pipelines 
  kubernetesAttributes: 
    enabled: true 
  # enables the kubeletstatsreceiver and adds it to the metrics pipelines 
  kubeletMetrics: 
    enabled: true 
  # Enables the filelogreceiver and adds it to the logs pipelines 
  logsCollection: 
    enabled: true 
## The chart only includes the loggingexporter by default 
## If you want to send your data somewhere you need to 
## configure an exporter, such as the otlpexporter 
config: 
  exporters:
    prometheus: 
      endpoint: "0.0.0.0:9090" 
      namespace: "default" 
  service: 
    telemetry: 
      logs: 
        level: "debug" 
    pipelines: 
      metrics: 
        exporters: [ otlp, prometheus ] 
#     logs: 
#       exporters: [ otlp ] 
service: 
  # Enable the creation of a Service. 
  # By default, it's enabled on mode != daemonset. 
  # However, to enable it on mode = daemonset, its creation must be explicitly enabled 
  enabled: true 
 
  type: LoadBalancer 
  loadBalancerIP: ""  
  annotations:    
    service.beta.kubernetes.io/azure-load-balancer-internal: "true" 
 
ports: 
  prometheus: 
    enabled: true 
    protocol: TCP 
    containerPort: 9090 
    servicePort: 9090 
    hostPort: 9090 

Let's just break this down a bit. On the lines below we configure a metrics exporter. This will export metrics to an instance of Prometheus running on the same cluster.

config: 
  exporters:
    prometheus: 
      endpoint: "0.0.0.0:9090" 
      namespace: "default" 
                

On the lines below we are configuring the Collector to emit debug-level logging about its own operation:


  service: 
    telemetry: 
      logs: 
        level: "debug" 
                

On the lines below we are adding our otlp and prometheus exporters to the metrics pipeline:


    pipelines: 
      metrics: 
        exporters: [ otlp, prometheus ] 
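
One thing to bear in mind is that every exporter referenced in a pipeline must also be defined under the exporters section. The prometheus exporter is defined in our values above; as the comment from the chart's default values notes, no otlp exporter is configured out of the box, so if you keep otlp in the list it needs its own entry. A minimal sketch, with a purely illustrative endpoint:

config:
  exporters:
    otlp:
      # illustrative backend address - replace with your own OTLP endpoint
      endpoint: "my-otlp-backend.example.com:4317"
      tls:
        insecure: true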
                

On the lines below we are creating a LoadBalancer Service which will only be accessible inside the cluster. By default, a LoadBalancer Service is assigned a public IP address. Here we leave loadBalancerIP empty so that Azure assigns the address, and we add an AKS-specific annotation which makes the Load Balancer internal only. Different cloud providers will have different ways of implementing this requirement.


  type: LoadBalancer 
  loadBalancerIP: ""  
  annotations:    
    service.beta.kubernetes.io/azure-load-balancer-internal: "true" 
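
Once the chart has been installed (below), you can confirm that the Service has been given a private address rather than a public one. The Service name here is the one created by our release; adjust it if yours differs:

kubectl get svc otel-collector-opentelemetry-collector

The EXTERNAL-IP column should show an address from your virtual network's range.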
 

In the lines below we are exposing port 9090 so that metrics can be scraped by Prometheus:


ports: 
  prometheus: 
    enabled: true 
    protocol: TCP 
    containerPort: 9090 
    servicePort: 9090 
    hostPort: 9090 
                

We can now install the Collector into our cluster:

 
helm install otel-collector open-telemetry/opentelemetry-collector --values values.yml 
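
Once the release is up, a quick way to check that the Prometheus exporter endpoint is actually serving metrics is to port-forward the Collector Service and curl it (run the curl in a second terminal). Again, the Service name below matches our release name:

kubectl port-forward svc/otel-collector-opentelemetry-collector 9090:9090
curl localhost:9090/metrics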
 

KubeStats Service

When we installed the Collector using this configuration, we found that the kubeletstats receiver logged the following error when trying to scrape metrics from the kubelet:

 
" otel-collector-opentelemetry-collector-agent-q97ql opentelemetry-collector 2023-12-09T14:31:16.104Z     error   scraperhelper/scrapercontroller.go:200  Error scraping metrics  {"kind": "receiver", "name": "kubeletstats", "data_type": "metrics", "error": "Get \"https://aks-nodepool1-16446667-vmss000006:10250/stats/summary\": tls: failed to verify certificate: x509: certificate signed by unknown authority", "scraper": "kubeletstats"}" 
 

We resolved this by updating the Collector's ConfigMap, setting the insecure_skip_verify option to true on the kubeletstats receiver. This may not be desirable for production environments:

 
  kubeletstats: 
     auth_type: serviceAccount 
     collection_interval: 20s 
     endpoint: ${env:K8S_NODE_NAME}:10250 
     insecure_skip_verify: true         
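
Editing the generated ConfigMap by hand works, but Helm will overwrite it on the next upgrade. The same override can instead be applied through the chart's values file; a minimal sketch:

config:
  receivers:
    kubeletstats:
      # accept the kubelet's self-signed certificate - not recommended for production
      insecure_skip_verify: true

and then rolled out with:

helm upgrade otel-collector open-telemetry/opentelemetry-collector --values values.yml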
 

We have our OpenTelemetry collector up and running and it is receiving and exporting logs and metrics. Although the Collector is a highly sophisticated service, it is important to remember that it is, essentially, a relay. It has no UI and it does not store telemetry. It collects and forwards telemetry signals. The storage and visualisation of those signals are functions performed by tools further downstream in the overall telemetry pipeline. We will now look at installing and configuring Prometheus to scrape metrics from our Collector.

Prometheus

Installing Prometheus

We don't like installing K8S applications into the default namespace so the first thing we are going to do is create a namespace for our Prometheus service:


 kubectl create namespace prometheus 

Next we need to add the Prometheus Helm repository to our local Helm setup. We are going to use the prometheus-community chart:


helm repo add prometheus-community https://prometheus-community.github.io/helm-charts   
 
helm repo update  

Before installing the Helm chart into our cluster we will need to update the scrape config so that Prometheus scrapes metrics from our OpenTelemetry endpoint. To do this, we first download the values.yaml file from the chart's GitHub repo:

https://github.com/prometheus-community/helm-charts/blob/main/charts/prometheus/values.yaml

Next we will update the scrape config to set our OpenTelemetry Collector as a target:

    
    scrape_configs:  
      - job_name: prometheus 
        static_configs: 
          - targets: 
            - localhost:9090 
 
      - job_name: 'otel-collector' 
        honor_labels: true 
        static_configs: 
          - targets: ['otel-collector-opentelemetry-collector.default.svc.cluster.local:9090'] 
 

and now we will install Prometheus into our cluster:

 
helm install prometheus prometheus-community/prometheus --values values.yaml -n prometheus 
 

Verifying the installation

Next we will verify that our installation was successful:


kubectl get pods -n prometheus  

If the install was successful you should see the Prometheus pods up and running, something like this:

The installation will also create a Kubernetes Service called prometheus-server. We can port-forward this and then connect to its web UI:


 kubectl port-forward svc/prometheus-server 8080:80 -n prometheus 

The UI will look something like this:

As you can see, on the start page there is a text box where we can query our metrics by entering a valid PromQL expression. The simplest query is just to enter the name of a metric:

We can then refine the query by adding filters on the label values in our series. For example, we can filter by job name:
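
As an illustration, here are the kind of expressions we will be using later on. The metric name carries the "default" namespace prefix we configured on the Prometheus exporter, and the service name we set in .NET surfaces as the exported_job label (as we will see again in the Grafana section):

default_http_server_request_duration_seconds_count

default_http_server_request_duration_seconds_count{exported_job="relay-service"}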

If we click on Status/Targets we will see a table showing all of the targets from which our Prometheus instance is scraping metrics. If you scroll down the page you should be able to confirm that Prometheus is successfully scraping metrics from our OpenTelemetry Collector:

If you click on the Metrics Explorer button on the right hand side of the screen:

You will be able to see a full list of the metrics that are available to search on:

Labels

One thing you will notice is that, even with a relatively minimal implementation, you are immediately faced with a very large quantity of metrics. Labels are a vital concept in metrics and they are the key to making sense of and querying this data. Luckily, Prometheus provides us with a set of metadata APIs where we can list metadata such as the labels which are currently in use.

We are not publicly exposing our Prometheus instance. However, we can still access the API on port 9090 of the container running our Prometheus instance. To do this we will kick off a debug session with kubectl:

 
 kubectl debug prometheus-server-5f8f75bd86-tjsxv -n prometheus -it --image=curlimages/curl -- sh  

In the above command, substitute the name of your own pod for prometheus-server-5f8f75bd86-tjsxv. When running this command you need to specify the name of an image for the debug container. As we want an image with curl installed, we are going for the obvious option of a curl image. This also has the advantage of being very lightweight.

The debug command will open an interactive shell in the container and we can now run a query against the labels API:

 
 curl 'localhost:9090/api/v1/labels' 
 

This should return a full list of current labels:

We can also use the label values API to find the values for a specific label:


 curl localhost:9090/api/v1/label/service_name/values  

We have verified that our Prometheus service is up and running and able to scrape metrics from our OpenTelemetry Collector. Our next step is to view our metrics in Grafana.

Viewing Metrics In Grafana

We are going to be following the instructions on this page: https://grafana.com/docs/grafana/latest/setup-grafana/installation/kubernetes/ to run a quick and easy installation of Grafana onto our cluster. We are going to call our namespace 'grafana-main' rather than 'my-grafana' as used in the documentation.

Once Grafana is running on our cluster we will set up a new connection to point to our Prometheus instance. The Grafana UI makes this a really simple process. All we need to do is add a new Connection and set our Prometheus instance as the source.

Once we select Prometheus we need to do very little further configuration. We will accept the default name for the connection (prometheus) and then we just need to enter the Prometheus server URL.

The full value of our URL is: http://prometheus-server.prometheus.svc.cluster.local:80

This breaks down as follows:

  • prometheus-server: the name of our Kubernetes Prometheus Service. You will see this if you look at the Services and Ingresses side menu option for the AKS cluster in the Azure Portal.

  • prometheus: the namespace into which we have installed our Prometheus instance.

  • svc.cluster.local:80: svc.cluster.local is the cluster-local DNS suffix for Services, and the prometheus-server Service exposes the Prometheus API on port 80.

Although there are numerous other configuration options on this screen, we can get started by just entering our Prometheus URL and then clicking on the Save & test button:

If all goes well you will see a success notification like this:

Let us click on the "building a dashboard" link and then click on the "Add Visualization" button:

We will obviously select Prometheus as our data source and then we will see a screen where we can define our visualization. We do this by defining a

  • metric
  • label
  • value

from the drop down boxes:

If you expand the Metric Dropdown list you will see a very long list of possible metrics. For this exercise we are going to select the request duration metric:

Next we need to define our Label filters. You may remember that, in the Program.cs file for our application, we defined an OpenTelemetry service name and assigned it a value of "relay-service". The service name surfaces in Prometheus as the "exported_job" label. We will therefore select that label and then select "relay-service" as our value:

If you now click on Save, Grafana will spin up a cool visualization:

As you can see, the Grafana visualisation defaults to a line graph. As we are looking at a set of discrete events, we are going to change the graph style to Points:

This now looks a lot more intuitive:

When looking at OpenTelemetry metrics in a Grafana point chart, you may be struck by a whole series of points appearing in a straight line. Given that HTTP request durations are measured in milliseconds, it would seem highly improbable that a succession of HTTP requests all had exactly the same duration. The explanation is that OpenTelemetry metrics are grouped into buckets, so we are looking at a histogram rather than a scatter plot. We saw this earlier when we looked at the output being emitted by the OpenTelemetry SDK: rather than exporting an individual data point for every request down to the millisecond, the SDK records each measurement against a set of bucket boundaries.
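
If you would rather chart a latency percentile than the raw bucket counts, the usual PromQL approach is to apply histogram_quantile to the per-bucket rate. A sketch using our metric names (the _bucket series is the histogram counterpart of the _count series we query below):

histogram_quantile(
  0.95,
  sum by (le) (
    rate(default_http_server_request_duration_seconds_bucket{exported_job="relay-service"}[5m])
  )
)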

There is one further UI issue - and that is that the legend is very cluttered. Our metric is composed of multiple labels and most of them do not really need visibility. Equally, the text of some of the labels is not particularly user-friendly. We are going to transform the text to make it less verbose and more readable. There are a couple of options for doing this, but we are going to look at using the PromQL 'label_replace' function. There are some labels we want to reword and others we want to omit altogether; label_replace enables us to achieve both of these objectives. We are going to replace the default query with this new query, which will help us to streamline our legend labels:

 
label_replace(
  label_replace(
    label_replace(
      label_replace(
        default_http_server_request_duration_seconds_count{exported_job="relay-service"},
        "Method",
        "$1",
        "http_request_method",
        "(.+)"
      ),
      "job",
      "",
      "job",
      "(.+)"
    ),
    "instance",
    "",
    "instance",
    "(.+)"
  ),
  "http_request_method",
  "",
  "http_request_method",
  "(.+)"
)

This query carries out the following transformations to the labels returned to the visualization:

  1. Adds a label called "Method" to represent the value for the "http_request_method" label
  2. Removes the "http_request_method" label
  3. Removes the "job" label
  4. Removes the "instance" label

On the one hand, multiple nested functions are not pretty from a coding point of view. On the other hand, this does provide a relatively simple and powerful mechanism for tidying up our display of labels.

At this point we are going to take a break as we have achieved quite a lot here and have put in place the building blocks for a really powerful solution. Just to recap, we now have:

  • .NET applications instrumented to emit metrics
  • an OpenTelemetry Collector which can capture and forward the metrics
  • a Prometheus instance to store our metrics
  • a Grafana instance to view our metrics

Once again, all of the tools we have used are fully open source and we have full control over the sampling, filtering, retention and scaling of every aspect of our telemetry solution. In future articles we will build on this foundation and harness the extensive capabilities of the OpenTelemetry framework to build out more powerful and sophisticated solutions.

References

Prometheus API:
https://prometheus.io/docs/prometheus/latest/querying/api/#getting-label-names
https://prometheus.io/docs/prometheus/latest/querying/api/

Prometheus configuration:
https://prometheus.io/docs/prometheus/1.8/configuration/configuration/

OpenTelemetry Semantic Conventions For HTTP metrics:

https://github.com/open-telemetry/semantic-conventions/blob/v1.23.0/docs/http/http-metrics.md

https://prometheus.io/docs/prometheus/latest/querying/basics/#time-series-selectors

https://blog.frankel.ch/opentelemetry-collector/
