The Loki Lowdown

Last updated: 2024-04-22

Loki is Grafana's open source log aggregation system. Among its principal features are:

  • no log formatting requirements on ingestion
  • 100% persistence to object storage
  • integration with Prometheus and K8S

One of the first things that you will learn about Loki is that it is not a trivial system. It is an enterprise-level system designed for ingesting logs at scale. Whilst logging may appear to be simple at a conceptual level, building log aggregation systems which work at large scale is actually quite a complex undertaking. Log aggregation systems are subject to many of the same constraints and complexities as database systems. Setting up an enterprise logging solution means making decisions about storage, ingestion and querying. You also have to think about parameters that govern consistency, validation, reconciliation and compaction. In effect, you need to have an understanding of the Loki architecture and be prepared to make both high-level and low-level engineering decisions.

One distinctive aspect of Loki is that it indexes only the metadata rather than the full text of a log line. That means that it will index fields such as the timestamp and any labels, but it will not index the actual log text. This approach provides the following benefits:

  • faster writes
  • smaller indexes
  • lower costs

This is great but, as always in IT, there are trade-offs. In this case, logging systems such as Elastic, which do full-text indexing, can potentially provide faster and more flexible querying.

Loki Architecture

Loki is a sophisticated log aggregation and query platform which has multiple components and is carefully architected for large scale ingestion and economical storage. The main components that we are interested in are:

Distributor

The distributor service is responsible for handling incoming push requests from clients. It's the first stop in the write path for log data. Once the distributor receives a set of streams in an HTTP request, each stream is validated for correctness and to ensure that it is within the configured tenant (or global) limits. Each valid stream is then sent to n Ingesters in parallel, where n is the replication factor that has been configured for the system.
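As an illustration of the write path, here is a hedged example of pushing a single log line to Loki's push API with curl. The gateway URL matches the in-cluster address used later in this walkthrough, and the labels are purely illustrative:

curl -s -X POST "http://loki-gateway.loki.svc.cluster.local/loki/api/v1/push" \
  -H "Content-Type: application/json" \
  -d '{
        "streams": [
          {
            "stream": { "service_name": "relay-service", "env": "dev" },
            "values": [ [ "'"$(date +%s%N)"'", "hello from curl" ] ]
          }
        ]
      }'

The distributor validates the request, applies any tenant limits and then fans the stream out to the ingesters.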

Ingester

The ingester service is responsible for persisting data and shipping it to long-term storage (Amazon Simple Storage Service, Google Cloud Storage, Azure Blob Storage, etc.) on the write path, and returning recently ingested, in-memory log data for queries on the read path.

Each log stream that an ingester receives is built up into a set of many 'chunks' in memory and flushed to the storage backend at a configurable interval.

Chunks are compressed and marked as read-only when:

  1. The current chunk has reached capacity (a configurable value).
  2. Too much time has passed without the current chunk being updated.
  3. A flush occurs.
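The capacity and idle thresholds above correspond to ingester settings in the Loki configuration. Here is a minimal sketch in the Helm values layout used later in this article; the values are illustrative rather than recommendations:

loki:
  ingester:
    chunk_target_size: 1572864    # compressed target size (bytes) before a chunk is cut
    chunk_idle_period: 30m        # flush a chunk that has received no new log lines for this long
    max_chunk_age: 2h             # flush a chunk once it reaches this age, regardless of size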

Query Frontend

The query frontend is an optional service providing the querier's API endpoints and can be used to accelerate the read path. When the query frontend is in place, incoming query requests should be directed to the query frontend instead of the queriers. The querier service will still be required within the cluster in order to execute the actual queries.

Querier

The querier service is responsible for executing Log Query Language (LogQL) queries. The querier can handle HTTP requests from the client directly (in 'single binary' mode, or as part of the read path in a 'simple scalable deployment') or pull subqueries from the query frontend or query scheduler (in 'microservice' mode).

Compactor

The compactor service is used by 'shipper stores', such as single store TSDB or single store BoltDB, to compact the multiple index files produced by the ingesters and shipped to object storage into single index files per day and tenant. This makes index lookups more efficient.

To do so, the compactor downloads the files from object storage at a regular interval, merges them into a single file, uploads the newly created index, and cleans up the old files. If you look at the folder structure in your storage account you will see this process in action, as multiple smaller files are merged into single larger ones.

Storage

Loki stores all data in a single object storage backend, such as Amazon Simple Storage Service (S3), Google Cloud Storage (GCS), Azure Blob Storage, among others. This mode uses an adapter called the index shipper (or 'shipper' for short) to store index (TSDB or BoltDB) files in the same way that chunk files are stored in object storage.

Memcached

Caching is an extremely important strategy for managing system performance and scalability. Loki supports caching of index writes and lookups, chunks and query results. For high-performance scenarios you can optionally configure Loki to read from and write to a Memcached instance. You can find out more about the YAML configuration here.
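As a rough sketch of what that configuration can look like, chunk and query-result caching might be wired up along these lines. The exact keys vary between Loki versions and the Memcached host name is an assumption, so treat this as a starting point rather than a drop-in config:

loki:
  chunk_store_config:
    chunk_cache_config:
      memcached_client:
        host: memcached.loki.svc.cluster.local   # assumed Memcached service address
        service: memcache
  query_range:
    results_cache:
      cache:
        memcached_client:
          host: memcached.loki.svc.cluster.local
          service: memcache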

Canary

The Grafana Loki Canary module is a standalone application designed to audit the log-capturing performance of a Grafana Loki cluster. It generates artificial log lines and sends them to Loki, then queries Loki to check for these logs, providing metrics on Loki's log ingestion and query performance. This helps identify issues such as data loss, ensuring Loki is functioning properly. For more detailed insights and setup instructions, you can refer to the official documentation.

Grafana Agent

The Grafana Agent (now Grafana Alloy) is a customised distribution of the OpenTelemetry Collector and is installed by default when running in simple scalable mode. We are going to be running our own vanilla version of the OpenTelemetry Collector, so we will disable the Agent.

Modes

Simple Scalable Mode

Loki can be run in three different modes. These are:

  • Monolithic
  • Simple Scalable
  • Microservices
We are going to be using the Simple Scalable mode as it offers a nice balance between flexibility and ease of maintenance. We will be installing it using this Helm Chart. The default Helm chart deploys the following components in Simple Scalable Mode:

  • Read component (3 replicas)
  • Write component (3 replicas)
  • Backend component (3 replicas)
  • Loki Canary (1 DaemonSet)
  • Gateway (1 NGINX replica)
  • Minio (optional, if minio.enabled=true)
  • Grafana Agent Operator + Grafana Agent (1 DaemonSet) - these are configured to monitor the Loki application itself and are optional.

The topology of a Loki Simple Scalable installation is divided into three conceptual spheres:

  • Read
  • Write
  • Backend

The read sphere contains the Querier and Query Frontend services. The write sphere contains the Distributor and Ingester services. The backend sphere consists of the Gateway, the Scheduler, the Ruler and the Compactor.

Labels, Streams and Chunks

Loki may potentially be ingesting vast volumes of data, which need to be broken down into manageable units and organised somehow. In Loki, the organising principle is that of a stream.

Labels

Labels are a critical concept in Loki as they are the way in which streams are defined. When logs are being ingested, each unique combination of labels and values constitutes a stream. We are not using a tool such as Promtail or Fluent Bit to upload our logs. We therefore have two principal options for inserting labels into our logs:

  1. in our OpenTelemetry Configuration in our code
  2. in our OpenTelemetry Collector

We can use either one of these methods, or we can use both.
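As a hedged sketch of the first option, the resource attributes that will later surface as labels can be set directly in the .NET OpenTelemetry configuration. The service name, namespace and attribute values below are purely illustrative:

using System.Collections.Generic;
using OpenTelemetry.Logs;
using OpenTelemetry.Resources;

builder.Logging.AddOpenTelemetry(options =>
{
    options.SetResourceBuilder(
        ResourceBuilder.CreateDefault()
            .AddService("relay-service", serviceNamespace: "messaging")
            .AddAttributes(new Dictionary<string, object>
            {
                // Surfaces downstream as a label once mapped by the Collector
                ["deployment.environment"] = "production"
            }));
});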

Chunks

Grafana Loki has two main file types: index and chunks.

  • The index is a table of contents of where to find logs for a specific set of labels.
  • The chunk is a container for log entries for a specific set of labels.

Once streams have been defined, the log portion of the stream is saved to a chunk. Chunks have a maximum size and a lifespan. The purpose of the lifespan is to ensure that data is not left waiting too long before being committed to disk. Data is flushed from chunks to disk either when the chunk is full or when its lifespan expires.

Index format

There are two index formats that are currently supported:

  • TSDB

    Time Series Database (TSDB for short) is an index format originally developed by the maintainers of Prometheus for time series (metric) data.

  • BoltDB

    Bolt is a low-level, transactional key-value store written in Go.
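The index format is chosen per period in the schema_config. As a hedged example, a newer deployment might opt for the TSDB shipper rather than BoltDB; the date and prefix below are illustrative:

loki:
  schema_config:
    configs:
      - from: 2024-01-01
        store: tsdb
        object_store: azure
        schema: v13
        index:
          prefix: loki_tsdb_index_
          period: 24h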

Planning

Conceptually, saving and querying application logs may seem to be relatively simple functions - and in some ways they are. However, whilst the process of writing may be conceptually simple, the functional demands placed on logging systems have important architectural implications:

  • logs need to be ingested at huge scale
  • logs need to be stored economically
  • logs need to be parsed and indexed
  • logs need to be queried economically

This means that log aggregation systems have architectures consisting of multiple layers and components to address each of these functions. This, in turn, means that users have to consider these architectures when planning a system implementation. Achieving a scalable and performant system requires a certain amount of familiarity with these architectures as well as an understanding of your own logging flows and business needs. If you are running Loki in a commercial cloud environment, you will need to be aware of the different SKUs and options for storage and scalability and have an awareness of costs.

Before installing Loki you will need to have a clear understanding of your logging volumes and peak ingestion rates. You will need to make sure that you have specified a sufficient number of replicas and have benchmarks for CPU and memory requirements. You will then also need to calculate potential storage costs and define policies for archiving in cold storage and retention. You should place a budget on the Resource Group for your containers and configure alerts. Spikes and surges can occur easily and can result in very significant costs.
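To make this concrete, here is a rough back-of-the-envelope calculation using purely illustrative figures: if your services emit 100 GB of raw logs per day and chunk compression achieves roughly 10:1, you will write in the region of 10 GB of chunks per day, or around 300 GB of object storage per month before the (comparatively small) index is added. A sustained spike to double that volume doubles the bill, which is why budgets and alerts are worth configuring up front.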

Walkthrough

Configuration

In this article we look at instrumenting a .NET web app running in Kubernetes to send logs to Loki. There are a couple of different architectural options for sending application logs to Loki. The two main approaches can be broken down as:

  • use a tool such as Promtail
  • forward logs via the OpenTelemetry Collector

Naturally, we could mix the architecture up a bit further by configuring Promtail (or a similar tool) to send logs to the Gateway. In this article, we will be using the second approach - sending logs via the oTel Collector.

Environment

We will be running our app services, our OpenTelemetry Collector and our Loki instances in an Azure Kubernetes Service (AKS) cluster.

Edition/Mode

We will be running the OSS edition of Loki in Simple Scalable Mode. The simple scalable deployment mode can scale up to a few TBs of logs per day. However, if you exceed this kind of volume, the microservices mode will be a better choice.

Loki on Azure

Installing Loki on Azure can initially be problematic. This is because the documentation in critical areas such as storage configuration tends to focus mostly on AWS and GCP, leaving Azure users to fend for themselves. Naturally, the Grafana documentation is open source, so it is perhaps incumbent on the Azure community to rectify this.

Installation

Before we start, we will need to create an Azure Storage Account. Once the account has been created, we will add a container called loki.
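If you prefer the command line, something along these lines should work, assuming your signed-in account has permissions on the subscription and the storage account (the resource group, account name and region are placeholders):

az storage account create --name <storage-account-name> --resource-group <resource-group> --location <region> --sku Standard_LRS

az storage container create --account-name <storage-account-name> --name loki --auth-mode login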

Next, in Kubernetes, we will create a namespace:

kubectl create namespace loki  

Then we will add the repo into Helm and get the latest version:

  
helm repo add grafana https://grafana.github.io/helm-charts 

helm repo update 
 

Next, we need to configure the storage values for our chart. We will do this by creating a custom values file. Create a new file called az-values.yaml and copy in the following:

  
loki: 
  auth_enabled: false 
 
  common: 
    path_prefix: /var/loki     
 
    ring: 
      kvstore: 
        store: memberlist 
 
  storage: 
    type: 'azure' 
 
  compactor: 
    working_directory: /var/loki/retention 
    shared_store: azure 
    retention_enabled: true   
 
  schema_config:
    configs:
      - from: 2023-01-01            # schema start date - set this at or before your first logs
        store: boltdb-shipper
        object_store: azure
        schema: v11
        index:
          prefix: loki_index_
          period: 24h
 
 
  storage_config: 
    boltdb_shipper: 
      shared_store: azure 
      active_index_directory: /var/loki/index 
      cache_location: /var/loki/cache 
      cache_ttl: 24h 
 
    azure: 
      container_name: loki 
      account_name: <name> 
      use_managed_identity: false 
      account_key: <"key"> 
      request_timeout: 0       
 
    filesystem: 
      directory: /var/loki/chunks 
monitoring: 
  selfMonitoring: 
    grafanaAgent: 
      installOperator: false 
 
 

In the example above we have set the value for "auth_enabled" to false. This is for the purposes of convenience but may not be appropriate for a production environment. Getting Loki up and running on Azure is not always straightforward and it can involve a fair amount of trial and error. You need to make sure that the value of each of the following attributes is set to 'azure':

  • storage.type
  • compactor.shared_store
  • schema_config.configs.object_store
  • storage_config.boltdb_shipper.shared_store

Now we will run our Helm install:

  
helm install -n loki --values az-values.yaml loki grafana/loki  

If all goes well you should see a response like this:
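You can also confirm that the Loki pods have been created and have settled into a Running state:

kubectl get pods -n loki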

Sending Telemetry

We now have our Loki instance up and running - the next thing we need to do is instrument a service so that it sends logs to our oTel Collector.

In our service we have configured oTel logging as follows:

builder.Logging.AddOpenTelemetry(options =>
{
    options
        .SetResourceBuilder(
            ResourceBuilder.CreateDefault()
                .AddService(serviceName))
        .AddConsoleExporter()
        .AddOtlpExporter(exporterOptions =>
        {
            // Export over OTLP/gRPC to the Collector endpoint
            exporterOptions.Endpoint = new Uri(oTelCollectorUrl);
            exporterOptions.Protocol = OpenTelemetry.Exporter.OtlpExportProtocol.Grpc;
        });
});

Next, we need to configure values for our OpenTelemetry Helm Chart. Loki now has full support for OpenTelemetry. This means that we can now send logging telemetry to Loki using the otlphttp exporter (previously we used a dedicated Loki exporter). You can find documentation on this here. As you can see in the configuration below, we are sending our logs to the loki-write service.

For this exercise, we have switched off collection of Kubernetes metrics and logs. This is to make our telemetry streams more manageable, but it may not be an appropriate setting for a production environment. We have also created a Kubernetes service for our Collector, which runs in LoadBalancer mode. The annotations we have applied are specific to Azure AKS and different annotations will be needed for other cloud providers.

You will see that in the config/processors/resource section we have defined a number of attributes. These will be surfaced in Loki as labels. We can then use these for organising and querying our logs. As you can see, we have used the from_attribute keyword for renaming existing attributes and have used the value keyword for creating new attributes.

mode: daemonset  
presets: 
  # enables the k8sattributesprocessor and adds it to the traces, metrics, and logs pipelines 
  kubernetesAttributes: 
    enabled: false 
  # enables the kubeletstatsreceiver and adds it to the metrics pipelines 
  kubeletMetrics: 
    enabled: false 
  # Enables the filelogreceiver and adds it to the logs pipelines 
  logsCollection: 
    enabled: false 
## The chart only includes the loggingexporter by default 
## If you want to send your data somewhere you need to 
## configure an exporter, such as the otlpexporter 
config: 
  exporters:  
    otlphttp: 
      endpoint: "http://loki-write.loki.svc.cluster.local:3100/loki/api/v1/push" 
      tls: 
        insecure: true 

  processors: 
    resource: 
      attributes: 
        - action: insert 
          key: language 
          from_attribute: telemetry.sdk.language 
        - action: insert 
          key: service_name 
          from_attribute: service.name 
        - action: insert 
          key: service_namespace 
          from_attribute: service.namespace 
        - action: insert 
          key: service_version 
          value: 1.000 
        - action: insert 
          key: deployment_environment 
          value: production 
        - action: insert 
          key: loki.resource.labels 
          value: language,service_name,service_namespace,service_version,deployment_environment           
  service: 
    telemetry: 
      logs: 
        level: "debug" 
    pipelines:
      logs:
        exporters: [ otlphttp, debug ]
        processors: [ batch, resource ]
        receivers: [ otlp ]
service: 
  # Enable the creation of a Service. 
  # By default, it's enabled on mode != daemonset. 
  # However, to enable it on mode = daemonset, its creation must be explicitly enabled 
  enabled: true 
 
  type: LoadBalancer 
  loadBalancerIP: ""  
  annotations:    
    service.beta.kubernetes.io/azure-load-balancer-internal: "true" 

We can now add the OpenTelemetry Helm repo and update it:

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts  
 
helm repo update 
 

And deploy the Collector to our cluster:

kubectl create namespace otel-collector  
 
helm install otel-collector open-telemetry/opentelemetry-collector -n otel-collector --values values.yml
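
Before wiring up Grafana, it is worth checking that the Collector pods are running and that the otlphttp exporter is not reporting errors. The label selector below assumes the chart's default labels:

kubectl get pods -n otel-collector

kubectl logs -n otel-collector -l app.kubernetes.io/name=opentelemetry-collector --tail=50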
 

Grafana

We now have our microservices, our oTel Collector and our Loki instance running. The next thing we will do is spin up an instance of Grafana that we can use for viewing and querying our log data. Since the purpose of this exercise is to focus on Loki, we will restrict ourselves to running a simple Grafana set up, following the instructions on this page.

Once Grafana is up and running, we first of all need to connect to our Loki data source. This is just a matter of pointing to the URL for our Loki Gateway service. Since we have installed Loki in a namespace, we enter the fully qualified path: http://loki-gateway.loki.svc.cluster.local.
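If you would rather provision the data source from a file than configure it in the UI, a minimal sketch in Grafana's data source provisioning format looks like this (how the file is mounted depends on how you deploy Grafana):

apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki-gateway.loki.svc.cluster.local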

Grafana Config

Now we can look at our log data. We are going to view logs for our relay service - so we will filter the labels by service name:

Once we click on the Run Query button we will see a bar chart of our logging volumes

as well as a panel with multiple viewing options for our raw log data.

Querying

Like many other log aggregation systems, Loki has its own log query language - LogQL. So let us use it to run some queries.

LogQL queries can be broken down into a set of steps, which form a pipeline. The output from each step is filtered or transformed and then passed to the next step. The steps are separated by the "|" (pipe) character.

The first step is to use a stream selector to specify our stream. The stream selector is a set of label name-value pairs wrapped in curly brackets. When you are querying Loki in the Grafana UI there is a very handy Label browser which displays all of the current label/value combinations for your data source:

It will even generate a selector which you can use to select your stream. So this will be the first part of our query:

{deployment_environment="production",language="dotnet",service_name="relay-service"}

Next, we will specify a Line Filter. The Line Filter will filter the logs either against a string literal or by evaluating a regular expression. We want to find all lines which contain the string "Message cannot be sent". So our filter will simply be:

| ="Message cannot be sent"

So our full query is:

{deployment_environment="production",language="dotnet",service_name="relay-service"} |= "Message cannot be sent"

When we run the query, we can see that Grafana highlights our search text within the results:

One problem we have at the moment is that Grafana is interpreting our log line as just one large string without any internal structure. When we look at the line though, we can see that it is a json string consisting of a number of different fields:

{"body":"Message cannot be sent","traceid":"e515f9f65d4e6afb77fa1b3b2543a920","spanid":"5b5671aad21a81e3","severity":"Information","flags":1,"resources":{"container.id":"809419ef0269c518954681c6b70b10e2c3cc525d2f7946932bb837e461028c18","service.instance.id":"ec9059d9-cac5-44dc-9b60-8774b352b24a","service.name":"relay-service","telemetry.sdk.language":"dotnet","telemetry.sdk.name":"opentelemetry","telemetry.sdk.version":"1.6.0"}} 

What we would like to do is also be able to filter on the json fields. For example, we might want to filter by severity or flags. Luckily, we can do this quite easily. All we need to do is extend our pipeline to tell Grafana to interpret our log as json and then specify a filter. We do this as follows:

| json | <fieldname>="<value>"

We are going to filter on the spanid field. So our full query is now:


{language="dotnet", service_name="relay-service"} |= "Message cannot be sent" | json |spanid="5b5671aad21a81e3" 


Our filter now just returns one result:

We can also take a similar approach to query lines which have been generated in standard format. For this, we just use the logfmt filter instead of the json filter.

If we run this query:

{app_kubernetes_io_instance="loki"} |= `` 

we will get a set of results returned in standard log format:

If we want to filter these results to just return lines where the caller is metrics.go:160, we can use the following query:

{app_kubernetes_io_instance="loki"} |= `` | logfmt | caller="metrics.go:160" 


This gives us the following output:

Our query has worked but the output is quite verbose. To fix this we can output just a subset of fields in our query. To do this we can use the line_format function.

 
{container="loki"} �| logfmt | line_format "{{.org_id}} {{.query}}" 


LogQL is a rich language capable of handling highly complex querying scenarios. We can really only scratch the surface here. You can find plenty of further guidance in the Grafana documentation. This document provides general guidance on querying within Grafana. This document provides guidance on using LogQL.

The Grafana web site also has this very handy tool where you can test your LogQL on a small sample of log data.

Conclusion

Loki is a highly sophisticated, enterprise-grade logging solution. It is designed to handle workloads at huge scale. This means that installing and maintaining Loki instances is a serious commitment that requires careful planning and a solid understanding of Loki architecture. This will require some concentrated up-front effort as well as something of a learning curve. The payoff will be the capability to harness a scalable, robust and world-class logging solution.

Like this article?

If you enjoyed reading this article, why not sign up for the fortnightly Observability 360 newsletter? A wholly independent newsletter dedicated exclusively to observability and read by professionals at many of the world's leading companies.

Get coverage of observability news, products, events and more straight to your inbox in a beautifully crafted and carefully curated email.