Observability-Driven Development: Reduce MTTR With Confidence

John Gallagher

John Gallagher is a Principal Engineer at Dynatrace and an independent programming coach.

Fictional scenario - it's Monday morning and my team is paged.

The new feature we put live last week is broken in production. Customers are annoyed. Support is asking when a fix can be expected.

Logs don't tell us much. Traces are minimal. Three databases to check... and no idea where to start looking.

By the time there's an outage, it's too late. We're scrambling, guessing, and burning time.

Observability-driven development (ODD), an idea from Charity Majors, has changed the way I build software.

My normal workflow used to be build, deploy, then tack on observability as an afterthought.

In ODD, I build observability into the codebase as I go and validate my code behaves as I expect in production.

Only then do I declare the feature done.

Example - CRM Sync

A few years ago I built a small piece of software to sync contacts between a CRM and an email newsletter.

We deployed, monitored for errors, everything seemed fine. Six months later, the client discovered that 17% of contacts weren't syncing.

The system was silently losing data. Contacts weren't flowing, leads weren't getting emails. Nobody knew.

I recently worked with a client who needed a similar CRM integration.

I'd learned my lesson. This time I built with Observability-Driven Development.

Principles

Some ideas behind Observability-Driven Development:

Embrace failure
Instrument incrementally
Close the loop

1. Embrace failure

I can't predict all the ways my code will break in production. Yes, I still try. Yes, I write extensive unit tests. But with complex distributed systems, there will always be strange behaviour in production.

Unit tests in the sandbox of my local laptop catch so many bugs.

However, software in the wild will always fail in new, interesting and unpredictable ways.

Instead of asking "Will this code work in production?", I ask "How will I know if this code breaks in production?".

2. Instrument incrementally

I wouldn't merge a pull request without tests. So I don't merge a pull request without instrumentation.

I add a small slice of observability alongside every feature I ship. Just like tests and documentation.

This small slice of observability helps me answer the question posed in principle 1: "How will I know if this code breaks in production?".

3. Close the loop

After I deploy, I fire up my observability tool and query the data generated.

Is the code running? Are customers using it? Does the feature behave as expected? Does anything look weird?

I poke around and get curious. If I see something unexpected, I drill down, fix the problem, deploy and repeat.

I only declare the feature done when I can see the feature actually working in production.

Steps to Observable Software

To give us structure around these principles, I've come up with a five-step process.

The five steps are:

Question - Form the question you are trying to answer.
Decide Data - Decide on which data is needed to answer the question.
Build - Write code to instrument your app and collect the data.
Use - Use your observability tool - logs, traces, metrics - to try to answer the question.
Reflect - What happened? Did you answer the question successfully? Or was there not enough data? Or is a change in direction needed? Return to the relevant step.

Worked Example

Let's see exactly how this works with the CRM sync example.

Step 1: Question

We want a fitness function for the feature.

How do I know if this feature is delivering value?

How will I know if the code is working?

I came up with three questions:

Question 1: What percentage of contacts have been successfully synced? (Is the feature working correctly?)

Question 2: How long does it take for each contact to sync? (Is the feature too slow?)

Question 3: How many contacts are being synced an hour? (Is the feature being used by customers?)

Step 2: Decide Data

When deciding on the data to collect, I break down each question into four dimensions:

A. Event - what event that happened in the past are we measuring?

B. Filter - what attributes are we filtering by?

C. Group by - what attributes are we grouping by?

D. Values - what values are we plotting?

Let's do this for each question.

Question 1: What percentage of contacts have been successfully synced?

A. Event - HTTP Request Sent

B. Filter - URL = "https://mailing.list/api/contacts" AND Method = POST

C. Group By - Status Code

D. Values - COUNT (each status code's share of the total gives the percentage)
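As a rough sketch of how this breakdown turns into a query, the Ruby below filters a handful of made-up events by URL and method, groups them by status code, and derives a success percentage. The sample events, the short attribute names, the helper methods and the choice to count any 2xx status as a success are all assumptions for illustration, not the output or query language of a real observability tool.

```ruby
# Hypothetical sample of "HTTP Request Sent" events.
EVENTS = [
  { "event.name" => "http.request.sent", "url" => "https://mailing.list/api/contacts", "method" => "POST", "status" => 200 },
  { "event.name" => "http.request.sent", "url" => "https://mailing.list/api/contacts", "method" => "POST", "status" => 200 },
  { "event.name" => "http.request.sent", "url" => "https://mailing.list/api/contacts", "method" => "POST", "status" => 400 },
  { "event.name" => "http.request.sent", "url" => "https://mailing.list/api/lists",    "method" => "GET",  "status" => 200 }
].freeze

# A + B. Event and filter: keep only POSTs to the contacts endpoint.
def contact_posts(events)
  events.select do |e|
    e["event.name"] == "http.request.sent" &&
      e["url"] == "https://mailing.list/api/contacts" &&
      e["method"] == "POST"
  end
end

# C + D. Group by status code and count the events in each group.
def counts_by_status(events)
  contact_posts(events).group_by { |e| e["status"] }.transform_values(&:size)
end

# Percentage of filtered requests that returned a 2xx status.
# Returns nil when no request matches the filter.
def success_rate(events)
  counts = counts_by_status(events)
  total = counts.values.sum
  return nil if total.zero?

  ok = counts.select { |status, _count| (200..299).cover?(status) }.values.sum
  (100.0 * ok / total).round(1)
end

puts counts_by_status(EVENTS).inspect
puts success_rate(EVENTS)
```

Grouping by status code rather than reducing straight to a single number keeps the failure modes visible: a 400 and a 503 would both lower the percentage, but they point to very different problems.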

Question 2: How long does it take for each contact to sync?

A. Event - Contact Synced

B. Filter - N/A

C. Group By - N/A

D. Values - P95(duration)
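To make P95(duration) concrete, here is a minimal sketch of a percentile calculation over some invented durations. It uses the nearest-rank method, which is an assumption: observability tools often interpolate or use approximate sketches instead, so their figures may differ slightly. The `percentile` helper and the sample data are hypothetical.

```ruby
# Nearest-rank percentile: the smallest sample such that at least
# `pct` percent of all samples are less than or equal to it.
def percentile(values, pct)
  raise ArgumentError, "no samples" if values.empty?
  raise ArgumentError, "pct must be within 1..100" unless (1..100).cover?(pct)

  sorted = values.sort
  rank = (pct * sorted.size / 100.0).ceil # 1-based rank into the sorted list
  sorted[rank - 1]
end

# Hypothetical "Contact Synced" durations in seconds.
DURATIONS = [0.4, 0.5, 0.5, 0.6, 0.6, 0.7, 0.7, 0.8, 0.9, 3.2].freeze

p95  = percentile(DURATIONS, 95)
mean = DURATIONS.sum / DURATIONS.size

puts "P95:  #{p95}s"
puts "Mean: #{mean.round(2)}s"
```

With this sample the mean stays under a second while the P95 lands on the slow outlier, which is the usual reason for preferring a high percentile when asking whether a feature is too slow.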

Question 3: How many contacts are being synced an hour?

A. Event - Contact Synced

B. Filter - N/A

C. Group By - N/A

D. Values - COUNT
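A count per hour is a COUNT bucketed by time. The sketch below shows one way to do that in plain Ruby by truncating each timestamp to the start of its hour. The `timestamp` attribute, the sample events and the `count_per_hour` helper are assumptions for illustration; a real observability tool would do this bucketing for you.

```ruby
require "time"

# Hypothetical "Contact Synced" events with ISO 8601 timestamps.
SYNCED_EVENTS = [
  { "event.name" => "app.contact.synced", "timestamp" => "2025-01-06T09:05:00Z" },
  { "event.name" => "app.contact.synced", "timestamp" => "2025-01-06T09:40:00Z" },
  { "event.name" => "app.contact.synced", "timestamp" => "2025-01-06T10:15:00Z" },
  { "event.name" => "app.contact.synced", "timestamp" => "2025-01-06T10:20:00Z" },
  { "event.name" => "app.contact.synced", "timestamp" => "2025-01-06T10:59:00Z" },
  { "event.name" => "app.contact.synced", "timestamp" => "2025-01-06T12:01:00Z" }
].freeze

# Count events per hour, keyed by the UTC start of each hour.
def count_per_hour(events)
  events
    .select { |e| e["event.name"] == "app.contact.synced" }
    .map { |e| Time.iso8601(e["timestamp"]).utc }
    .group_by { |t| Time.utc(t.year, t.month, t.day, t.hour) }
    .transform_values(&:size)
    .sort
    .to_h
end

count_per_hour(SYNCED_EVENTS).each do |hour, count|
  puts "#{hour.iso8601}  #{count}"
end
```

Hours with no events simply do not appear in the result. When answering "is the feature being used?", those gaps are worth looking at too.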

All Data Required

Summarising all this, we need:

HTTP Request Sent Event
- URL
- Method
- Status Code
Contact Synced Event
- Duration

Step 3: Build Instrumentation

I implemented the business logic for the feature using TDD:

class SyncContact
  def call(contact)
    # ... business logic here ...
    # Faraday is an HTTP client for Ruby
    client = Faraday.new("https://mailing.list") do |conn|
      conn.adapter Faraday.default_adapter
    end
    client.post("/api/contacts", contact)
  end
end

First event needed is "HTTP Request Sent".

Attributes needed are URL, Method, Status Code and Duration.

Wrote a failing test:

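class TestInstrumentationMiddleware < Minitest::Test
  def test_records_http_request_sent_event
    contact = { email: "jane@example.com" } # sample payload (hypothetical)

    # capture_logs is a test helper (not shown) that returns the structured
    # log entries emitted while the block runs
    logs = capture_logs do
      SyncContact.new.call(contact)
    end

    log = logs.find { |msg| msg["event.name"] == "http.request.sent" }
    refute_nil log, "expected an http.request.sent event to be logged"
    assert_equal "https://mailing.list/api/contacts", log["http.request.url"]
    assert_equal "post", log["http.request.method"]
    assert_equal 200, log["http.response.status"]
    assert_kind_of Numeric, log["http.response.duration"]
  end
end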
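The test relies on a `capture_logs` helper that isn't shown. As a hedged sketch of what such a helper could look like, the code below assumes the application logs one JSON object per line through a logger whose output stream can be swapped. `StructuredLogger`, `APP_LOGGER` and this particular `capture_logs` are hypothetical stand-ins, not the helper used in the original project.

```ruby
require "json"
require "stringio"

# Hypothetical structured logger: writes one JSON object per line,
# merging the message with any extra attributes.
class StructuredLogger
  attr_accessor :io

  def initialize(io = $stdout)
    @io = io
  end

  def info(message, attributes = {})
    entry = { "level" => "info", "message" => message }.merge(attributes)
    @io.puts(JSON.generate(entry))
  end
end

# Hypothetical application-wide logger.
APP_LOGGER = StructuredLogger.new

# Redirect the logger to an in-memory buffer while the block runs,
# then return every log line parsed back into a Hash.
def capture_logs
  original = APP_LOGGER.io
  buffer = StringIO.new
  APP_LOGGER.io = buffer
  yield
  buffer.string.each_line.map { |line| JSON.parse(line) }
ensure
  APP_LOGGER.io = original # always restore the real output
end

logs = capture_logs do
  APP_LOGGER.info("Contact synced", "event.name" => "app.contact.synced", "duration" => 0.42)
end
puts logs.first["event.name"]
```

Parsing the lines back into hashes means the test asserts on the same attributes that will later be queried in production, rather than on the wording of a log message.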

Wrote the implementation:

 class SyncContact
   def call(contact)
     # ... business logic here ...
     client = Faraday.new("https://mailing.list") do |conn|
+      conn.use InstrumentationMiddleware # Faraday expects the class, not an instance
       conn.adapter Faraday.default_adapter
     end
     client.post("/api/contacts", contact)
   end

+  class InstrumentationMiddleware < Faraday::Middleware
+    def call(env)
+      started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+      @app.call(env).on_complete do |response_env|
+        duration_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round
+        # logger is assumed to be a structured logger that accepts a message plus attributes
+        logger.info(
+          "Request to #{response_env[:url]} completed with status code #{response_env[:status]} and duration #{duration_ms}ms",
+          "event.name" => "http.request.sent",                  # <== HTTP Request Sent Event
+          "http.request.url" => response_env[:url].to_s,        # <== URL
+          "http.request.method" => response_env[:method].to_s,  # <== Method
+          "http.response.status" => response_env[:status],      # <== Status Code
+          "http.response.duration" => duration_ms               # <== Duration
+        )
+      end
+    end
+  end
 end

Test passes.

HTTP Request Sent Event (done)
- URL
- Method
- Status Code
Contact Synced Event (still to do)
- Duration

Next event needed is "Contact Synced".

Wrote a failing test:

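class TestContactSyncedEvent < Minitest::Test
  def test_records_contact_synced_event
    contact = { email: "jane@example.com" } # sample payload (hypothetical)

    logs = capture_logs do
      SyncContact.new.call(contact)
    end

    log = logs.find { |msg| msg["event.name"] == "app.contact.synced" }
    refute_nil log, "expected an app.contact.synced event to be logged"
    assert_kind_of Numeric, log["duration"]
  end
end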

Wrote the implementation:

 class SyncContact
   def call(contact)
+    start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
     # ... business logic here ...
     client = Faraday.new("https://mailing.list") do |conn|
       conn.use InstrumentationMiddleware
       conn.adapter Faraday.default_adapter
     end
     client.post("/api/contacts", contact)
+    logger.info(
+      "Contact synced",
+      "event.name" => "app.contact.synced",                                       # <== Contact Synced Event
+      "duration" => Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time  # <== Duration (seconds)
+    )
   end

   class InstrumentationMiddleware < Faraday::Middleware
     def call(env)
       started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
       @app.call(env).on_complete do |response_env|
         duration_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round
         logger.info(
           "Request to #{response_env[:url]} completed with status code #{response_env[:status]} and duration #{duration_ms}ms",
           "event.name" => "http.request.sent",
           "http.request.url" => response_env[:url].to_s,
           "http.request.method" => response_env[:method].to_s,
           "http.response.status" => response_env[:status],
           "http.response.duration" => duration_ms
         )
       end
     end
   end
 end

Test passes.
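A note on measuring durations like the one in the "Contact Synced" event: a monotonic clock is generally the safer choice, because wall-clock time can jump when the system clock is adjusted. The `measure` helper below is a hypothetical sketch of that idea, not code from the project.

```ruby
# Run the block and return [result, elapsed_seconds].
# CLOCK_MONOTONIC is not affected by wall-clock adjustments, so the
# difference between two readings is a non-negative Float.
def measure
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  [result, elapsed]
end

result, duration = measure do
  sleep 0.05 # stand-in for the real sync work
  :synced
end

puts "result=#{result} duration=#{duration.round(3)}s"
```

One pitfall worth knowing: in Ruby, subtracting an Integer from a `Time` yields another `Time` rather than a number of seconds, so an assertion such as `assert_kind_of Numeric` would fail on it. Subtracting two clock readings avoids that.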

HTTP Request Sent Event (done)
- URL
- Method
- Status Code
Contact Synced Event (done)
- Duration

Gathered the data. Deployed. Time to query the data.

Step 4: Use

After deploying, I queried the logs and created some graphs.

My answers:

Q1: What percentage of contacts have been successfully synced? - 93%
Q2: How long does it take for each contact to sync? - P95 of duration is 1.3 seconds
Q3: How many contacts are being synced an hour? - ~10-50

Step 5: Reflect

Finally, I reflect on the results.

Answer 1 is concerning!

Why aren't 100% of contacts being synced successfully?

Looks weird. Let's go back to step 1 and form a new question.

Cycle 2 and Beyond

After a few more cycles, here's what I discovered:

7% of contacts were failing to sync
These contacts were already unsubscribed in the mailing list software
The HTTP status code was 400 but the code was failing silently

I wrote a failing test for the defect, fixed the code, deployed.
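The fix itself isn't shown here. As one possible shape for it, the sketch below stops treating every response as a success and raises on any non-2xx status, so a 400 can no longer pass unnoticed. `SyncFailed` and `ensure_synced!` are hypothetical names, and a real fix might equally handle unsubscribed contacts as a separate, expected case.

```ruby
# Hypothetical error raised when the mailing list API rejects a contact.
class SyncFailed < StandardError
  attr_reader :status

  def initialize(status, body)
    @status = status
    super("Contact sync failed with HTTP #{status}: #{body}")
  end
end

# Return true for a 2xx status; raise instead of failing silently otherwise.
def ensure_synced!(status, body = "")
  return true if (200..299).cover?(status)

  raise SyncFailed.new(status, body)
end

ensure_synced!(200)

begin
  ensure_synced!(400, "contact is unsubscribed")
rescue SyncFailed => e
  puts e.message
end
```

Faraday also ships a `:raise_error` response middleware that raises on 4xx and 5xx responses, which may be a simpler route when the HTTP client is already Faraday.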

The percentage of contacts successfully synced increased to 100%.

Everything else looked good.

Finally, I had confidence my code worked in production and I declared the feature done.

Using ODD and this iterative five-step process, I've discovered errors, bugs, strange race conditions, upstream issues and much more, all whilst the feature is fresh in my head and I can fix issues immediately.

Action: Build Your Next Feature With ODD

Pick one feature you're building this week. Instead of treating deployed code as done, build with ODD.

Follow the five steps:

Step 1: Question: Create questions that define a fitness function for this feature.

Step 2: Decide: Choose data to answer these questions.

Step 3: Build: Build instrumentation alongside each slice of value.

Step 4: Use: Query the instrumentation once you've deployed.

Step 5: Reflect: What's the answer? Do you need more data? Better instrumentation? Or is there a new question?

Free course offer

John is currently offering a course to teach engineers observability.

First module is FREE
Practical and hands on - minimal theory
Fix a real defect in a real app
Using Ruby on Rails, but little coding knowledge required
Takes 2-3 hours

Find out more

Resources

Observability Engineering - Achieving Production Excellence by Charity Majors, Liz Fong-Jones and George Miranda: the canonical book on Observability.

Discover more on John's Linktree

From the web

Articles we like from observability web sites and blogs

The Art of Kubernetes Intrusion Detection
Fatih Koç blog Oct 22, 2025
If you are an SRE, when an outage happens you will know about it pretty quick. With security breaches the picture is rather less clear as, by their nature, they are designed to go undetected. Intrusion detection therefore is often based on a mixture of tools designed to spot unusual spikes, suspicious patterns or failed logon attempts.

Fatih Koç argues that one of the major difficulties in identifying attacks is correlating signals across multiple sources such as Falco, Prometheus and Kubernetes audit logs. He outlines a strategy for extracting relevant data from each of these sources and pulling it together into a single observability dashboard.

Grafana Use a Canary to Fight Intruders
Grafana blog Sept 16, 2025
The first line of cyber defence is normally at the perimeter - preventing attackers from entering your network in the first place. The next line of defence is intrusion detection. This can often take the form of anomaly detection using a variety of heuristics.

There are also some more creative possibilities, such as the canary solution adopted by Grafana. Just as the canary in the coalmine sings to alert underground workers to the presence of toxic gases, Grafana’s canary was designed to alert them to the possible presence of intruders in their domain.

Acting On Impulse - How Airbnb Do Load Testing
Airbnb Tech blog June 10, 2025
Load testing can be simple in theory but in modern distributed architectures, it involves a lot more than throwing requests at an individual service. This article on the Airbnb engineering blog looks at how the company’s engineers use the Impulse load-testing framework to handle a number of more complex requirements such as dependency mocking and managing messaging and asynchronous calls.

Unfortunately, Impulse is currently an internal Airbnb framework, so you won’t be able to get your hands on it. Even so, the article provides a valuable blueprint for tackling advanced, real-world load testing scenarios.

It’s eBPF for Windows!
Scorpio Software blog Mar 21, 2025
It’s an announcement that might have seemed unthinkable not long ago, but the porting of the revolutionary eBPF technology to Windows is now a reality. The ability to bring safe programmability to the kernel has resulted in enormous gains in fields such as security, networking and observability for Linux hosts, so applying the same principle to the Windows ecosystem is obviously an attractive proposition. It is not, though, without its own difficulties. There were a lot of hurdles to overcome and, inevitably, given the differences in OS architecture, this is not a full-fidelity replica of the Linux implementation.

This possibly foundational article by Pavel Yosifovich guides you through the steps involved in boldly going where few have gone before and creating your first eBPF program for Windows. One paragraph in the article begins with the sentence “this is where things get a bit hairy” - for some that will likely be a challenge rather than a deterrent. This may not be cooking up nuclear fusion in your bedroom, but it does feel pretty radical.

Inside The C++ Black Box
Elastic blog Mar 10, 2025
As well as rolling out their Open AI observability solution, Elastic have also been very active within the OpenTelemetry project. C++ has a reputation for being something of a fearsome foe for observability practitioners. In this article on the Elastic blog, Haidar Braimaanie dons his protective gear and attempts to tame the beast with a soothing dose of OpenTelemetry instrumentation.

Unlike languages built on frameworks such as .NET, C++ does not have a standardized runtime environment that supports dynamic instrumentation across all platforms and compilers. C++ also uses a variety of build systems such as Makefiles and CMake, so implementing instrumentation can be difficult and error-prone. In the article, Haidar looks at adding OpenTelemetry support to a C++ application running on Ubuntu 22.04. He also includes sample code for instrumenting the project with database spans and then observing the application in APM.

After reading this article you may want to give the C++ developer in your life a hug.

Brendan Gregg - His Latest Flame
Brendan Gregg Blog Dec 19, 2024
Even if you are not familiar with the name of Brendan Gregg, you are almost certainly familiar with the fruits of his labours. Brendan is the creator of the Flame Graph - one of the most important and iconic visualisations in the observability toolkit. We featured the Flame Graph in our recent tribute to the work of UX designers in the observability arena - but you should also visit Brendan’s web site.

Brendan’s latest innovation is the AI Flame Chart. This is an evolution of the original flame graph and its ambitious aim is to help reduce the vast financial and environmental costs entailed in the use of LLMs. Whereas the original flame graph was focused on CPU cycles, the latest generation sets its sights on reducing GPU load. The article discusses the considerable complexities involved in mapping GPU programs back to their corresponding CPU stacks. The names of some of the instruction sets look intimidating to the uninitiated but the basic concept of the graph is quite simple - the wider the bar, the more resource it consumes.

System Initiative - IaC Reinvented!
System Initiative Blog Dec 19, 2024
If you have ever had to grapple with a 3,000 line Helm chart to deploy your observability infrastructure, you may be forgiven for thinking that there must be a better way to do this. Whilst YAML has a certain formal elegance, its syntax struggles to express the architectures and relationships embedded in highly complex systems.

Whilst Pulumi have tackled this problem by enabling the use of high level programming languages for IaC, System Initiative are taking a fundamentally more radical approach. Their goal is nothing other than completely reinventing IaC from the ground up. The blog article for the launch of the product is an incredibly ambitious statement of intent. The terms ‘game changer’ and ‘paradigm shift’ tend to be thrown around somewhat casually, but this might be a case where their usage is appropriate.

So, what are they proposing? Well, System Initiative is IaC without the code. It is a kind of digital canvas where you manipulate digital twins of your systems. Is the future here or is this the Platform Engineering equivalent of science fiction? Read the article and decide for yourself!

How Zomato Souped Up Their Metrics With VM
Zomato Blog Sep 14, 2024
Zomato is a restaurant aggregator and food delivery service that generates vast volumes of metrics. As their company grew, they adopted a Prometheus/Thanos-based architecture - running some 144 Prometheus servers. As metrics volumes continued to skyrocket, even this architecture started to creak and the Zomato SRE team began the search for an alternative solution.

In this article on the Zomato blog, the team discuss why they opted to migrate to VictoriaMetrics and describe a number of features of the system which enable them to achieve better performance, lower costs and greater scalability.

The technical challenges were pretty daunting - the project involved migrating over 800 dashboards, 300 microservices and 2.2 billion active time series. We would commend this article not just for its technical insights but also for taking a warts-and-all approach in documenting some of the technical limitations of the VM solution.

Obirdability - Fowl Play With Grafana!!
Grafana Blog Jul 29, 2024
Grafana dashboards have been put to all sorts of uses over the years - for everything from space missions to monitoring milk production. In this fun but highly informative article Ivana Huckova and Sven Grossman walk us through building an observability system for bird song. Whilst this might sound slightly quirky, the techniques could be applied to all manner of applications which need to record and analyse audio inputs.

The article is a great showcase for a number of Grafana capabilities - including installing Alloy on a Raspberry Pi and adding context to dashboard data by dynamically querying sources such as Wikipedia and the Open-Meteo weather information service.

Internal Observability at Uber
Uber Blog Jun 10, 2024
Stories about Uber architecture always seem to be interesting, not least because they always involve technology at huge scale - such as this trillion record migration from DynamoDB. This article, however, is actually interesting on a number of levels. As well as being of technical interest, it also provides some fascinating insight into internal team topologies and management processes - which are also fundamentally important aspects of managing observability at scale. Whilst most organisations will only operate at a fraction of Uber’s scale, every organisation is seeking to minimise costs and improve service to users, and the article provides a number of insights which would be of interest to most observability practitioners.

Observability Principles for ML Models
Datadog Blog May 16, 2024
A survey carried out by McKinsey in 2021 found that 57% of respondents were already using Machine Learning to support at least one business function. ML is no longer a niche concern but is becoming a core component of development and CI/CD practices. As this post from the Datadog blog notes, the efficacy of ML models will inevitably degrade over time, so monitoring their performance and reliability is critical. The article really drives home the point that ML is a domain with its own specific behaviours, and effective monitoring requires building out new processes, metrics and even infrastructure to cover issues such as Data Drift, Prediction Drift and Concept Drift. Whilst the article does use some specialist terms, it is a highly readable and practical guide to the subject of ML monitoring.