Observability People

Diana Todea

1. Tell us a bit about your role.

I’m Diana Todea and I’m a Senior Site Reliability Engineer. I have worked in IT for the last 14 years, and as a DevOps engineer and SRE for the last 4 years.

2. How did you get into observability?

In a previous role I was an Observability SRE at Elastic. I got to learn all the concepts and play with o11y platforms and tools.

3. Which observability platform/tooling do you use?

I have used ELK in the past, and now I’m back to using Grafana and Prometheus, and mainly open source tools and platforms.

4. What are your strategies for managing observability at scale?

It’s one of the biggest challenges in infrastructure. I think it’s important to keep costs down by really combing through the cloud bill and choosing platforms which are open source and self-managed. Basic alerting and monitoring can be implemented without fancy o11y tools or platforms.

5. What is your experience with OpenTelemetry?

Keep different layers of monitoring. Let’s say you have a K8s environment and want to know how things are evolving and keep a step ahead of any production incidents. It’s good to manage o11y at scale with dashboards and alerting coming from different tools and platforms. You can divide your alerting capabilities so that one layer keeps an eye on the internal infrastructure and another layer measures the external or customer-facing infrastructure. In this way, you make sure your tools are performant and you keep ahead of any major incidents.

6. How do you keep up with the rapidly changing observability landscape?

It really depends on each organization’s setup. There are large organizations that have myriad tools and platforms, and small to medium organizations that, although their setup is modest, are forced to scale up their tooling just because of their projects’ limitations. My personal take is to try to have fewer tools and platforms and, instead of deploying a plethora of alerts and dashboards, stick with the bare minimum. I mean those alerts, dashboards and SLOs that will be handy in case of a major incident at 3AM and that will help the engineer drill down to the root cause without complicating the investigative path.

7. What is your experience with OpenTelemetry?

In the last year I have bumped into OTEL like never before. I’ve recently joined an o11y SME team to help create the very first OTEL certificate exam. This is an ongoing process, which helps me stay up to date with OTEL’s documentation progress. I have also met a lot of people implementing OTEL in their orgs. OTEL is on everyone’s lips. I’m planning on introducing it at my current organization, and hopefully demoing it will get everyone excited.

Personally, I think it’s a must-have o11y implementation, which doesn’t exclude pairing it with other o11y tools and platforms.

8. What do you think are the most important current trends in observability?

I really like the self-made projects some developers build for their specific use cases and then share with the community. There are some really exciting o11y projects that are very applicable day to day and could become widespread. Then there are the synthetic monitoring/RUM projects which, although they are not as fresh on the market, could use more implementation and adoption, if used properly. Here I’m hinting at all the proactive o11y trends (SLOs and dashboards as well) that can be taken advantage of before incidents happen.

9. Has AI changed your role?

Not yet. But like all my peers, I have a keen interest in it. Over the past few months I have played with some AI assistants and o11y projects. There is still a lot of improvement to be made, and a lot of self-training and upskilling to do, but people will naturally embrace it and start learning and using it more and more.

10. How do you keep up with the rapidly changing observability landscape?

I’m currently implementing some o11y guidelines and best practices in my daily job. On the side, I’m participating in a lot of tech conferences, where I meet a lot of peers with very interesting projects and use cases. My free time is very limited, so I like to use it well. On my to-do list is to play as much as possible with AI and learn some integrations and use cases with o11y.

11. Your favourite tool (can be any IT tool, not just observability-related)?

This probably sounds like a cliché, but I’ve been playing a lot with the Claude AI assistant lately. I’ve moved from ChatGPT to Claude and I’m currently assessing its capabilities, just for personal use. I cannot say it’s my favorite, but right now I’m using it often. Besides this, as I said earlier, I have lately been more focused on open source tooling, and hopefully I can become more advanced and proficient in using it.

12. What do you do when you are not in front of a computer?

I play with my toddlers, who are full of energy. I also participate in tech conferences and set aside some material for tech blogs, which can hopefully be published soon.

13. Anything else you would like to add?

I would recommend that anyone, techy or not, stay interested and lead with their instinct. Try out as many tools as possible, and build that home lab you always wanted, in any technology. Just do it. Practice gets you on your feet faster than any degree or theoretical course.

Diana Todea is a Senior Site Reliability Engineer at EQS Group. She is also a regular speaker at tech events as well as being an OpenTelemetry SME and advocate for women in technology.

Diana also has a GitHub repo where you can find slide decks for her talks as well as keep up with her latest projects.

If you would like to feature in Observability People, please get in touch!

From the web

Articles we like from observability web sites and blogs

Obirdability - Fowl Play With Grafana!!
Grafana Blog Jul 29, 2024
Grafana dashboards have been put to all sorts of uses over the years - for everything from space missions to monitoring milk production. In this fun but highly informative article Ivana Huckova and Sven Grossman walk us through building an observability system for bird song. Whilst this might sound slightly quirky, the techniques could be applied to all manner of applications which need to record and analyse audio inputs.

The article is a great showcase for a number of Grafana capabilities - including installing Alloy on a Raspberry Pi and adding context to dashboard data by dynamically querying sources such as Wikipedia and the Open-Meteo weather information service.
Internal Observability at Uber
Uber Blog Jun 10, 2024
Stories about Uber architecture always seem to be interesting, not least because they always involve technology at huge scale - such as this trillion record migration from DynamoDB. This article, however, is actually interesting on a number of levels. As well as being of technical interest, it also provides some fascinating insight into internal team topologies and management processes - which are also fundamentally important aspects of managing observability at scale. Whilst most organisations will only operate at a fraction of Uber’s scale, every organisation is seeking to minimise costs and improve service to users, and the article provides a number of insights which would be of interest to most observability practitioners.
Observability Principles for ML Models
Datadog Blog May 16, 2024
A survey carried out by McKinsey in 2021 found that 57% of respondents were already using Machine Learning to support at least one business function. ML is no longer a niche concern but is becoming a core component of development and CI/CD practices. As this post from the Datadog blog notes, the efficacy of ML models will inevitably degrade over time, so monitoring their performance and reliability is critical. The article really drives home the point that ML is a domain with its own specific behaviours, and effective monitoring requires building out new processes, metrics and even infrastructure to cover issues such as Data Drift, Prediction Drift and Concept Drift. Whilst the article does use some specialist terms, it is a highly readable and practical guide to the subject of ML monitoring.
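To make the idea of Data Drift a little more concrete, here is a minimal sketch of one common way to quantify it: the Population Stability Index (PSI), which compares the distribution of a feature in production against a baseline sample. This is not taken from the Datadog article; the function name `psi`, the bin count and the floor value are all assumptions chosen for illustration.

```python
import math
from typing import Sequence


def psi(baseline: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of one numeric feature.

    Hypothetical helper: bins are equal-width and derived from the baseline.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range are clamped into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = proportions(baseline), proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    training = [float(v) for v in range(100)]
    production = [v + 50.0 for v in training]  # made-up shifted sample
    print(f"PSI: {psi(training, production):.2f}")
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the right threshold depends on the model and the feature, so treat it as a starting point rather than a standard.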
Observing Observe with Observe
Observe Apr 13, 2024
It sounds like it could be a sub-plot in the film Inception, but this is a really interesting article from the Observe blog on how they use an instance of their Observe system to monitor their Observe cloud platform. Observe not only have to support fast reads for complex user queries, they also have to support ingesting one petabyte of telemetry per day. As you can see from the above diagram, Kafka and Snowflake form two of the pillars of the backend architecture. This three-part series offers a fascinating insight into Observe’s own internal observability strategy as well as being a great exemplar of the eat your own dog food principle. This is an article which is of great value to anybody with an interest in large-scale observability architectures.
The $1m Line of Code
InfoQ Apr 5, 2024
Most of us have experienced the anguish of bill shock at some point. Being hit with a huge bill for mobile roaming charges on return from your holiday or getting a penalty notice for an inadvertent motoring infringement that happened weeks back. Those are just small pinpricks though, compared to the 50,000 volts of financial burn felt by companies mentioned in this transcript of a scintillating talk by Erik Peterson, CEO of CloudZero. He argues, persuasively, that engineering decisions are buying decisions. In the case mentioned in the headline, a decision to turn on one section of debug code led to vast volumes of logs being emitted and racking up over $1m in costs.
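As a back-of-the-envelope illustration of how a single logging decision can turn into a buying decision, the sketch below estimates ingestion cost from event rate and event size. Every number in it is hypothetical and not taken from the talk; `log_ingest_cost` is an invented helper, and real pricing models usually add retention, indexing and egress charges on top.

```python
SECONDS_PER_DAY = 86_400
BYTES_PER_GB = 1_000_000_000  # decimal gigabytes, an assumption about billing units


def log_ingest_cost(
    events_per_second: float,
    bytes_per_event: float,
    price_per_gb: float,
    days: float = 30,
) -> float:
    """Estimate the ingestion cost of a log stream over a number of days."""
    total_bytes = events_per_second * bytes_per_event * SECONDS_PER_DAY * days
    return total_bytes / BYTES_PER_GB * price_per_gb


if __name__ == "__main__":
    # Hypothetical debug statement: 50,000 events/s at 2 kB each, $0.50 per GB.
    cost = log_ingest_cost(50_000, 2_000, 0.50, days=30)
    print(f"Estimated 30-day ingestion cost: ${cost:,.0f}")
```

The point of the arithmetic is that the cost scales linearly with each factor, so a debug line on a hot code path multiplies quietly until the bill arrives.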
An Engineer's Personal Retrospective
CEP Mar 9, 2024
This is a really engaging blog post by Infrastructure Engineer Jack Lindamood, where he reviews nearly every infrastructure decision he made over four years working at a start-up. Each choice is graded with a Regret, an Endorse or an occasional Unsure. Whilst not explicitly observability-related, it will, however, have resonance for any engineer forced to make technological choices (which is probably all of us). The article contains much distilled wisdom and some strong opinions, as well as general observations on the challenges and trade-offs faced by infrastructure engineers.
Finding relationships in your data with embeddings
Medium Feb 8, 2024
The RAG pattern has really gained traction over the past year as it allows enterprises to leverage the power of LLMs to gain insights into their own data. This is a fascinating (and occasionally technical) article which details how Incident IO used vector embeddings to mine through their data and discover related incidents. The article explains the techniques involved with great clarity and provides really helpful advice on creating embeddings to find hidden patterns in your own data.
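For readers new to the idea, the core of "finding related incidents" can be sketched as ranking items by the cosine similarity of their embedding vectors. The snippet below is only an illustration and not the approach described in the article: the incident IDs and the three-dimensional vectors are made up, whereas in practice the vectors would come from an embedding model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_related(
    query_id: str, embeddings: dict[str, list[float]], top_k: int = 2
) -> list[tuple[str, float]]:
    """Return the top_k other incidents, ranked by similarity to query_id."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in embeddings.items()
        if other_id != query_id
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical toy embeddings, invented purely for this example.
INCIDENT_EMBEDDINGS = {
    "INC-1": [0.9, 0.1, 0.0],
    "INC-2": [0.8, 0.2, 0.1],
    "INC-3": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    for incident_id, score in most_related("INC-1", INCIDENT_EMBEDDINGS):
        print(incident_id, round(score, 3))
```

A linear scan like this is fine for a few thousand incidents; larger collections would typically use a vector index instead.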
How Chick-fil-A Run 2,800 Edge Clusters
Medium Dec 29, 2023
When you think of large corporations pushing the technology envelope, Chick-fil-A might not be the first name to come to mind. However, the highly distributed nature of their infrastructure presents massive observability challenges, which they have met with some very impressive engineering. The scale of their task is daunting - 2,800 Edge Kubernetes clusters, tens of thousands of IoT devices and billions of MQTT messages each month. This is a really fascinating article on managing IoT observability at scale.
Production-Ready Observability Platform for AI Systems
Medium Nov 3, 2023
In this blog article, Bijit Ghosh of Deutsche Bank discusses best practices for observability across the full AI system lifecycle. He composes a custom system which knits together a range of technologies including structlog, Flask, Prometheus and Kibana as well as AI-specific tools such as MLFlow and CausalML. It’s a comprehensive article which exhibits a clear understanding of both observability and AI technologies.
Infrastructure Monitoring with the TIG Stack
CNCF Blog Sept 21, 2023
A great example of managing the complexities of Observability engineering. Jay Taylor from InfluxDB builds out a solution using the Telegraf, InfluxDB, Grafana stack.
Deploying a Kubernetes monitoring stack
rtfm July 23, 2023
An in-depth look at monitoring K8S with the increasingly popular VictoriaMetrics platform. This follows an end-to-end process from crafting your own Helm chart to configuring alert rules.
"You're overpaying for OpenTelemetry's verbosity"
rtfm Oct 10, 2023
This has really raised a few eyebrows. A forensic analysis by Nikolay Sivko of coroot on how just a few OpenTel meta tags can potentially explode your ingestion fees.