I’m Diana Todea and I’m a Senior Site Reliability Engineer. I have worked in IT for the last 14 years, the last 4 of them in DevOps and SRE roles.
In a previous role I was an Observability SRE at Elastic. I got to learn all the concepts and play with o11y platforms and tools.
I have used ELK in the past, and now I’m back to using Grafana and Prometheus, and mainly open source tools and platforms.
It’s one of the biggest challenges in infrastructure. I think it’s important to narrow down the costs by really scrutinizing the cloud bill and choosing platforms that are open source and self-managed. Basic alerting and monitoring can be implemented without fancy o11y tools or platforms.
Keep different layers of monitoring. Let’s say you have a K8s environment and want to know how things are evolving and stay a step ahead of any production incidents. In that case, it’s good to manage o11y at scale with dashboards and alerting coming from different tools and platforms. You can divide your alerting so that one layer keeps an eye on the internal infrastructure and another layer measures the external or customer-facing infrastructure. In this way, you make sure your tools are performant and you stay ahead of any major incidents.
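As an illustration only, the two-layer idea can be sketched in a few lines of Python. This is not taken from any specific tool or from a real setup: the `Check` type, the check names, and the "page" versus "ticket" routing are all hypothetical assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str      # hypothetical check name
    layer: str     # "internal" or "external" (customer facing)
    healthy: bool


def failing_by_layer(checks):
    """Group the names of failing checks by monitoring layer."""
    failing = {"internal": [], "external": []}
    for check in checks:
        if not check.healthy:
            failing[check.layer].append(check.name)
    return failing


def severity(failing):
    """Assumed routing: customer-facing failures page, internal ones open a ticket."""
    if failing["external"]:
        return "page"
    if failing["internal"]:
        return "ticket"
    return "ok"


checks = [
    Check("node-disk-pressure", "internal", healthy=False),
    Check("checkout-synthetic-probe", "external", healthy=True),
]
result = severity(failing_by_layer(checks))
```

Here only an internal check is failing, so the sketch routes it to a ticket rather than waking someone up. A failing external check would take priority and page.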
It really depends on each organization’s setup. There are large organizations that have myriad tools and platforms, and medium to small organizations that, although their setup is modest, are forced to scale up their tooling just because of their projects’ limitations. My personal take is to have fewer tools and platforms and, instead of deploying a plethora of alerts and dashboards, to stick with the bare minimum. I mean those alerts, dashboards and SLOs that will be handy in a major incident at 3 AM and that will help the engineer drill down to the root cause without complicating the investigative path.
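To make the "bare minimum" SLO idea concrete, here is a minimal sketch of the standard error budget calculation. The function name and the example numbers are assumptions for illustration, not figures from a real system.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Return the fraction of the error budget still unspent.

    slo_target is the availability objective as a fraction, e.g. 0.999.
    A negative result means the budget is already overspent.
    """
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target (or no traffic) leaves no budget to spend.
        return 1.0 if failed_requests == 0 else 0.0
    return 1 - failed_requests / allowed_failures


# Hypothetical month: 99.9% target, 1,000,000 requests, 250 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
```

With these assumed numbers about a thousand failures are allowed, so roughly three quarters of the budget remains. A single number like this is the kind of signal that stays useful during a 3 AM incident.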
Over the last year I have run into OTEL like never before. I’ve recently joined an o11y SME team to help create the very first OTEL certification exam. This is an ongoing process, which helps me stay up to date with the progress of OTEL’s documentation. I have also met a lot of people implementing OTEL in their orgs. OTEL is on everyone’s lips. I’m planning on introducing it at my current organization, and hopefully demoing it will get everyone excited.
Personally, I think it’s a must-have o11y implementation, which doesn’t exclude pairing it with other o11y tools and platforms.
I really like the self-made projects that some developers build for their specific use cases and then share with the community. There are some really exciting o11y projects, very applicable day to day, that could be widely adopted. Then there are the synthetic monitoring/RUM projects, which are not as fresh on the market but could see more implementation and adoption, if used properly. Here I’m hinting at all the proactive o11y trends (SLOs and dashboards as well) that can be used and taken advantage of before incidents happen.
Not yet. But like all my peers, I have a keen interest in it. Over the past few months I have played with some AI assistants and o11y projects. There is still a lot of improvement to be made, and a lot of self-training and upskilling to do, but people will naturally embrace it and start learning and using it more and more.
I’m currently implementing some o11y guidelines and best practices in my daily job. On the side, I’m participating in a lot of tech conferences, where I meet a lot of peers with very interesting projects and use cases. My free time is very limited, so I like to use it well. On my to-do list is to play as much as possible with AI and learn some integrations and use cases with o11y.
This probably sounds like a cliché, but I’m playing a lot with the Claude AI assistant lately. I’ve moved from ChatGPT to Claude and I’m currently assessing its capabilities, just for personal use. I cannot say it’s my favorite, but right now I’m using it often. Besides this, lately I have been more focused on open source tooling, as I said earlier, and hopefully I can become more advanced and proficient in using it.
I play with my toddlers, who are full of energy. I also participate in tech conferences and set aside some material for tech blogs, which can hopefully be published soon.
I would recommend that anyone, techy or not, stay interested and lead with their instinct. Try out as many tools as possible and build that home lab you always wanted, in any technology. Just do it. Practice gets you on your feet faster than any degree or theoretical course.
Diana Todea is a Senior Site Reliability Engineer at EQS Group. She is also a regular speaker at tech events, an OpenTelemetry SME, and an advocate for women in technology.
Diana also has a GitHub repo where you can find slide decks for her talks as well as keep up with her latest projects.
If you would like to feature in Observability People, please get in touch!