Blog | svrnm :: Severin Neumann

No need for HUMANS.md

A while back I was exploring an open source project I was using and was interested in contributing to, as it provided capabilities I needed but lacked some additional ones I wanted to have. Naturally, I was looking for a CONTRIBUTING.md. The file existed, but it only contained a note that issues and PRs are accepted. While this is key information that belongs in this file, I was missing instructions on how to set up the repository, how to run tests, and what criteria I need to meet to increase my chances of having a PR accepted. ...

Adjusting load generators for realistic traffic simulation

Throughout my jobs in the observability space, I created or contributed to various demo and sample applications, which often follow the same premise: there is a “normal state” in which the application is running, and with a trigger, it moves into a “deviated state”. For example, there is the placeOrder transaction on a webshop that performs just fine, and orders and money are flowing into our hypothetical e-commerce company. However, with the click of a button (or a CLI command), an issue is injected into the application, and the placeOrder transaction stops working as expected. Orders go down, money stops flowing, hypothetical customers get angry! ...

Reliability Is Managed In Services, But Felt In Transactions

Modern reliability practice is excellent at making complex systems operable at the service level, but users experience reliability at the level of end-to-end transactions and flows. This post explores why that gap exists and what it means for how we measure and manage reliability. Read the full article on Causely Blog

Are metrics the bestrics?

“What’s your favorite telemetry signal” has been the last question of the Humans of OpenTelemetry series for the last few years. At KubeCon EU 2024 my answer to this was “profiling, because I think this is really closing a big gap that was missing in observability”. But, today, I found out that my answer has changed, and I am leaning more toward what Vijay Samuel gave as an answer: “I feel metrics are the most powerful signal!” ...

How to Turn Slow Queries into Actionable Reliability Metrics with OpenTelemetry

Slow SQL queries degrade user experience, cause cascading failures, and turn simple operations into production incidents. Instead of collecting more telemetry for its own sake, this guide shows how to turn OpenTelemetry database spans into span-derived metrics you can actually act on. Read the full article on Causely Blog

Alerts Aren’t the Investigation

PagerDuty fires: CheckoutAPI burn rate (2m/1h). Grafana shows p99 going from ~120ms to ~900ms. Retries doubled. DB CPU is flat, but checkout pods are throttling and a downstream dependency’s error budget is evaporating. Read the full article on Causely Blog

Can you get Observability without Telemetry?

People always say there are no stupid questions, and then you read the title of this post and you’re not so sure anymore. You start to doubt my sanity, or at least suspect that I’m a troll. However, as it is with most apparently stupid questions, there is something to learn from the answer if you explore it. To spare you from reading the rest of this, the short answer is “Yes, but…” and the long answer is more of a theoretical observation with some linguistic subtleties. So if you’re not interested in that, you can leave and do something fun, otherwise don’t say I didn’t warn you! ...

Splitting out a monolith into multiple services in OpenTelemetry

I did an experiment on splitting out a monolithic application into multiple “virtual services” in OpenTelemetry to have modules visualized independently on service maps. I am not sure if this is a good idea and something you should replicate in practice, since it might violate some best practices. However, I wanted to see how I can do it. Since (as far as I know) all otel backends are only able to provide such a map/graph visualization using service.name from the resource attributes, I tried out what happens if I create one TracerProvider per module with module-specific service.* attributes. ...

What is context propagation, why do I need it, and what does it have to do with metrics?

When you’re heads-down in your own area of expertise, it’s easy to forget that what’s obvious to you might not be to others. As you might have seen in previous posts, I learned that for me using pen and paper from time to time helps uncover unknown knowns in my head. Last time, it was why the three pillars need to go. This time, it’s context propagation, and its surprising relationship to metrics. ...

Thank you, three pillars of Observability. You served us well.

I just read another post introducing traces, metrics, and logs using that analogy, which reminded me to re-share that excellent piece by Ted Young on The New Stack from a few years ago: Modern Observability Is a Single Braid of Data. Ted argued the pillars are no longer load‑bearing and suggested a better framing: the “Single Braid of Data”. So let’s wheel the pillars into the museum, rope off the exhibit, and hang a small plaque: “Historic framing.” As we do with once‑cherished pillars that are no longer load‑bearing. ...