As I mentioned before, many software products are expected to offer an alerting mechanism so that users can get updates on events happening in their app, as well as on important business KPIs. This is a common feature, but it’s still complex to build.
In our solution, we wanted to achieve three things:
In order to do this, we turned to open source: we leveraged the Alertmanager component of Prometheus. Prometheus is an open-source, industry-standard monitoring and alerting toolkit designed to track the performance and health of applications and infrastructure. It collects metrics from various sources and offers a flexible query language for analyzing and visualizing the data. It’s one of the most common backends for collecting OTel metrics, and we already had Prometheus in our backend to support metrics collection.
We relied on an open-source tool like Prometheus to do the legwork for us because such solutions were built by dozens of smart, experienced developers who have worked on them for years, adapted them to support many use cases, and already gone through all (or at least most) of the pitfalls in that domain. We had an internal discussion about the design of the alerts mechanism, and the idea to leverage Prometheus was brought up by members of the team based on their previous experience with it.
Setting an alert based on distributed tracing data – powered by the Prometheus Alertmanager; this alert can be accessed in the Helios Sandbox
An example of how the different alerts from the Helios Sandbox are configured in Prometheus
With Prometheus in hand, we started working on the alerting mechanism. We wanted to start by alerting on traces, or more accurately on spans (e.g., the result of an HTTP request or a DB query). Prometheus offers alerts on metrics, but we needed alerts on traces. The data from the traces doesn’t reach Prometheus as-is – it needs to be converted into Prometheus’ data model. So in order to have Prometheus actually alert on spans, we needed to take a span, convert it to a metric, and configure an alert that is triggered by it. When a trace (span) matched an alert condition – for example, a DB query taking longer than 5 seconds – we converted the span into a Prometheus metric.
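To make that conversion step concrete, here is a minimal sketch of what span-to-metric conversion could look like in Python, using the OTel SDK’s span processor hook and the prometheus_client library. The metric name, labels, and threshold are illustrative assumptions, not Helios’ actual implementation:

```python
# Minimal sketch: turn "slow DB query" spans into a Prometheus metric.
# Assumes the standard opentelemetry-sdk and prometheus_client packages;
# metric and label names are illustrative, not Helios' real schema.
from opentelemetry.sdk.trace import ReadableSpan, SpanProcessor
from prometheus_client import Counter

SLOW_DB_QUERY_SECONDS = 5.0  # example threshold from the text

slow_db_queries = Counter(
    "slow_db_queries_total",
    "DB query spans that exceeded the latency threshold",
    ["service", "db_system", "operation"],
)

class SpanToMetricProcessor(SpanProcessor):
    """Converts spans that match an alert condition into Prometheus metrics."""

    def on_end(self, span: ReadableSpan) -> None:
        attrs = span.attributes or {}
        if "db.system" not in attrs:
            return  # only DB spans are relevant for this condition
        duration_s = (span.end_time - span.start_time) / 1e9  # span times are in ns
        if duration_s > SLOW_DB_QUERY_SECONDS:
            slow_db_queries.labels(
                service=span.resource.attributes.get("service.name", "unknown"),
                db_system=attrs["db.system"],
                operation=span.name,
            ).inc()
```

In a setup like this, the processor would be registered on the tracer provider (for example via add_span_processor), and the resulting counter exposed on a metrics endpoint for Prometheus to scrape.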
The Prometheus model fit what we were aiming to achieve. For every event, we take the raw data from OTel and feed it into Prometheus as a metric. We can then say, for example, that if a specific operation error occurs more than three times in five minutes, an alert should be triggered.
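Expressed as a Prometheus alerting rule, that example condition might look roughly like the following; the metric and label names here are hypothetical:

```yaml
# Hypothetical alerting rule: fire if a specific operation errors
# more than three times within five minutes.
groups:
  - name: trace-based-alerts
    rules:
      - alert: OperationErrorBurst
        expr: increase(operation_errors_total{operation="checkout"}[5m]) > 3
        labels:
          severity: warning
        annotations:
          summary: "More than three errors for {{ $labels.operation }} in the last 5 minutes"
```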
We did not stop there. In Helios, a major benefit for our users is that we can go from distributed tracing data to a metric – but also back from a metric to the specific trace, because we maintain the context of the metric. Users can set trace-based alerts, and then go back from the alert to the E2E flow for fast root cause analysis. This gives users the ultimate visibility into the performance and health of their applications. The available context (based on the instrumented data) helps users easily pinpoint issues and bottlenecks in their app flows for quick troubleshooting and accelerated mean-time-to-resolution (MTTR).
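One way to keep that metric-to-trace link in the Prometheus ecosystem is exemplars, which attach a trace ID to individual metric samples. The sketch below only illustrates the idea using prometheus_client’s exemplar support (exposed via the OpenMetrics format); it is not necessarily how Helios maintains the mapping:

```python
# Sketch: attach the trace ID to the metric sample as an exemplar, so an
# alert on the metric can be traced back to the exact span that caused it.
# Requires prometheus_client >= 0.13 and the OpenMetrics exposition format.
from opentelemetry.trace import format_trace_id
from prometheus_client import Histogram

db_query_duration = Histogram(
    "db_query_duration_seconds",
    "Duration of DB query spans",
    ["service", "operation"],
)

def record_span_duration(span) -> None:
    duration_s = (span.end_time - span.start_time) / 1e9  # span times are in ns
    db_query_duration.labels(
        service=span.resource.attributes.get("service.name", "unknown"),
        operation=span.name,
    ).observe(
        duration_s,
        exemplar={"trace_id": format_trace_id(span.context.trace_id)},
    )
```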
Related: How we use trace-based alerts to reduce MTTR
The alerting mechanism we built was designed to alert on behaviors that can be defined on tracing data, such as a failed HTTP request made by service A to service B, a MongoDB query to a specific collection that took more than 500 ms, or a failed Lambda function invocation.
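Each of those behaviors boils down to a predicate over a span’s attributes, following the OTel semantic conventions. The helpers below are hypothetical and only sketch how such conditions could be expressed:

```python
# Sketch: alert conditions expressed as predicates over span attributes.
# Attribute keys follow the OTel semantic conventions; helper names,
# service identification, and thresholds are illustrative.
from opentelemetry.trace import StatusCode

def failed_http_call(span, from_service: str, to_service: str) -> bool:
    """A failed HTTP request made by one service to another."""
    attrs = span.attributes or {}
    return (
        span.resource.attributes.get("service.name") == from_service
        and attrs.get("peer.service") == to_service
        and attrs.get("http.status_code", 0) >= 500
    )

def slow_mongo_query(span, collection: str, threshold_ms: float = 500.0) -> bool:
    """A MongoDB query to a specific collection that took too long."""
    attrs = span.attributes or {}
    duration_ms = (span.end_time - span.start_time) / 1e6  # span times are in ns
    return (
        attrs.get("db.system") == "mongodb"
        and attrs.get("db.mongodb.collection") == collection
        and duration_ms > threshold_ms
    )

def failed_lambda_invocation(span) -> bool:
    """A Lambda (FaaS) invocation that ended with an error status."""
    attrs = span.attributes or {}
    return attrs.get("faas.trigger") is not None and span.status.status_code == StatusCode.ERROR
```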