
Consider SLO-based alerting with error budgets for CPU/RAM alerts #62

@jomcgi

Description


Current State

Your alerting rules in stacks/observability/provisioning/alerting/rules.yaml use static threshold alerts:

  • High CPU — fires when CPU > 80% for 5 minutes
  • High RAM — fires when RAM > 90% for 5 minutes
  • High Disk — fires when disk > 80% for 10 minutes

These are straightforward and easy to reason about — nothing wrong with that for a homelab. But for CPU and RAM, static thresholds can generate alerts that aren't actually actionable, because a spike that lasts 5 minutes but doesn't impact any running services doesn't need human attention.
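For reference, the existing static CPU alert could be expressed as a Prometheus-style rule roughly like this (a sketch only — the repo's actual rules live in Grafana's provisioning format, and the `job="node"` label is an assumption):

```yaml
# Hypothetical Prometheus-style equivalent of the static CPU threshold.
# The real rules.yaml uses Grafana alerting provisioning; job="node" is assumed.
groups:
  - name: static-thresholds
    rules:
      - alert: HighCPU
        # Busy fraction = 1 - idle fraction, averaged across all cores
        expr: (1 - avg(rate(node_cpu_seconds_total{mode="idle",job="node"}[5m]))) > 0.80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 80% for 5 minutes"
```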

The Idea: Error Budgets for Resource Alerts

Instead of "alert me when CPU is high," you could frame it as:

"I'm OK with CPU being above 80% for up to 1% of the time over a 30-day window. Alert me when I'm burning through that budget too fast."

Concretely, 1% of a 30-day window is about 7.2 hours of error budget (720 hours × 0.01).

This is the SLO (Service Level Objective) approach. The key shift:

  • Threshold alert: "Something is high right now" → often noisy
  • SLO alert: "At this burn rate, I'll exhaust my error budget" → actionable

A tool like Sloth can generate multi-window, multi-burn-rate Prometheus alerting rules from a simple SLO spec. It handles the math of calculating burn rates across different time windows (5m, 30m, 1h, 6h) so you get fast alerts for severe incidents and slower alerts for gradual degradation — all from one definition.
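The multi-window idea can be sketched in plain Prometheus terms. Note the recording rule names below are hypothetical — Sloth generates its own `slo:sli_error:ratio_rate<window>` series — and the burn-rate factors (14.4 for paging, 6 for tickets) follow the common SRE multiwindow pattern; a burn rate of 14.4 exhausts a 30-day budget in roughly 2 days:

```yaml
# Sketch of multiwindow, multi-burn-rate alerting for a 99% objective
# (error budget fraction = 0.01). Recording rule names are hypothetical.
groups:
  - name: slo-burn-rate-sketch
    rules:
      # Page: fast burn, confirmed on both a short and a long window
      # so a brief spike alone doesn't fire the alert.
      - alert: HighCPUBurnRateFast
        expr: |
          cpu:error_ratio:rate5m > (14.4 * 0.01)
          and
          cpu:error_ratio:rate1h > (14.4 * 0.01)
        labels:
          severity: critical
      # Ticket: slow burn caught over longer windows.
      - alert: HighCPUBurnRateSlow
        expr: |
          cpu:error_ratio:rate30m > (6 * 0.01)
          and
          cpu:error_ratio:rate6h > (6 * 0.01)
        labels:
          severity: warning
```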

Example: CPU SLO with Sloth

version: "prometheus/v1"
service: "homelab-infra"
labels:
  owner: "colin"
slos:
  - name: "cpu-not-saturated"
    objective: 99  # CPU below 80% for 99% of the time over 30 days
    sli:
      events:
        # "bool" makes the comparison return 0/1 instead of filtering the
        # series away, so error_query / total_query yields an error ratio
        error_query: >
          (1 - avg(rate(node_cpu_seconds_total{mode="idle",job="node"}[{{.window}}]))) > bool 0.8
        total_query: "vector(1)"
    alerting:
      name: HighCPUBurnRate
      labels:
        severity: warning
      annotations:
        summary: "CPU error budget burn rate is too high"
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning

Running sloth generate on this spec (e.g. `sloth generate -i cpu-slo.yaml -o cpu-slo-rules.yaml`) produces Prometheus recording + alerting rules with proper multi-window burn rate detection.

What I'd Keep as Threshold Alerts

Disk usage is a great candidate for staying as a static threshold. Disk doesn't self-heal — once it's full, things break. A "disk > 80%" alert with a 10-minute window is perfectly reasonable and more intuitive than an error budget for a monotonically increasing resource.
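If disk stays as a static threshold, the Prometheus-style equivalent is simple (a sketch — filesystem labels and the `/` mountpoint are assumptions):

```yaml
# Hypothetical static disk alert; metric labels are assumptions.
groups:
  - name: disk-threshold
    rules:
      - alert: HighDisk
        # Used fraction of the root filesystem
        expr: |
          (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) > 0.80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage above 80% for 10 minutes"
```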

TL;DR

| Alert | Suggested Approach | Why |
|---|---|---|
| High CPU | SLO + error budget | Spikes are normal; only alert when budget is at risk |
| High RAM | SLO + error budget | Same; transient pressure isn't actionable |
| High Disk | Keep as threshold | Disk is finite and doesn't self-heal |


Labels: priority/low (Backlog — nice to have, no urgency)