
Consider SLO-based alerting with error budgets for CPU/RAM alerts #62

@jomcgi

Description


Current State

Your alerting rules in stacks/observability/provisioning/alerting/rules.yaml use static threshold alerts:

  • High CPU — fires when CPU > 80% for 5 minutes
  • High RAM — fires when RAM > 90% for 5 minutes
  • High Disk — fires when disk > 80% for 10 minutes

These are straightforward and easy to reason about — nothing wrong with that for a homelab. But for CPU and RAM, static thresholds can generate alerts that aren't actually actionable, because a spike that lasts 5 minutes but doesn't impact any running services doesn't need human attention.
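For reference, the existing static CPU alert could be expressed as a Prometheus-style rule roughly like this (a sketch only — the repo's actual rules live in Grafana's provisioning format, and the `job="node"` label is an assumption):

```yaml
# Hypothetical Prometheus-style equivalent of the static CPU threshold.
# The real rules.yaml uses Grafana alerting provisioning; job="node" is assumed.
groups:
  - name: static-thresholds
    rules:
      - alert: HighCPU
        # Busy fraction = 1 - idle fraction, averaged across all cores
        expr: (1 - avg(rate(node_cpu_seconds_total{mode="idle",job="node"}[5m]))) > 0.80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 80% for 5 minutes"
```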

The Idea: Error Budgets for Resource Alerts

Instead of "alert me when CPU is high," you could frame it as:

"I'm OK with CPU being above 80% for up to 1% of the time over a 30-day window. Alert me when I'm burning through that budget too fast."

Concretely, 1% of a 30-day window is about 7.2 hours of error budget (720 hours × 0.01).

This is the SLO (Service Level Objective) approach. The key shift:

  • Threshold alert: "Something is high right now" → often noisy
  • SLO alert: "At this burn rate, I'll exhaust my error budget" → actionable

A tool like Sloth can generate multi-window, multi-burn-rate Prometheus alerting rules from a simple SLO spec. It handles the math of calculating burn rates across different time windows (5m, 30m, 1h, 6h) so you get fast alerts for severe incidents and slower alerts for gradual degradation — all from one definition.
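The multi-window idea can be sketched in plain Prometheus terms. Note the recording rule names below are hypothetical — Sloth generates its own `slo:sli_error:ratio_rate<window>` series — and the burn-rate factors (14.4 for paging, 6 for tickets) follow the common SRE multiwindow pattern; a burn rate of 14.4 exhausts a 30-day budget in roughly 2 days:

```yaml
# Sketch of multiwindow, multi-burn-rate alerting for a 99% objective
# (error budget fraction = 0.01). Recording rule names are hypothetical.
groups:
  - name: slo-burn-rate-sketch
    rules:
      # Page: fast burn, confirmed on both a short and a long window
      # so a brief spike alone doesn't fire the alert.
      - alert: HighCPUBurnRateFast
        expr: |
          cpu:error_ratio:rate5m > (14.4 * 0.01)
          and
          cpu:error_ratio:rate1h > (14.4 * 0.01)
        labels:
          severity: critical
      # Ticket: slow burn caught over longer windows.
      - alert: HighCPUBurnRateSlow
        expr: |
          cpu:error_ratio:rate30m > (6 * 0.01)
          and
          cpu:error_ratio:rate6h > (6 * 0.01)
        labels:
          severity: warning
```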

Example: CPU SLO with Sloth

version: "prometheus/v1"
service: "homelab-infra"
labels:
  owner: "colin"
slos:
  - name: "cpu-not-saturated"
    objective: 99  # CPU below 80% for 99% of the time over 30 days
    sli:
      events:
        # "bool" makes the comparison return 0/1 instead of filtering the
        # series away, so error_query / total_query yields an error ratio
        error_query: >
          (1 - avg(rate(node_cpu_seconds_total{mode="idle",job="node"}[{{.window}}]))) > bool 0.8
        total_query: "vector(1)"
    alerting:
      name: HighCPUBurnRate
      labels:
        severity: warning
      annotations:
        summary: "CPU error budget burn rate is too high"
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning

Running sloth generate on this spec (e.g. `sloth generate -i cpu-slo.yaml -o cpu-slo-rules.yaml`) produces Prometheus recording + alerting rules with proper multi-window burn rate detection.

What I'd Keep as Threshold Alerts

Disk usage is a great candidate for staying as a static threshold. Disk doesn't self-heal — once it's full, things break. A "disk > 80%" alert with a 10-minute window is perfectly reasonable and more intuitive than an error budget for a monotonically increasing resource.
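If disk stays as a static threshold, the Prometheus-style equivalent is simple (a sketch — filesystem labels and the `/` mountpoint are assumptions):

```yaml
# Hypothetical static disk alert; metric labels are assumptions.
groups:
  - name: disk-threshold
    rules:
      - alert: HighDisk
        # Used fraction of the root filesystem
        expr: |
          (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) > 0.80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage above 80% for 10 minutes"
```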

TL;DR

| Alert | Suggested Approach | Why |
|---|---|---|
| High CPU | SLO + error budget | Spikes are normal; only alert when budget is at risk |
| High RAM | SLO + error budget | Same; transient pressure isn't actionable |
| High Disk | Keep as threshold | Disk is finite and doesn't self-heal |


Labels: priority/low (Backlog — nice to have, no urgency)