## Current State
Your alerting rules in `stacks/observability/provisioning/alerting/rules.yaml` use static threshold alerts:
- High CPU — fires when CPU > 80% for 5 minutes
- High RAM — fires when RAM > 90% for 5 minutes
- High Disk — fires when disk > 80% for 10 minutes
These are straightforward and easy to reason about — nothing wrong with that for a homelab. But for CPU and RAM, static thresholds can generate alerts that aren't actually actionable, because a spike that lasts 5 minutes but doesn't impact any running services doesn't need human attention.
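For reference, the CPU rule in that style usually looks something like this in Prometheus syntax (a sketch assuming node_exporter metrics; not copied from your `rules.yaml`):

```yaml
groups:
  - name: static-thresholds
    rules:
      - alert: HighCPU
        # 1 - idle fraction = overall CPU utilization, sustained for 5 minutes
        expr: (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.80
        for: 5m
        labels:
          severity: warning
```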
## The Idea: Error Budgets for Resource Alerts
Instead of "alert me when CPU is high," you could frame it as:
"I'm OK with CPU being above 80% for up to 1% of the time over a 30-day window. Alert me when I'm burning through that budget too fast."
This is the SLO (Service Level Objective) approach. The key shift:
- Threshold alert: "Something is high right now" → often noisy
- SLO alert: "At this burn rate, I'll exhaust my error budget" → actionable
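To make "budget" concrete: a 99% objective over a 30-day window tolerates about 7.2 hours out of SLO. A quick back-of-the-envelope sketch (the 14.4 fast-burn multiplier is the common multiwindow convention from SRE practice, not something taken from your rules):

```python
# Worked example: error budget for a 99% objective over 30 days.
# Numbers are illustrative, not Sloth's output.

WINDOW_HOURS = 30 * 24            # 30-day SLO window = 720 hours
objective = 0.99                  # "CPU below 80% for 99% of the time"

# Error budget: the fraction of the window allowed to be out of SLO.
budget_fraction = 1 - objective                  # 0.01
budget_hours = budget_fraction * WINDOW_HOURS    # ~7.2 hours

# Burn rate 1 spends the budget exactly over the full window.
# A fast-burn rate of 14.4 exhausts it much sooner:
fast_burn = 14.4
hours_to_exhaust = budget_hours / fast_burn      # ~0.5 hours

print(round(budget_hours, 2), round(hours_to_exhaust, 2))
```

That asymmetry is the point: a brief spike barely dents a 7.2-hour budget, but a sustained burn gets caught within the hour.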
A tool like Sloth can generate multi-window, multi-burn-rate Prometheus alerting rules from a simple SLO spec. It handles the math of calculating burn rates across different time windows (5m, 30m, 1h, 6h) so you get fast alerts for severe incidents and slower alerts for gradual degradation — all from one definition.
## Example: CPU SLO with Sloth
```yaml
version: "prometheus/v1"
service: "homelab-infra"
labels:
  owner: "colin"
slos:
  - name: "cpu-not-saturated"
    objective: 99  # CPU below 80% for 99% of the time over 30 days
    sli:
      events:
        # "bool" makes the comparison return 1 when average CPU
        # utilization exceeds 80% and 0 otherwise, so error/total is a ratio
        error_query: >
          (1 - avg(rate(node_cpu_seconds_total{mode="idle",job="node"}[{{.window}}]))) > bool 0.8
        total_query: "vector(1)"
    alerting:
      name: HighCPUBurnRate
      labels:
        severity: warning
      annotations:
        summary: "CPU error budget burn rate is too high"
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning
```
Running `sloth generate` on this produces Prometheus recording + alerting rules with proper multi-window burn rate detection.
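The generated alerts follow the multiwindow, multi-burn-rate pattern. Roughly this shape, as a simplified sketch only (the `slo:sli_error:*` recording-rule names and labels come from Sloth's conventions; check the actual generated output):

```yaml
# Simplified sketch of one generated fast-burn alert, not Sloth's literal
# output. Each ratio_rate* series is a recording rule Sloth also generates.
- alert: HighCPUBurnRate
  expr: |
    slo:sli_error:ratio_rate5m{sloth_service="homelab-infra"} > (14.4 * 0.01)
    and
    slo:sli_error:ratio_rate1h{sloth_service="homelab-infra"} > (14.4 * 0.01)
  labels:
    severity: critical
```

Requiring both the short and the long window to exceed the threshold is what keeps a single 5-minute spike from paging you.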
## What I'd Keep as Threshold Alerts
Disk usage is a great candidate for staying as a static threshold. Disk doesn't self-heal — once it's full, things break. A "disk > 80%" alert with a 10-minute window is perfectly reasonable and more intuitive than an error budget for a monotonically increasing resource.
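For completeness, a static disk rule in Prometheus form might look like this (a sketch; the node_exporter metric names and `fstype` filter are assumptions, not copied from your `rules.yaml`):

```yaml
- alert: HighDisk
  # Fires when any real filesystem is more than 80% full for 10 minutes
  expr: |
    (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
           / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 0.80
  for: 10m
  labels:
    severity: warning
```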
## TL;DR
| Alert | Suggested Approach | Why |
|---|---|---|
| High CPU | SLO + error budget | Spikes are normal; only alert when budget is at risk |
| High RAM | SLO + error budget | Same — transient pressure isn't actionable |
| High Disk | Keep as threshold | Disk is finite and doesn't self-heal |
## Resources