Prometheus and OpenMetrics monitoring for Loki Mode (v5.38.0).
Loki Mode exposes a /metrics endpoint that returns production-ready metrics in Prometheus/OpenMetrics text format. This enables integration with:
- Prometheus
- Grafana
- Datadog
- New Relic
- Elastic APM
- Any OpenMetrics-compatible monitoring system
# Enable metrics endpoint
export LOKI_METRICS_ENABLED=true
# Start Loki Mode
loki start ./prd.md
# View metrics
curl http://localhost:57374/metrics
# Or use CLI
loki metrics

GET http://localhost:57374/metrics
Content-Type: text/plain; version=0.0.4
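Returns metrics in the Prometheus text exposition format (the `version=0.0.4` content type shown above), which Prometheus and most OpenMetrics-compatible scrapers accept. No authentication is required by default; in production, put the endpoint behind a reverse proxy that enforces authentication.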
| Metric | Type | Description |
|---|---|---|
| loki_session_status | gauge | Current session status: 0=stopped, 1=running, 2=paused |
| loki_iteration_current | gauge | Current iteration number |
| loki_iteration_max | gauge | Maximum configured iterations (from LOKI_MAX_ITERATIONS) |
| loki_uptime_seconds | gauge | Seconds since session started |
| Metric | Type | Labels | Description |
|---|---|---|---|
| loki_tasks_total | gauge | status | Number of tasks by status: pending, in_progress, completed, failed |
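
Because loki_tasks_total carries a `status` label, each status shows up as its own sample line. The sketch below parses a hand-written sample of such output into a per-status dictionary. The sample text, its values, and the helper name `parse_labeled_samples` are illustrative assumptions, not captured output from Loki Mode.

```python
import re

# Hypothetical sample of what /metrics might return for loki_tasks_total;
# the values are made up for illustration.
SAMPLE = """\
# HELP loki_tasks_total Number of tasks by status
# TYPE loki_tasks_total gauge
loki_tasks_total{status="pending"} 3
loki_tasks_total{status="in_progress"} 1
loki_tasks_total{status="completed"} 12
loki_tasks_total{status="failed"} 2
"""

# Matches: name{label="value"} number
# (single-label form only, which is all this metric needs)
LINE_RE = re.compile(r'^(\w+)\{(\w+)="([^"]*)"\}\s+(\S+)$')


def parse_labeled_samples(text, metric):
    """Return {label_value: float} for one single-label metric."""
    result = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group(1) == metric:
            result[match.group(3)] = float(match.group(4))
    return result


tasks = parse_labeled_samples(SAMPLE, "loki_tasks_total")
print(tasks)
```

Comment lines beginning with `#` never match the pattern, so only sample lines are collected.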
| Metric | Type | Description |
|---|---|---|
| loki_agents_active | gauge | Number of currently active agents |
| loki_agents_total | gauge | Total number of registered agents |
| Metric | Type | Description |
|---|---|---|
| loki_cost_usd | gauge | Estimated total session cost in USD |
| Metric | Type | Description |
|---|---|---|
| loki_events_total | counter | Total number of events recorded in events.jsonl |
Metrics are derived from .loki/ flat files:
| File | Metrics |
|---|---|
| dashboard-state.json | session_status, iteration_current, iteration_max, tasks_total, agents_active |
| loki.pid | session_status (PID alive check fallback), uptime_seconds |
| state/agents.json | agents_total |
| metrics/efficiency/*.json | cost_usd |
| events.jsonl | events_total (line count) |
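
To make the file-to-metric mapping concrete, here is a minimal sketch of how a line count over events.jsonl could be turned into the loki_events_total sample. It is not Loki Mode's actual implementation; the function names and the choice to report zero for a missing file are assumptions made for illustration.

```python
import json
import os
import tempfile


def count_events(loki_dir):
    """Count lines in events.jsonl; a missing file counts as zero events."""
    path = os.path.join(loki_dir, "events.jsonl")
    try:
        with open(path, "r", encoding="utf-8") as handle:
            return sum(1 for _ in handle)
    except FileNotFoundError:
        return 0


def render_events_total(loki_dir):
    """Render the counter in Prometheus text exposition format."""
    return (
        "# TYPE loki_events_total counter\n"
        f"loki_events_total {count_events(loki_dir)}\n"
    )


# Demo against a throwaway directory standing in for .loki/
with tempfile.TemporaryDirectory() as demo_dir:
    events_path = os.path.join(demo_dir, "events.jsonl")
    with open(events_path, "w", encoding="utf-8") as handle:
        for i in range(3):
            handle.write(json.dumps({"event": f"demo-{i}"}) + "\n")
    demo_output = render_events_total(demo_dir)
    print(demo_output)
```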
# Fetch all metrics
loki metrics
# Filter specific metric
loki metrics | grep loki_cost_usd
# Watch metrics in real-time
watch -n 5 loki metrics
# Custom dashboard host/port
loki metrics --host 192.168.1.100 --port 8080

Add to prometheus.yml:
# Basic scrape configuration
scrape_configs:
  - job_name: 'loki-mode'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:57374']
        labels:
          environment: 'production'
          project: 'my-app'

# HTTPS with a self-signed certificate
scrape_configs:
  - job_name: 'loki-mode'
    scheme: https
    tls_config:
      insecure_skip_verify: true # For self-signed certs
    static_configs:
      - targets: ['localhost:57374']

# HTTPS with bearer token authentication
scrape_configs:
  - job_name: 'loki-mode'
    scheme: https
    bearer_token: 'loki_xxx...'
    static_configs:
      - targets: ['dashboard.example.com:443']

# Kubernetes pod discovery
scrape_configs:
  - job_name: 'loki-mode'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - loki
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: loki-mode
      - source_labels: [__meta_kubernetes_pod_ip]
        target_label: __address__
        replacement: $1:57374

- Navigate to Configuration > Data Sources
- Click "Add data source"
- Select "Prometheus"
- URL: http://prometheus-server:9090
- Save & Test
Import the Loki Mode dashboard template or create custom panels:
- Type: Stat
- Query: loki_session_status
- Value Mappings:
  - 0 = Stopped (Red)
  - 1 = Running (Green)
  - 2 = Paused (Yellow)

- Type: Gauge
- Query: loki_iteration_current / loki_iteration_max * 100
- Unit: Percent (0-100)
- Thresholds: 0-50 (yellow), 50-100 (green)

- Type: Pie chart
- Query: loki_tasks_total
- Legend: {{status}}

- Type: Time series
- Query: loki_agents_active
- Legend: Active Agents

- Type: Stat
- Query: loki_cost_usd
- Unit: Currency (USD)
- Decimals: 2

- Type: Graph
- Query: rate(loki_events_total[5m])
- Legend: Events per second

- Type: Stat
- Query: loki_uptime_seconds
- Unit: Duration (seconds)
# Session is running
loki_session_status == 1
# Iteration progress percentage
loki_iteration_current / loki_iteration_max * 100
# Total pending + in-progress tasks
loki_tasks_total{status="pending"} + loki_tasks_total{status="in_progress"}
# Cost added over the last hour (loki_cost_usd is a gauge, so use delta() rather than rate())
delta(loki_cost_usd[1h])
# Event rate (events per minute)
rate(loki_events_total[5m]) * 60
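# Task completion rate in tasks per second (loki_tasks_total is a gauge, so use deriv() rather than rate())
deriv(loki_tasks_total{status="completed"}[10m])
# Failed task ratio (aggregate both sides so the label sets match)
sum(loki_tasks_total{status="failed"}) / sum(loki_tasks_total)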
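
The same arithmetic can be checked outside Prometheus. The sketch below mirrors the iteration-progress and failed-ratio queries on plain Python dictionaries; the function names and input values are hypothetical, and the inputs are assumed to have already been parsed from /metrics.

```python
def iteration_progress_percent(samples):
    """Mirror of: loki_iteration_current / loki_iteration_max * 100"""
    maximum = samples["loki_iteration_max"]
    if maximum == 0:
        return 0.0
    return samples["loki_iteration_current"] / maximum * 100


def failed_task_ratio(tasks_by_status):
    """Failed tasks divided by all tasks, or 0.0 when there are no tasks."""
    total = sum(tasks_by_status.values())
    if total == 0:
        return 0.0
    return tasks_by_status.get("failed", 0) / total


# Made-up values for illustration
samples = {"loki_iteration_current": 5, "loki_iteration_max": 20}
tasks = {"pending": 3, "in_progress": 1, "completed": 14, "failed": 2}

progress = iteration_progress_percent(samples)
ratio = failed_task_ratio(tasks)
print(progress, ratio)
```

Unlike PromQL, which yields NaN or Inf on division by zero, this sketch returns 0.0 for an empty denominator; that is a design choice for the example rather than documented Loki Mode behaviour.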
Create /etc/datadog-agent/conf.d/openmetrics.d/loki_mode.yaml:
instances:
  - prometheus_url: http://localhost:57374/metrics
    namespace: loki
    metrics:
      - loki_session_status
      - loki_iteration_current
      - loki_iteration_max
      - loki_tasks_total
      - loki_agents_active
      - loki_agents_total
      - loki_cost_usd
      - loki_events_total
      - loki_uptime_seconds
    tags:
      - environment:production
      - service:loki-mode

Restart Datadog Agent:
sudo systemctl restart datadog-agent

View metrics in Datadog:
- Navigate to Dashboards > New Dashboard
- Add widgets with queries like loki.session_status, loki.cost_usd
- Set up monitors for cost thresholds and session failures
Create loki_alerts.yml:
groups:
  - name: loki-mode
    interval: 30s
    rules:
      - alert: LokiSessionDown
        expr: loki_session_status == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Loki Mode session is not running"
          description: "Session has been stopped for more than 5 minutes"
      - alert: LokiBudgetWarning
        expr: loki_cost_usd > 4.00
        labels:
          severity: warning
        annotations:
          summary: "Loki Mode cost approaching budget limit"
          description: "Current cost: ${{ $value }}"
      - alert: LokiBudgetCritical
        expr: loki_cost_usd > 4.50
        labels:
          severity: critical
        annotations:
          summary: "Loki Mode cost has passed 90% of budget"
          description: "Current cost: ${{ $value }}, budget: $5.00"
      - alert: LokiStagnation
        expr: changes(loki_iteration_current[30m]) == 0 and loki_session_status == 1
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Loki Mode iteration not progressing"
          description: "No iteration progress in 30 minutes"
      - alert: LokiHighFailureRate
        # Aggregate both sides so the label sets match
        expr: sum(loki_tasks_total{status="failed"}) / sum(loki_tasks_total) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High task failure rate"
          description: "{{ $value | humanizePercentage }} of tasks are failing"
      - alert: LokiTooManyAgents
        expr: loki_agents_active > 50
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Too many active agents"
          description: "{{ $value }} agents active, may indicate runaway spawning"

Configure alerts in Grafana panels:
- Edit panel
- Navigate to Alert tab
- Create alert rule:
  - Condition: WHEN last() OF query(A, 5m, now) IS ABOVE 4.5
  - Evaluate: Every 1m for 5m
  - Send to: Slack, PagerDuty, Email
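
The two budget thresholds used in the alert rules (4.00 for warning, 4.50 for critical) can be expressed as a small classifier, which is handy when testing alert logic without a running Prometheus. This is a sketch only; the function name `budget_severity` and the sample costs are hypothetical.

```python
WARNING_USD = 4.00   # threshold from the LokiBudgetWarning rule
CRITICAL_USD = 4.50  # threshold from the LokiBudgetCritical rule


def budget_severity(cost_usd):
    """Return "critical", "warning", or None for a given session cost."""
    if cost_usd > CRITICAL_USD:
        return "critical"
    if cost_usd > WARNING_USD:
        return "warning"
    return None


# Made-up costs for illustration
for cost in (3.20, 4.25, 4.75):
    print(cost, budget_severity(cost))
```

Note one difference from Prometheus: above 4.50 both rules would fire there, while this function reports only the most severe level.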
| Variable | Default | Description |
|---|---|---|
| LOKI_METRICS_ENABLED | false | Enable /metrics endpoint |
| LOKI_METRICS_PORT | 57374 | Port for metrics endpoint (same as dashboard) |
| LOKI_METRICS_PATH | /metrics | Endpoint path |
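
A client script can resolve the same three variables with the documented defaults before building the scrape URL. The sketch below shows one way to do that; the `MetricsConfig` class, the `load_metrics_config` helper, and the rule that only the literal string "true" enables the endpoint are assumptions, not Loki Mode's own parsing logic.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricsConfig:
    enabled: bool
    port: int
    path: str


def load_metrics_config(env=None):
    """Read the three variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return MetricsConfig(
        # Assumption: only the literal "true" (any case) enables the endpoint
        enabled=env.get("LOKI_METRICS_ENABLED", "false").strip().lower() == "true",
        port=int(env.get("LOKI_METRICS_PORT", "57374")),
        path=env.get("LOKI_METRICS_PATH", "/metrics"),
    )


# Pass a plain dict instead of os.environ to keep the demo deterministic
config = load_metrics_config({"LOKI_METRICS_ENABLED": "true"})
url = f"http://localhost:{config.port}{config.path}"
print(url)
```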
- Enable metrics in production: export LOKI_METRICS_ENABLED=true
- Secure endpoint with reverse proxy authentication
- Set up Prometheus scraping with appropriate interval (15-30s)
- Create Grafana dashboards for visualization
- Configure alerts for budget, stagnation, and failures
- Monitor metrics retention and storage
- Metrics endpoint is lightweight (reads flat files, no DB queries)
- Scrape interval of 15-30 seconds recommended
- Metrics are cached for 2 seconds to avoid excessive file reads
- No impact on Loki Mode execution performance
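
The 2-second caching mentioned above can be pictured as a simple time-to-live cache in front of the file reads. The following is a sketch of that general technique, not Loki Mode's code; the `TTLCache` class and the fake clock used in the demo are hypothetical.

```python
import time


class TTLCache:
    """Cache one computed value and recompute it once it is older than ttl seconds."""

    def __init__(self, compute, ttl=2.0, clock=time.monotonic):
        self._compute = compute
        self._ttl = ttl
        self._clock = clock
        self._value = None
        self._expires_at = None  # None means nothing is cached yet

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._compute()
            self._expires_at = now + self._ttl
        return self._value


# Demo with a fake clock so the behaviour is deterministic
fake_now = [0.0]
reads = []


def read_files():
    reads.append(1)  # stands in for reading the .loki/ flat files
    return f"render #{len(reads)}"


cache = TTLCache(read_files, ttl=2.0, clock=lambda: fake_now[0])
first = cache.get()    # nothing cached yet, so this computes
fake_now[0] = 1.0
second = cache.get()   # 1s old, still within the 2s window
fake_now[0] = 2.5
third = cache.get()    # past the 2s window, so this recomputes
print(first, second, third, len(reads))
```

With a 15-30 second scrape interval, a cache like this mostly protects against bursts such as several scrapers or `watch` loops hitting the endpoint at once.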
- Track loki_cost_usd to prevent budget overruns
- Alert on loki_session_status == 0 for unexpected stops
- Monitor loki_tasks_total{status="failed"} for quality issues
- Watch loki_agents_active for agent spawning issues
- Track loki_iteration_current for progress
# Check LOKI_METRICS_ENABLED is set
echo $LOKI_METRICS_ENABLED
# Verify LOKI_DIR is set (required for dashboard)
echo $LOKI_DIR
# Check dashboard-state.json exists and is updating
ls -la .loki/dashboard-state.json
watch -n 2 cat .loki/dashboard-state.json
# Check dashboard is running
loki dashboard status
curl http://localhost:57374/health

# Ensure a Loki session is running
loki status
# Check dashboard-state.json is being updated (every 2 seconds)
stat .loki/dashboard-state.json
# Verify metrics files exist
ls -la .loki/metrics/efficiency/
# Check events.jsonl exists
ls -la .loki/events.jsonl

# Verify dashboard is running on expected port
curl http://localhost:57374/health
# Check if another process is using port 57374
lsof -ti:57374
# Restart dashboard
loki dashboard stop
loki dashboard start

# Test endpoint manually
curl http://localhost:57374/metrics
# Check Prometheus targets page
open http://prometheus-server:9090/targets
# Verify network connectivity from Prometheus to Loki dashboard
# (firewall, security groups, etc.)
# Check Prometheus logs
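kubectl logs -f prometheus-server-xyz

# Set up budget alert
cat > /tmp/budget_check.sh <<'EOF'
#!/bin/bash
# Anchor the match so any "# HELP" / "# TYPE" comment lines are skipped
COST=$(curl -s http://localhost:57374/metrics | grep '^loki_cost_usd' | awk '{print $2}')
# Skip the comparison when the metric could not be read
if [ -n "$COST" ] && (( $(echo "$COST > 4.5" | bc -l) )); then
  echo "CRITICAL: Cost $COST exceeds budget!"
  loki stop
fi
EOF
# Make the script executable so cron can run it
chmod +x /tmp/budget_check.sh
# Run every 5 minutes
crontab -e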
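# Add: */5 * * * * /tmp/budget_check.sh

import json

import requests


def get_loki_metrics():
    """Fetch /metrics and return a dict mapping each sample name (with labels) to its value."""
    response = requests.get("http://localhost:57374/metrics", timeout=5)
    response.raise_for_status()
    metrics = {}
    for line in response.text.splitlines():
        # Sample lines start with the metric name; lines starting with "#" are comments
        if line.startswith("loki_"):
            parts = line.split()
            metric_name = parts[0]
            metric_value = float(parts[1]) if len(parts) > 1 else 0
            metrics[metric_name] = metric_value
    return metrics


metrics = get_loki_metrics()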
print(json.dumps(metrics, indent=2))

# Add to Prometheus Alertmanager config
# (the heredoc delimiter is quoted so the shell leaves the {{ ... }} template untouched)
cat >> /etc/alertmanager/alertmanager.yml <<'EOF'
receivers:
  - name: slack
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#loki-alerts'
        text: '{{ .CommonAnnotations.description }}'
EOF

- Audit Logging - Track agent actions
- Dashboard Guide - Web dashboard
- Enterprise Features - Complete enterprise guide
- Prometheus Metrics - Detailed wiki documentation