Prometheus Metrics

Breakglass exposes comprehensive Prometheus metrics for monitoring system health, performance, and audit trails. Metrics are registered with the controller-runtime metrics registry and available at the /metrics endpoint on the metrics port (default 8081).

Access Metrics:

curl http://breakglass.example.com:8081/metrics

Prometheus Scrape Configuration:

scrape_configs:
  - job_name: 'breakglass'
    static_configs:
      - targets: ['breakglass.example.com:8081']
    metrics_path: '/metrics'
    bearer_token: '<bearer-token>'  # If authentication required

Webhook Metrics

These metrics track authorization webhook activity and decisions.

Request Volume

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `breakglass_webhook_sar_requests_total` | Counter | `cluster` | Total SubjectAccessReview requests received |
| `breakglass_webhook_sar_requests_by_action_total` | Counter | `cluster`, `verb`, `api_group`, `resource`, `namespace`, `subresource` | SAR requests grouped by action (verb, resource, namespace) |

Example Queries:

# Requests per cluster
sum(rate(breakglass_webhook_sar_requests_total[5m])) by (cluster)

# Requests by action (e.g., get pod requests)
breakglass_webhook_sar_requests_by_action_total{verb="get", resource="pods"}

Authorization Decisions

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `breakglass_webhook_sar_allowed_total` | Counter | `cluster` | SAR requests allowed by webhook |
| `breakglass_webhook_sar_denied_total` | Counter | `cluster` | SAR requests denied by webhook |
| `breakglass_webhook_sar_decisions_by_action_total` | Counter | `cluster`, `verb`, `api_group`, `resource`, `namespace`, `subresource`, `decision`, `deny_source` | Decisions (allowed/denied) by action and deny source |

Example Queries:

# Allow/deny ratio
sum(rate(breakglass_webhook_sar_allowed_total[5m])) by (cluster) 
/
sum(rate(breakglass_webhook_sar_denied_total[5m])) by (cluster)

# Deny sources (e.g., "policy", "no_session")
sum(rate(breakglass_webhook_sar_decisions_by_action_total{decision="denied"}[5m])) by (deny_source)

Session-Based Authorization

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `breakglass_webhook_session_sar_allowed_total` | Counter | `cluster` | Session grants that allowed access |
| `breakglass_webhook_session_sar_denied_total` | Counter | `cluster` | Session grants that denied access |
| `breakglass_webhook_session_sar_errors_total` | Counter | `cluster` | Errors checking session grants |
| `breakglass_webhook_session_sars_skipped_total` | Counter | `cluster` | Session checks skipped (e.g., due to config errors) |

Session Activity Tracking

Activity tracking records when sessions are actively used by the authorization webhook. Activity data is buffered and flushed periodically (default 30s) to reduce API server load. The lastActivity and activityCount fields on BreakglassSessionStatus are updated on each flush cycle via an optimistic-concurrency status merge-patch (retry-on-conflict). Failed flushes are re-queued with merge logic (up to 5 retries).

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `breakglass_session_activity_requests_total` | Counter | `cluster`, `granted_group` | Authorization requests that matched a breakglass session (bounded by granted group, not session name) |
| `breakglass_session_activity_flushes_total` | Counter | | Activity tracker flush cycles completed |
| `breakglass_session_activity_flush_errors_total` | Counter | | Failed activity status updates during flush |
| `breakglass_session_activity_dropped_total` | Counter | | Activity entries dropped due to tracker capacity limit |
| `breakglass_session_activity_buffer_size` | Gauge | | Number of sessions with buffered activity entries awaiting flush |
| `breakglass_session_idle_expired_total` | Counter | `cluster` | Sessions automatically expired due to idle timeout |

Example Queries:

# Activity rate by granted group
sum by (granted_group) (rate(breakglass_session_activity_requests_total[5m]))

# Activity rate per cluster
sum by (cluster) (rate(breakglass_session_activity_requests_total[5m]))

# Flush error rate
rate(breakglass_session_activity_flush_errors_total[5m])

# Idle expiration rate per cluster
sum by (cluster) (rate(breakglass_session_idle_expired_total[5m]))
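
The flush and capacity counters above also lend themselves to alerting. The rules below are a minimal sketch, not a recommended configuration: the group name, alert names, thresholds, and `for` durations are assumptions to tune per environment. Only the metric names come from the table above.

```yaml
groups:
  - name: breakglass-session-activity        # hypothetical group name
    rules:
      # Sustained failures writing lastActivity/activityCount back to session status
      - alert: BreakglassActivityFlushErrors  # hypothetical alert name
        expr: sum(rate(breakglass_session_activity_flush_errors_total[10m])) > 0
        for: 15m                              # assumed duration
        annotations:
          summary: "Breakglass activity tracker is failing to flush session activity"

      # Any dropped entry means the tracker reached its capacity limit
      - alert: BreakglassActivityDropped      # hypothetical alert name
        expr: sum(increase(breakglass_session_activity_dropped_total[15m])) > 0
        annotations:
          summary: "Breakglass activity entries were dropped due to tracker capacity"
```

Because failed flushes are re-queued and retried, a brief burst of flush errors may resolve on its own; the `for` clause is there to page only on sustained failures.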

Session-Based Authorization (Example Queries)

# Success rate of session grant checks
sum(rate(breakglass_webhook_session_sar_allowed_total[5m]))
/
(
  sum(rate(breakglass_webhook_session_sar_allowed_total[5m]))
  + sum(rate(breakglass_webhook_session_sar_denied_total[5m]))
)

# Error rate
sum(rate(breakglass_webhook_session_sar_errors_total[5m]))

SAR Processing Phase Timing

These metrics track the time spent in each phase of SubjectAccessReview processing, enabling detailed performance analysis and bottleneck identification.

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `breakglass_webhook_sar_phase_duration_seconds` | Histogram | `cluster`, `phase` | Duration of each SAR processing phase |

Processing Phases:

| Phase | Description |
|-------|-------------|
| `parse` | JSON request unmarshaling |
| `cluster_config` | ClusterConfig lookup |
| `sessions` | Load user groups and sessions |
| `debug_session` | Early debug session check |
| `deny_policy` | DenyPolicy evaluation |
| `rbac_check` | `canDoFn` RBAC verification (when applicable) |
| `session_sars` | Session authorization checks |
| `escalations` | Escalation discovery |
| `total` | Complete request duration |

Example Queries:

# Average time per phase
sum by (cluster, phase) (rate(breakglass_webhook_sar_phase_duration_seconds_sum[5m]))
/
sum by (cluster, phase) (rate(breakglass_webhook_sar_phase_duration_seconds_count[5m]))

# Identify slowest phase (p95)
histogram_quantile(0.95,
  sum by (le, phase) (rate(breakglass_webhook_sar_phase_duration_seconds_bucket[5m]))
)

# Total SAR processing time by cluster (p99)
histogram_quantile(0.99,
  sum by (le, cluster) (rate(breakglass_webhook_sar_phase_duration_seconds_bucket{phase="total"}[5m]))
)

# Compare session lookup vs RBAC check duration (p95)
histogram_quantile(0.95,
  sum by (le, phase) (rate(breakglass_webhook_sar_phase_duration_seconds_bucket{phase=~"sessions|rbac_check"}[5m]))
)
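
If dashboards evaluate per-phase quantiles on every refresh, it may be worth precomputing them with recording rules. The following is a sketch under assumptions: the group name and the recorded series names are hypothetical (they follow the common `level:metric:operations` naming pattern), and the 5m window is arbitrary.

```yaml
groups:
  - name: breakglass-sar-phase-recording     # hypothetical group name
    rules:
      # p95 latency per cluster and phase, precomputed from the histogram buckets
      - record: cluster_phase:breakglass_webhook_sar_phase_duration_seconds:p95_5m   # assumed name
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, cluster, phase) (
              rate(breakglass_webhook_sar_phase_duration_seconds_bucket[5m])
            )
          )

      # Mean latency per cluster and phase (sum of durations / number of observations)
      - record: cluster_phase:breakglass_webhook_sar_phase_duration_seconds:mean_5m  # assumed name
        expr: |
          sum by (cluster, phase) (rate(breakglass_webhook_sar_phase_duration_seconds_sum[5m]))
          /
          sum by (cluster, phase) (rate(breakglass_webhook_sar_phase_duration_seconds_count[5m]))
```

Note that the `le` label must be kept in the inner aggregation for `histogram_quantile` to work; it is dropped from the result automatically.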

Session Lifecycle Metrics

Track breakglass session creation, state changes, and expiration.

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `breakglass_session_created_total` | Counter | `cluster` | Sessions created |
| `breakglass_session_updated_total` | Counter | `cluster` | Session status updates (approve/reject/etc) |
| `breakglass_session_deleted_total` | Counter | `cluster` | Sessions deleted |
| `breakglass_session_expired_total` | Counter | `cluster` | Sessions expired automatically (time-based) |

Example Queries:

# Session creation volume (sessions created over the last hour)
sum(increase(breakglass_session_created_total[1h]))

# Session churn
sum(rate(breakglass_session_expired_total[5m])) by (cluster)
/ 
sum(rate(breakglass_session_created_total[5m])) by (cluster)

# Growth of active sessions (approximate)
sum(increase(breakglass_session_created_total[1d])) 
- 
sum(increase(breakglass_session_expired_total[1d]))

Mail Notification Metrics

Track success/failure of email notifications sent to approvers and requesters.

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `breakglass_mail_send_success_total` | Counter | `host` | Successfully sent emails |
| `breakglass_mail_send_failure_total` | Counter | `host` | Failed email sends |

Example Queries:

# Mail delivery success rate
sum(rate(breakglass_mail_send_success_total[5m]))
/
(
  sum(rate(breakglass_mail_send_success_total[5m]))
  + sum(rate(breakglass_mail_send_failure_total[5m]))
)

# Failed sends by mail server
sum by (host) (rate(breakglass_mail_send_failure_total[5m]))

API Endpoint Metrics

Track frontend and REST API usage with dedicated counters and histograms. All Breakglass session and escalation REST endpoints now emit these metrics automatically through a shared instrumentation wrapper, so create/read/update paths show up in dashboards without manual bookkeeping.

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `breakglass_api_endpoint_requests_total` | Counter | `endpoint` | Total requests routed through a given API handler (e.g., `handleGetEscalations`, `handleRequestBreakglassSession`, `getIdentityProvider`) |
| `breakglass_api_endpoint_errors_total` | Counter | `endpoint`, `status_code` | Error responses grouped by handler and HTTP status |
| `breakglass_api_endpoint_duration_seconds` | Histogram | `endpoint` | Request latency buckets (10ms to 1s) per handler |

Example Queries:

# Error rate per endpoint
sum(rate(breakglass_api_endpoint_errors_total[5m])) by (endpoint)
/
sum(rate(breakglass_api_endpoint_requests_total[5m])) by (endpoint)

# 95th percentile latency for the escalations API
histogram_quantile(
  0.95,
  sum by (le) (rate(breakglass_api_endpoint_duration_seconds_bucket{endpoint="handleGetEscalations"}[5m]))
)

**Session Endpoint Labels:**

| Endpoint Label | Description |
|----------------|-------------|
| `handleGetBreakglassSessionStatus` | GET `/api/breakglassSessions` list endpoint |
| `handleGetBreakglassSessionByName` | GET `/api/breakglassSessions/:name` detail endpoint |
| `handleRequestBreakglassSession` | POST create session |
| `handleApproveBreakglassSession` | POST `:name/approve` |
| `handleRejectBreakglassSession` | POST `:name/reject` |
| `handleWithdrawMyRequest` | POST `:name/withdraw` |
| `handleDropMySession` | POST `:name/drop` |
| `handleApproverCancel` | POST `:name/cancel` |
| `handleGetEscalations` | GET breakglassEscalations list |

ClusterConfig Validation Metrics

Monitor the health of cluster configurations.

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `breakglass_clusterconfigs_checked_total` | Counter | `cluster` | ClusterConfig validations performed |
| `breakglass_clusterconfigs_failed_total` | Counter | `cluster` | ClusterConfig validations that failed |

Example Queries:

# Config health per cluster (rate of validations that passed)
sum(rate(breakglass_clusterconfigs_checked_total[5m])) by (cluster)
- 
sum(rate(breakglass_clusterconfigs_failed_total[5m])) by (cluster)

# Failure rate
sum(rate(breakglass_clusterconfigs_failed_total[5m])) by (cluster)
/
sum(rate(breakglass_clusterconfigs_checked_total[5m])) by (cluster)

Pod Security Evaluation Metrics

Track risk-based pod security evaluation for exec/attach/portforward operations. See DenyPolicy - Pod Security Rules for configuration.

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `breakglass_pod_security_evaluations_total` | Counter | `cluster`, `policy`, `action` | Total evaluations (action: allowed/denied/warned) |
| `breakglass_pod_security_risk_score` | Histogram | `cluster` | Distribution of calculated risk scores |
| `breakglass_pod_security_factors_total` | Counter | `cluster`, `factor` | Count of detected risk factors (e.g., hostNetwork, privilegedContainer) |
| `breakglass_pod_security_denied_total` | Counter | `cluster`, `policy` | Exec/attach requests denied by security policy |
| `breakglass_pod_security_warnings_total` | Counter | `cluster`, `policy` | Exec/attach requests allowed with security warnings |

Example Queries:

# Deny rate by policy
sum(rate(breakglass_pod_security_denied_total[5m])) by (policy)

# Median risk score by cluster
histogram_quantile(0.50, sum(rate(breakglass_pod_security_risk_score_bucket[5m])) by (le, cluster))

# Most common risk factors
topk(5, sum(rate(breakglass_pod_security_factors_total[5m])) by (factor))

# Warning vs denial ratio
sum(rate(breakglass_pod_security_warnings_total[5m])) by (cluster)
/
sum(rate(breakglass_pod_security_denied_total[5m])) by (cluster)

Risk Factor Labels:

| Factor Label | Description |
|--------------|-------------|
| `hostNetwork` | Pod uses host network namespace |
| `hostPID` | Pod uses host PID namespace |
| `hostIPC` | Pod uses host IPC namespace |
| `privilegedContainer` | Container runs in privileged mode |
| `hostPathWritable` | Pod has writable hostPath mounts |
| `hostPathReadOnly` | Pod has read-only hostPath mounts |
| `runAsRoot` | Container runs as UID 0 |
| `capability:*` | Linux capability detected (e.g., `capability:SYS_ADMIN`) |

Cluster Circuit Breaker Metrics

These metrics track the per-cluster circuit breaker that protects against cascading failures when spoke clusters become unreachable. See Circuit Breaker for feature documentation.

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `breakglass_cluster_circuit_breaker_state` | Gauge | `cluster` | Current state: 0 = Closed, 1 = Open, 2 = Half-Open |
| `breakglass_cluster_circuit_breaker_rejections_total` | Counter | `cluster` | Requests rejected because the circuit was open |
| `breakglass_cluster_circuit_breaker_state_transitions_total` | Counter | `cluster`, `from`, `to` | State transitions (e.g., closed→open) |
| `breakglass_cluster_circuit_breaker_failures_total` | Counter | `cluster` | Transient failures recorded (network errors, timeouts, 5xx) |
| `breakglass_cluster_circuit_breaker_successes_total` | Counter | `cluster` | Successful operations recorded |
| `breakglass_cluster_circuit_breaker_consecutive_failures` | Gauge | `cluster` | Current consecutive-failure count (resets on success) |

Example Queries:

# Clusters currently in Open state
breakglass_cluster_circuit_breaker_state == 1

# Rejection rate per cluster (requests failing without reaching the spoke)
sum by (cluster) (rate(breakglass_cluster_circuit_breaker_rejections_total[5m]))

# State transition frequency
sum by (from, to) (rate(breakglass_cluster_circuit_breaker_state_transitions_total[5m]))

# Clusters with rising consecutive failures (approaching threshold)
breakglass_cluster_circuit_breaker_consecutive_failures > 2
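
The circuit breaker queries above can be turned into alert rules. This is a sketch only: the group name, alert names, thresholds, and durations are assumptions, and the flapping threshold in particular depends on how the breaker is configured in your deployment.

```yaml
groups:
  - name: breakglass-circuit-breaker          # hypothetical group name
    rules:
      # A spoke cluster has been unreachable long enough to keep its circuit open
      - alert: BreakglassCircuitBreakerOpen    # hypothetical alert name
        expr: breakglass_cluster_circuit_breaker_state == 1
        for: 5m                                # assumed duration
        annotations:
          summary: "Circuit breaker is open for cluster {{ $labels.cluster }}"

      # Frequent state changes suggest an unstable connection rather than a clean outage
      - alert: BreakglassCircuitBreakerFlapping  # hypothetical alert name
        expr: |
          sum by (cluster) (
            increase(breakglass_cluster_circuit_breaker_state_transitions_total[30m])
          ) > 6
        annotations:
          summary: "Circuit breaker for cluster {{ $labels.cluster }} is changing state frequently"
```

The flapping rule deliberately sums over all `from`/`to` combinations so that it does not depend on the exact label values the controller emits.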

Alerting Recommendations

Use these alert rules to monitor system health:

groups:
  - name: breakglass-alerts
    rules:
      # High webhook request latency
      - alert: BreakglassWebhookLatency
        expr: |
          histogram_quantile(0.99, sum by (le, cluster) (rate(breakglass_webhook_sar_phase_duration_seconds_bucket{phase="total"}[5m]))) > 1
        for: 5m
        annotations:
          summary: "Breakglass webhook latency is high"

      # High deny rate
      - alert: BreakglassHighDenyRate
        expr: |
          sum(rate(breakglass_webhook_sar_denied_total[5m])) by (cluster) 
          / 
          sum(rate(breakglass_webhook_sar_requests_total[5m])) by (cluster) > 0.5
        for: 10m
        annotations:
          summary: "High authorization denial rate on cluster {{ $labels.cluster }}"

      # Mail delivery failures
      - alert: BreakglassMailFailures
        expr: |
          sum(rate(breakglass_mail_send_failure_total[5m])) by (host) > 0.05
        for: 15m
        annotations:
          summary: "Mail delivery failures from {{ $labels.host }}"

      # Session SAR errors
      - alert: BreakglassSessionSARErrors
        expr: |
          sum(rate(breakglass_webhook_session_sar_errors_total[5m])) > 0.1
        for: 10m
        annotations:
          summary: "Session SAR check errors detected"

      # Cluster config failures
      - alert: BreakglassClusterConfigError
        expr: |
          sum(rate(breakglass_clusterconfigs_failed_total[5m])) by (cluster) > 0
        for: 5m
        annotations:
          summary: "ClusterConfig validation errors on {{ $labels.cluster }}"

Dashboard Recommendations

Consider creating Grafana dashboards with these panels:

Overview Dashboard:

  • Webhook requests per cluster (rate)
  • Allow/deny decision pie chart
  • Session lifecycle (created, expired, approved per day)
  • Mail delivery success rate

Operations Dashboard:

  • Denial rate trends (alert on spikes)
  • Session approval time distribution
  • Webhook latency percentiles (p50, p95, p99)
  • ClusterConfig health per cluster

Audit Dashboard:

  • Sessions created per cluster (daily)
  • Sessions by approver
  • High-frequency denials (potential issues)
  • Failed mail notifications

Metrics Retention & Cardinality

Cardinality Considerations:

  • Webhook SAR metrics include namespace and subresource labels which may have high cardinality in large clusters
  • Session SAR metrics use the cluster label for per-cluster monitoring and alerting
  • Consider using label relabeling in Prometheus to drop high-cardinality labels if needed

Example Prometheus relabeling:

metric_relabel_configs:
  # Remove the namespace label from the by-action request counter only.
  # Setting a label to an empty value drops it. Series that differed only
  # by namespace may then collide within a scrape.
  - source_labels: [__name__]
    regex: 'breakglass_webhook_sar_requests_by_action_total'
    target_label: namespace
    replacement: ''
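
An alternative to relabeling at scrape time is to aggregate the high-cardinality labels away with a recording rule and point dashboards at the recorded series. The sketch below assumes a hypothetical group name and recorded series name. It reduces query cost but does not reduce the number of raw series ingested.

```yaml
groups:
  - name: breakglass-cardinality-recording    # hypothetical group name
    rules:
      # Per-action request rate with namespace and subresource aggregated away
      - record: cluster_verb_resource:breakglass_webhook_sar_requests_by_action:rate5m  # assumed name
        expr: |
          sum by (cluster, verb, api_group, resource) (
            rate(breakglass_webhook_sar_requests_by_action_total[5m])
          )
```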

IdentityProvider Lifecycle Metrics

Monitor the health and performance of identity provider configuration reloading and OIDC authentication.

Configuration Reload Performance

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `breakglass_identity_provider_reload_duration_seconds` | Histogram | `provider_type` | Duration of IdentityProvider configuration reloads (buckets: 0.1s to 60s) |
| `breakglass_identity_provider_reload_attempts_total` | Counter | `status`, `provider_type` | Total reload attempts by status (success, error, skipped) |
| `breakglass_identity_provider_last_reload_timestamp_seconds` | Gauge | `provider_type` | Unix timestamp of last successful reload per provider type |

Example Queries:

# Configuration reload latency (p95)
histogram_quantile(0.95, rate(breakglass_identity_provider_reload_duration_seconds_bucket[5m]))

# Reload failure rate (should be 0 or near 0)
sum(rate(breakglass_identity_provider_reload_attempts_total{status="error"}[5m]))
/
sum(rate(breakglass_identity_provider_reload_attempts_total[5m]))

# Time since last successful reload (staleness detection)
time() - breakglass_identity_provider_last_reload_timestamp_seconds

Configuration State and Validity

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `breakglass_identity_provider_config_version` | Gauge | `provider_type` | Hash of current loaded configuration (changes when config updates) |
| `breakglass_identity_provider_status` | Gauge | `provider_name`, `provider_type` | Provider status: 1 = Active, 0 = Error, -1 = Disabled |
| `breakglass_telemetry_init_failed` | Gauge | | Set to 1 if OpenTelemetry exporter initialization failed at startup. Only observable when `--otel-required` is disabled (default) since the controller continues running; when `--otel-required` is enabled the controller exits before the metrics endpoint becomes available. |

Example Queries:

# Configuration version changes (indicates successful reloads)
changes(breakglass_identity_provider_config_version[1h])

# Active providers count
count(breakglass_identity_provider_status == 1)

# Disabled providers count
count(breakglass_identity_provider_status == -1)

# Alert on provider errors
breakglass_identity_provider_status == 0

Alert Rules

Recommended Prometheus alert rules:

groups:
  - name: breakglass_identity_provider
    interval: 30s
    rules:
      - alert: IdentityProviderReloadFailure
        expr: |
          rate(breakglass_identity_provider_reload_attempts_total{status="error"}[5m]) > 0
        for: 2m
        annotations:
          summary: "IdentityProvider reload failing"
          description: "Identity provider configuration reload has failed: {{ $value }}"

      - alert: IdentityProviderStale
        expr: |
          time() - breakglass_identity_provider_last_reload_timestamp_seconds > 900
        for: 5m
        annotations:
          summary: "IdentityProvider configuration is stale (>15m)"
          description: "No successful reload for 15+ minutes on {{ $labels.provider_type }}"

      - alert: IdentityProviderReloadSlow
        expr: |
          histogram_quantile(0.95, rate(breakglass_identity_provider_reload_duration_seconds_bucket[5m])) > 5
        for: 5m
        annotations:
          summary: "IdentityProvider reload is slow (>5s)"
          description: "p95 reload latency: {{ $value | humanizeDuration }}"

      - alert: IdentityProviderDown
        expr: |
          breakglass_identity_provider_status == 0
        for: 2m
        annotations:
          summary: "IdentityProvider {{ $labels.provider_name }} is DOWN"
          description: "Provider cannot be loaded or is in error state"

JWT & JWKS Metrics

Track JWT token validation and JWKS key fetching performance.

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `breakglass_jwt_validation_requests_total` | Counter | `identity_provider`, `mode` | Total JWT validation attempts |
| `breakglass_jwt_validation_success_total` | Counter | `identity_provider` | Successful JWT validations |
| `breakglass_jwt_validation_failure_total` | Counter | `identity_provider`, `reason` | Failed JWT validations |
| `breakglass_jwt_validation_duration_seconds` | Histogram | `identity_provider` | JWT validation latency |
| `breakglass_jwks_cache_hits_total` | Counter | `identity_provider` | JWKS key cache hits |
| `breakglass_jwks_cache_misses_total` | Counter | `identity_provider` | JWKS key cache misses |
| `breakglass_jwks_fetch_requests_total` | Counter | `issuer`, `status` | JWKS endpoint fetch attempts |
| `breakglass_jwks_fetch_duration_seconds` | Histogram | `issuer` | JWKS endpoint fetch latency |
| `breakglass_jwks_cache_size` | Gauge | `issuer` | Number of cached JWKS key sets |
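
Two ratios are commonly derived from counters like these: the share of JWT validations that fail and the JWKS cache hit ratio. The recording rules below are a sketch under assumptions: the group name and recorded series names are hypothetical, and the 5m window is arbitrary.

```yaml
groups:
  - name: breakglass-jwt-recording            # hypothetical group name
    rules:
      # Fraction of JWT validations that fail, per identity provider
      - record: identity_provider:breakglass_jwt_validation_failures:ratio_rate5m  # assumed name
        expr: |
          sum by (identity_provider) (rate(breakglass_jwt_validation_failure_total[5m]))
          /
          sum by (identity_provider) (rate(breakglass_jwt_validation_requests_total[5m]))

      # JWKS cache hit ratio, per identity provider
      - record: identity_provider:breakglass_jwks_cache_hits:ratio_rate5m          # assumed name
        expr: |
          sum by (identity_provider) (rate(breakglass_jwks_cache_hits_total[5m]))
          /
          (
            sum by (identity_provider) (rate(breakglass_jwks_cache_hits_total[5m]))
            + sum by (identity_provider) (rate(breakglass_jwks_cache_misses_total[5m]))
          )
```

Both ratios return no data for a provider while its denominator series are absent, and NaN while the denominator rate is zero, so dashboards should treat gaps as "no traffic" rather than as a failure.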

Multi-IDP & OIDC Proxy Metrics

Track multi-identity-provider configuration and OIDC proxy operations.

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `breakglass_multi_idp_config_requests_total` | Counter | | Multi-IDP config requests |
| `breakglass_multi_idp_config_success_total` | Counter | | Successful multi-IDP config responses |
| `breakglass_multi_idp_config_failure_total` | Counter | `reason` | Failed multi-IDP config responses |
| `breakglass_idp_selector_used_total` | Counter | | IDP selector usage in session creation |
| `breakglass_idp_selection_validations_total` | Counter | `result` | IDP selection validation results |
| `breakglass_oidc_proxy_requests_total` | Counter | `endpoint` | OIDC proxy requests |
| `breakglass_oidc_proxy_success_total` | Counter | `endpoint` | Successful OIDC proxy responses |
| `breakglass_oidc_proxy_failure_total` | Counter | `endpoint`, `reason` | Failed OIDC proxy requests |
| `breakglass_oidc_proxy_duration_seconds` | Histogram | `endpoint` | OIDC proxy request latency |
| `breakglass_oidc_proxy_path_validation_failure_total` | Counter | `reason` | Rejected proxy paths |
| `breakglass_oidc_proxy_tls_mode` | Gauge | `mode` | TLS mode for OIDC proxy connections |

Session-IDP Association Metrics

Track which identity providers are used for session creation and approval.

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `breakglass_session_created_with_idp_total` | Counter | `idp` | Sessions created via specific IDP |
| `breakglass_session_approved_with_idp_total` | Counter | `idp` | Sessions approved via specific IDP |
| `breakglass_escalation_idp_authorization_checks_total` | Counter | `result` | IDP authorization checks for escalations |
| `breakglass_escalation_allowed_idps_count` | Gauge | `escalation` | Number of allowed IDPs per escalation |

Mail Queue Metrics

Track the asynchronous mail queue and delivery pipeline.

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `breakglass_mail_queued_total` | Counter | | Emails added to send queue |
| `breakglass_mail_queue_dropped_total` | Counter | | Emails dropped (queue full) |
| `breakglass_mail_sent_total` | Counter | | Emails successfully sent from queue |
| `breakglass_mail_retry_scheduled_total` | Counter | | Emails scheduled for retry |
| `breakglass_mail_failed_total` | Counter | | Emails permanently failed |
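
Dropped and permanently failed emails mean approvers or requesters never received a notification, so they are reasonable candidates for alerts. The rules below are a sketch: the group name, alert names, and time windows are assumptions.

```yaml
groups:
  - name: breakglass-mail-queue               # hypothetical group name
    rules:
      # The send queue was full and at least one email was discarded
      - alert: BreakglassMailQueueDropping     # hypothetical alert name
        expr: sum(increase(breakglass_mail_queue_dropped_total[15m])) > 0
        annotations:
          summary: "Breakglass mail queue is full and dropping emails"

      # Retries were exhausted for at least one email
      - alert: BreakglassMailPermanentFailures # hypothetical alert name
        expr: sum(increase(breakglass_mail_failed_total[1h])) > 0
        annotations:
          summary: "Breakglass emails failed permanently after retries"
```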

MailProvider Metrics

Track mail provider health and email delivery.

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `breakglass_mailprovider_configured` | Gauge | `provider` | Whether a mail provider is configured |
| `breakglass_mailprovider_health_check_total` | Counter | `provider`, `result` | Health check results |
| `breakglass_mailprovider_health_check_duration_seconds` | Histogram | `provider` | Health check latency |
| `breakglass_mailprovider_status` | Gauge | `provider` | Provider status (1=ready, 0=not ready) |
| `breakglass_mailprovider_emails_sent_total` | Counter | `provider` | Emails sent via provider |
| `breakglass_mailprovider_emails_failed_total` | Counter | `provider` | Failed emails via provider |

Debug Session Metrics

Track debug session lifecycle and resource usage.

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `breakglass_debug_sessions_created_total` | Counter | `cluster`, `template` | Debug sessions created |
| `breakglass_debug_sessions_active` | Gauge | `cluster`, `template` | Currently active debug sessions |
| `breakglass_debug_sessions_terminated_total` | Counter | `cluster`, `reason` | Debug sessions terminated |
| `breakglass_debug_sessions_expired_total` | Counter | `cluster`, `template` | Debug sessions expired |
| `breakglass_debug_sessions_failed_total` | Counter | `cluster`, `template` | Debug sessions failed |
| `breakglass_debug_session_pod_restarts_total` | Counter | `cluster`, `session` | Debug pod restarts |
| `breakglass_debug_session_pod_failures_total` | Counter | `cluster`, `session`, `reason` | Debug pod failures |
| `breakglass_debug_session_duration_seconds` | Histogram | `cluster`, `template` | Debug session duration |
| `breakglass_debug_session_participants` | Gauge | `cluster`, `session` | Active participants per session |
| `breakglass_debug_session_pods_deployed` | Gauge | `cluster` | Debug pods currently deployed |
| `breakglass_debug_session_approval_required_total` | Counter | `cluster`, `template` | Debug sessions requiring approval |
| `breakglass_debug_session_approved_total` | Counter | `cluster`, `approver_type` | Debug sessions approved |
| `breakglass_debug_session_rejected_total` | Counter | `cluster`, `reason` | Debug sessions rejected |
Field Index Metrics

Track field indexer registrations at startup.

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `breakglass_index_registrations` | Gauge | `resource` | Number of registered field indexes |

Cluster Cache Metrics

Track cluster client caching and rest config loading.

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `breakglass_cluster_cache_hits_total` | Counter | `cluster` | Cluster client cache hits |
| `breakglass_cluster_cache_misses_total` | Counter | `cluster` | Cluster client cache misses |
| `breakglass_cluster_rest_config_loaded_total` | Counter | `cluster` | REST configs loaded |
| `breakglass_cluster_rest_config_errors_total` | Counter | `cluster` | REST config load errors |
| `breakglass_cluster_cache_invalidations_total` | Counter | `cluster` | Cache invalidations |
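
As a starting point for monitoring the cache, the sketch below records a per-cluster hit ratio and alerts on REST config load errors. The group name, recorded series name, alert name, and windows are assumptions, not values defined by Breakglass.

```yaml
groups:
  - name: breakglass-cluster-cache            # hypothetical group name
    rules:
      # Share of cluster client lookups served from the cache
      - record: cluster:breakglass_cluster_cache_hits:ratio_rate5m   # assumed name
        expr: |
          sum by (cluster) (rate(breakglass_cluster_cache_hits_total[5m]))
          /
          (
            sum by (cluster) (rate(breakglass_cluster_cache_hits_total[5m]))
            + sum by (cluster) (rate(breakglass_cluster_cache_misses_total[5m]))
          )

      # REST config for a cluster could not be loaded
      - alert: BreakglassRestConfigErrors      # hypothetical alert name
        expr: sum by (cluster) (increase(breakglass_cluster_rest_config_errors_total[15m])) > 0
        for: 5m                                # assumed duration
        annotations:
          summary: "REST config load errors for cluster {{ $labels.cluster }}"
```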

Scrape Configuration Best Practices

  1. Set appropriate scrape intervals - A 15s interval is usually fine, but high-volume environments may use 30s
  2. Add authentication - Use bearer tokens if the metrics endpoint requires authentication
  3. Enable compression - Consider gzip compression for large metric exports
  4. Add relabel configs - Drop unnecessary labels to reduce storage overhead
  5. Set appropriate retention - Breakglass metrics are mostly counters; 15 days retention is typical

Example production configuration:

scrape_configs:
  - job_name: 'breakglass'
    scrape_interval: 30s
    scrape_timeout: 10s
    static_configs:
      - targets: ['breakglass.example.com:8081']
    metrics_path: '/metrics'
    scheme: 'http'
    metric_relabel_configs:
      # Drop the high-cardinality subresource label from webhook SAR metrics
      # (an empty replacement removes the label)
      - source_labels: [__name__]
        regex: 'breakglass_webhook_sar_.*'
        target_label: subresource
        replacement: ''

Troubleshooting with Metrics

No metrics appearing:

  • Check bearer token/authentication credentials
  • Verify /metrics endpoint is accessible on port 8081
  • Check firewall rules between Prometheus and breakglass service

High denial rate:

  • Check for policy misconfigurations
  • Review DenyPolicy rules
  • Examine webhook logs for details

Mail delivery failures:

  • Check mail server connectivity via kubectl get mailproviders
  • Verify MailProvider status shows Ready
  • Check SMTP credentials secret exists and is accessible
  • Verify firewall rules to mail server
  • Review mail provider metrics (breakglass_mailprovider_*)

Session SAR errors:

  • Review ClusterConfig health
  • Check if clusters are reachable
  • Look for webhook timeout errors

For more troubleshooting guidance, see Troubleshooting Guide.