Metrics

Track experiment performance with Prometheus or OpenTelemetry metrics to monitor experiment health, detect issues, and make data-driven decisions about rollouts.

Overview

The metrics system collects:

Invocation counts: How many times each condition is executed
Duration: How long each condition takes to execute
Outcomes: Success vs failure rates
Fallbacks: How often fallback logic triggers

Installation

dotnet add package ExperimentFramework.Metrics.Exporters

This package includes both Prometheus and OpenTelemetry exporters.

Prometheus Exporter

Basic Setup

using ExperimentFramework.Metrics.Exporters;

var prometheusMetrics = new PrometheusExperimentMetrics();

var experiments = ExperimentFrameworkBuilder.Create()
    .Trial<IDatabase>(t => t
        .UsingFeatureFlag("UseCloudDb")
        .AddControl<LocalDb>("false")
        .AddCondition<CloudDb>("true")
        .OnErrorRedirectAndReplayControl())
    .WithMetrics(prometheusMetrics)
    .UseDispatchProxy();

builder.Services.AddExperimentFramework(experiments);

var app = builder.Build();

// Expose metrics endpoint
app.MapGet("/metrics", () => prometheusMetrics.GeneratePrometheusOutput());

app.Run();

Metrics Output

# TYPE experiment_invocations_total counter
experiment_invocations_total{service="IDatabase",trial_key="cloud",method="GetDataAsync"} 1523

# TYPE experiment_success_total counter
experiment_success_total{service="IDatabase",trial_key="cloud",method="GetDataAsync"} 1478

# TYPE experiment_errors_total counter
experiment_errors_total{service="IDatabase",trial_key="cloud",method="GetDataAsync"} 45

# TYPE experiment_duration_seconds histogram
experiment_duration_seconds_sum{service="IDatabase",trial_key="cloud",method="GetDataAsync"} 45.234
experiment_duration_seconds_count{service="IDatabase",trial_key="cloud",method="GetDataAsync"} 1523

Collected Metrics

Metric	Type	Description	Tags
`experiment_invocations_total`	Counter	Total number of invocations	`service`, `trial_key`, `method`
`experiment_success_total`	Counter	Total number of successful invocations	`service`, `trial_key`, `method`
`experiment_errors_total`	Counter	Total number of failed invocations	`service`, `trial_key`, `method`
`experiment_duration_seconds`	Histogram	Duration of invocations	`service`, `trial_key`, `method`

Tags:

service: Service interface name (e.g., "IDatabase")
trial_key: Condition key (e.g., "cloud", "local")
method: Method name (e.g., "GetDataAsync")

OpenTelemetry Exporter

Basic Setup

using ExperimentFramework.Metrics.Exporters;
using OpenTelemetry.Metrics;

var otelMetrics = new OpenTelemetryExperimentMetrics("ExperimentFramework", "1.0.0");

var experiments = ExperimentFrameworkBuilder.Create()
    .Trial<IDatabase>(t => t
        .UsingFeatureFlag("UseCloudDb")
        .AddControl<LocalDb>("false")
        .AddCondition<CloudDb>("true")
        .OnErrorRedirectAndReplayControl())
    .WithMetrics(otelMetrics)
    .UseDispatchProxy();

builder.Services.AddExperimentFramework(experiments);

// Configure OpenTelemetry
builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddMeter("ExperimentFramework") // Match meter name
        .AddPrometheusExporter());

var app = builder.Build();

app.UseOpenTelemetryPrometheusScrapingEndpoint(); // /metrics

app.Run();

Full OpenTelemetry Integration

using OpenTelemetry.Resources;
using OpenTelemetry.Metrics;

var otelMetrics = new OpenTelemetryExperimentMetrics("ExperimentFramework", "1.0.0");

var experiments = ExperimentFrameworkBuilder.Create()
    .Trial<IDatabase>(t => t
        .UsingFeatureFlag("UseCloudDb")
        .AddControl<LocalDb>("false")
        .AddCondition<CloudDb>("true")
        .OnErrorRedirectAndReplayControl())
    .WithMetrics(otelMetrics)
    .UseDispatchProxy();

builder.Services.AddExperimentFramework(experiments);

builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource
        .AddService("MyApiService")
        .AddAttributes(new Dictionary<string, object>
        {
            ["environment"] = builder.Environment.EnvironmentName,
            ["version"] = "1.0.0"
        }))
    .WithMetrics(metrics => metrics
        .AddMeter("ExperimentFramework")
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri("http://otel-collector:4317");
        }));

Grafana Dashboards

Success Rate

# Overall success rate per condition
sum(rate(experiment_success_total[5m])) by (service, trial_key)
/
sum(rate(experiment_invocations_total[5m])) by (service, trial_key)

Average Latency

# Average latency per condition
sum(rate(experiment_duration_seconds_sum[5m])) by (service, trial_key)
/
sum(rate(experiment_duration_seconds_count[5m])) by (service, trial_key)

P95 Latency

# P95 latency (requires histogram buckets)
histogram_quantile(0.95,
  sum(rate(experiment_duration_seconds_bucket[5m])) by (service, trial_key, le)
)

Throughput

# Requests per second per condition
sum(rate(experiment_invocations_total[5m])) by (service, trial_key)

Error Rate

# Error rate per condition
sum(rate(experiment_errors_total[5m])) by (service, trial_key)
/
sum(rate(experiment_invocations_total[5m])) by (service, trial_key)

Comparison Dashboard

# Side-by-side latency comparison
sum(rate(experiment_duration_seconds_sum{trial_key="cloud"}[5m]))
/
sum(rate(experiment_duration_seconds_count{trial_key="cloud"}[5m]))

vs

sum(rate(experiment_duration_seconds_sum{trial_key="local"}[5m]))
/
sum(rate(experiment_duration_seconds_count{trial_key="local"}[5m]))

Real-World Example

Complete setup with alerting:

using ExperimentFramework.Metrics.Exporters;

// Create metrics exporter
var prometheusMetrics = new PrometheusExperimentMetrics();

// Configure experiments
var experiments = ExperimentFrameworkBuilder.Create()
    .Trial<IDatabase>(t => t
        .UsingFeatureFlag("UseCloudDb")
        .AddControl<LocalDb>("false")
        .AddCondition<CloudDb>("true")
        .OnErrorRedirectAndReplayControl())
    .Trial<IPaymentGateway>(t => t
        .UsingFeatureFlag("UseNewPaymentGateway")
        .AddControl<StableGateway>("false")
        .AddCondition<NewGateway>("true")
        .OnErrorRedirectAndReplayControl())
    .WithMetrics(prometheusMetrics)
    .WithTimeout(TimeSpan.FromSeconds(5), TimeoutAction.FallbackToDefault)
    .WithCircuitBreaker(options =>
    {
        options.FailureRatioThreshold = 0.5;
        options.MinimumThroughput = 10;
        options.OnCircuitOpen = CircuitBreakerAction.FallbackToDefault;
    })
    .UseDispatchProxy();

builder.Services.AddExperimentFramework(experiments);

var app = builder.Build();

// Metrics endpoint
app.MapGet("/metrics", () =>
{
    var output = prometheusMetrics.GeneratePrometheusOutput();
    return Results.Text(output, "text/plain; version=0.0.4");
});

// Health check based on metrics
app.MapGet("/health", () =>
{
    // Could check if cloud condition error rate is acceptable
    return Results.Ok(new { status = "healthy" });
});

app.Run();

Prometheus Alerts

High Error Rate

# prometheus.rules.yml
groups:
  - name: experiment_framework
    rules:
      - alert: ExperimentHighErrorRate
        expr: |
          sum(rate(experiment_errors_total[5m])) by (service, trial_key)
          /
          sum(rate(experiment_invocations_total[5m])) by (service, trial_key)
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Experiment {{ $labels.service }} trial {{ $labels.trial_key }} has high error rate"
          description: "Error rate is {{ $value | humanizePercentage }}"

High Latency

- alert: ExperimentHighLatency
  expr: |
    sum(rate(experiment_duration_seconds_sum[5m])) by (service, trial_key)
    /
    sum(rate(experiment_duration_seconds_count[5m])) by (service, trial_key)
    > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Experiment {{ $labels.service }} trial {{ $labels.trial_key }} has high latency"
    description: "Average latency is {{ $value }}s"

Low Traffic

- alert: ExperimentLowTraffic
  expr: |
    sum(rate(experiment_invocations_total[10m])) by (service) < 1
  for: 10m
  labels:
    severity: info
  annotations:
    summary: "Experiment {{ $labels.service }} has low traffic"
    description: "May not have enough data for statistical significance"

Best Practices

1. Add Metrics Early

Add metrics from the start of rollout:

var experiments = ExperimentFrameworkBuilder.Create()
    .Trial<IService>(t => t
        .UsingFeatureFlag("UseNewService")
        .AddControl<DefaultService>("false")
        .AddCondition<NewService>("true")
        .OnErrorRedirectAndReplayControl())
    .WithMetrics(prometheusMetrics)  // Add immediately
    .UseDispatchProxy();

2. Monitor Before Increasing Traffic

Watch metrics at low traffic levels (5-10%) before increasing:

// Start at 5%
{
  "FeatureManagement": {
    "UseNewService": {
      "EnabledFor": [
        {
          "Name": "Microsoft.Percentage",
          "Parameters": { "Value": 5 }
        }
      ]
    }
  }
}

Monitor for 24-48 hours. If metrics look good, increase to 25%, then 50%, then 100%.

3. Use Appropriate Aggregation Windows

# Too short - noisy
sum(rate(experiment_invocations_total[30s])) by (service, trial_key)

# Good - balanced
sum(rate(experiment_invocations_total[5m])) by (service, trial_key)

# Long-term trends
sum(rate(experiment_invocations_total[1h])) by (service, trial_key)

4. Set Up Alerts

Create alerts for:

Error rate > 5%
Latency > 2x baseline
Circuit breaker opening
No traffic (experiment not running)

5. Track Business Metrics

Combine experiment metrics with business metrics:

# Conversion rate by condition
sum(rate(business_conversions_total[5m])) by (service, trial_key)
/
sum(rate(business_pageviews_total[5m])) by (service, trial_key)

Troubleshooting

No Metrics Appearing

Symptom: /metrics endpoint returns empty or missing experiment metrics.

Solutions:

Verify WithMetrics() is called before UseDispatchProxy()
Ensure experiments are actually being invoked
Check that same metrics instance is used in endpoint
Verify experiment services are being called (not bypassed)

Metrics Delayed

Symptom: Metrics lag behind actual usage.

Solutions:

Check Prometheus scrape interval (default 15s)
Verify metrics endpoint is accessible
Use shorter aggregation windows for real-time data

Incorrect Counts

Symptom: Invocation counts don't match expected traffic.

Solutions:

Remember metrics are per-instance (not aggregated across pods)
Use Prometheus sum() to aggregate across instances
Check if proxy is actually being used (not direct injection)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Metrics

Overview

Installation

Prometheus Exporter

Basic Setup

Metrics Output

Collected Metrics

OpenTelemetry Exporter

Basic Setup

Full OpenTelemetry Integration

Grafana Dashboards

Success Rate

Average Latency

P95 Latency

Throughput

Error Rate

Comparison Dashboard

Real-World Example

Prometheus Alerts

High Error Rate

High Latency

Low Traffic

Best Practices

1. Add Metrics Early

2. Monitor Before Increasing Traffic

3. Use Appropriate Aggregation Windows

4. Set Up Alerts

5. Track Business Metrics

Troubleshooting

No Metrics Appearing

Metrics Delayed

Incorrect Counts

See Also

FilesExpand file tree

metrics.md

Latest commit

History

metrics.md

File metadata and controls

Metrics

Overview

Installation

Prometheus Exporter

Basic Setup

Metrics Output

Collected Metrics

OpenTelemetry Exporter

Basic Setup

Full OpenTelemetry Integration

Grafana Dashboards

Success Rate

Average Latency

P95 Latency

Throughput

Error Rate

Comparison Dashboard

Real-World Example

Prometheus Alerts

High Error Rate

High Latency

Low Traffic

Best Practices

1. Add Metrics Early

2. Monitor Before Increasing Traffic

3. Use Appropriate Aggregation Windows

4. Set Up Alerts

5. Track Business Metrics

Troubleshooting

No Metrics Appearing

Metrics Delayed

Incorrect Counts

See Also