Track experiment performance with Prometheus or OpenTelemetry metrics to monitor experiment health, detect issues, and make data-driven decisions about rollouts.
The metrics system collects:
- Invocation counts: How many times each condition is executed
- Duration: How long each condition takes to execute
- Outcomes: Success vs failure rates
- Fallbacks: How often fallback logic triggers
dotnet add package ExperimentFramework.Metrics.Exporters
This package includes both Prometheus and OpenTelemetry exporters.
using ExperimentFramework.Metrics.Exporters;
var prometheusMetrics = new PrometheusExperimentMetrics();
var experiments = ExperimentFrameworkBuilder.Create()
.Trial<IDatabase>(t => t
.UsingFeatureFlag("UseCloudDb")
.AddControl<LocalDb>("false")
.AddCondition<CloudDb>("true")
.OnErrorRedirectAndReplayControl())
.WithMetrics(prometheusMetrics)
.UseDispatchProxy();
builder.Services.AddExperimentFramework(experiments);
var app = builder.Build();
// Expose metrics endpoint
app.MapGet("/metrics", () => prometheusMetrics.GeneratePrometheusOutput());
app.Run();
# TYPE experiment_invocations_total counter
experiment_invocations_total{service="IDatabase",trial_key="cloud",method="GetDataAsync"} 1523
# TYPE experiment_success_total counter
experiment_success_total{service="IDatabase",trial_key="cloud",method="GetDataAsync"} 1478
# TYPE experiment_errors_total counter
experiment_errors_total{service="IDatabase",trial_key="cloud",method="GetDataAsync"} 45
# TYPE experiment_duration_seconds histogram
experiment_duration_seconds_sum{service="IDatabase",trial_key="cloud",method="GetDataAsync"} 45.234
experiment_duration_seconds_count{service="IDatabase",trial_key="cloud",method="GetDataAsync"} 1523
| Metric | Type | Description | Tags |
|---|---|---|---|
| experiment_invocations_total | Counter | Total number of invocations | service, trial_key, method |
| experiment_success_total | Counter | Total number of successful invocations | service, trial_key, method |
| experiment_errors_total | Counter | Total number of failed invocations | service, trial_key, method |
| experiment_duration_seconds | Histogram | Duration of invocations, in seconds | service, trial_key, method |
Tags:
- service: Service interface name (e.g., "IDatabase")
- trial_key: Condition key (e.g., "cloud", "local")
- method: Method name (e.g., "GetDataAsync")
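To make the shape of a sample line concrete, here is a minimal sketch that splits one line of the text output into its metric name, tags, and value. `MetricLineParser` is a hypothetical helper written for this illustration, not part of ExperimentFramework, and it deliberately ignores parts of the Prometheus text format such as escaped quotes, unlabelled samples, and trailing timestamps.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not an ExperimentFramework API.
public static class MetricLineParser
{
    // Matches lines of the form: name{key="value",...} number
    private static readonly Regex LinePattern = new Regex(
        @"^(?<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?<labels>[^}]*)\}\s+(?<value>\S+)$");

    // Matches a single key="value" pair inside the braces.
    private static readonly Regex LabelPattern = new Regex(
        "(?<key>[a-zA-Z_][a-zA-Z0-9_]*)=\"(?<val>[^\"]*)\"");

    public static (string Name, Dictionary<string, string> Labels, double Value) Parse(string line)
    {
        var match = LinePattern.Match(line.Trim());
        if (!match.Success)
            throw new FormatException($"Not a labelled sample line: {line}");

        var labels = new Dictionary<string, string>();
        foreach (Match label in LabelPattern.Matches(match.Groups["labels"].Value))
            labels[label.Groups["key"].Value] = label.Groups["val"].Value;

        // Sample values are written with '.' as the decimal separator.
        var value = double.Parse(match.Groups["value"].Value, CultureInfo.InvariantCulture);
        return (match.Groups["name"].Value, labels, value);
    }
}
```

Parsing the `experiment_errors_total` sample shown earlier would yield the name `experiment_errors_total`, the three tags `service`, `trial_key`, and `method`, and the value 45.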
using ExperimentFramework.Metrics.Exporters;
using OpenTelemetry.Metrics;
var otelMetrics = new OpenTelemetryExperimentMetrics("ExperimentFramework", "1.0.0");
var experiments = ExperimentFrameworkBuilder.Create()
.Trial<IDatabase>(t => t
.UsingFeatureFlag("UseCloudDb")
.AddControl<LocalDb>("false")
.AddCondition<CloudDb>("true")
.OnErrorRedirectAndReplayControl())
.WithMetrics(otelMetrics)
.UseDispatchProxy();
builder.Services.AddExperimentFramework(experiments);
// Configure OpenTelemetry
builder.Services.AddOpenTelemetry()
.WithMetrics(metrics => metrics
.AddMeter("ExperimentFramework") // Match meter name
.AddPrometheusExporter());
var app = builder.Build();
app.UseOpenTelemetryPrometheusScrapingEndpoint(); // /metrics
app.Run();
using OpenTelemetry.Resources;
using OpenTelemetry.Metrics;
var otelMetrics = new OpenTelemetryExperimentMetrics("ExperimentFramework", "1.0.0");
var experiments = ExperimentFrameworkBuilder.Create()
.Trial<IDatabase>(t => t
.UsingFeatureFlag("UseCloudDb")
.AddControl<LocalDb>("false")
.AddCondition<CloudDb>("true")
.OnErrorRedirectAndReplayControl())
.WithMetrics(otelMetrics)
.UseDispatchProxy();
builder.Services.AddExperimentFramework(experiments);
builder.Services.AddOpenTelemetry()
.ConfigureResource(resource => resource
.AddService("MyApiService")
.AddAttributes(new Dictionary<string, object>
{
["environment"] = builder.Environment.EnvironmentName,
["version"] = "1.0.0"
}))
.WithMetrics(metrics => metrics
.AddMeter("ExperimentFramework")
.AddAspNetCoreInstrumentation()
.AddHttpClientInstrumentation()
.AddOtlpExporter(options =>
{
options.Endpoint = new Uri("http://otel-collector:4317");
        }));
# Overall success rate per condition
sum(rate(experiment_success_total[5m])) by (service, trial_key)
/
sum(rate(experiment_invocations_total[5m])) by (service, trial_key)
# Average latency per condition
sum(rate(experiment_duration_seconds_sum[5m])) by (service, trial_key)
/
sum(rate(experiment_duration_seconds_count[5m])) by (service, trial_key)
# P95 latency (requires histogram buckets)
histogram_quantile(0.95,
sum(rate(experiment_duration_seconds_bucket[5m])) by (service, trial_key, le)
)
# Requests per second per condition
sum(rate(experiment_invocations_total[5m])) by (service, trial_key)
# Error rate per condition
sum(rate(experiment_errors_total[5m])) by (service, trial_key)
/
sum(rate(experiment_invocations_total[5m])) by (service, trial_key)
# Side-by-side latency comparison
sum(rate(experiment_duration_seconds_sum{trial_key="cloud"}[5m]))
/
sum(rate(experiment_duration_seconds_count{trial_key="cloud"}[5m]))
# compared with (evaluate as a separate query)
sum(rate(experiment_duration_seconds_sum{trial_key="local"}[5m]))
/
sum(rate(experiment_duration_seconds_count{trial_key="local"}[5m]))
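The ratio queries above all divide one counter's increase by another's over the same window. The sketch below shows that arithmetic on two snapshots of the counters for a single condition. `CounterSnapshot` and `ConditionStats` are hypothetical names used only for illustration, and the code is a simplification: unlike PromQL's `rate()`, it does not handle counter resets or extrapolate to the window edges.

```csharp
using System;

// Hypothetical types for illustration only; not an ExperimentFramework API.
public readonly record struct CounterSnapshot(
    double Invocations,
    double Successes,
    double Errors,
    double DurationSumSeconds);

public static class ConditionStats
{
    // Successful invocations divided by all invocations within the window.
    public static double SuccessRate(CounterSnapshot start, CounterSnapshot end) =>
        Ratio(end.Successes - start.Successes, end.Invocations - start.Invocations);

    // Failed invocations divided by all invocations within the window.
    public static double ErrorRate(CounterSnapshot start, CounterSnapshot end) =>
        Ratio(end.Errors - start.Errors, end.Invocations - start.Invocations);

    // Increase of the duration sum divided by the increase of the invocation count.
    public static double AverageLatencySeconds(CounterSnapshot start, CounterSnapshot end) =>
        Ratio(end.DurationSumSeconds - start.DurationSumSeconds,
              end.Invocations - start.Invocations);

    // Returns NaN when there was no traffic in the window, mirroring a 0/0 division.
    private static double Ratio(double numerator, double denominator) =>
        denominator > 0 ? numerator / denominator : double.NaN;
}
```

Taking an all-zero starting snapshot and the sample values shown earlier (1523 invocations, 1478 successes, 45 errors, 45.234 seconds in total) gives a success rate of about 0.970, an error rate of about 0.030, and an average latency of about 0.0297 seconds.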
Complete setup with alerting:
using ExperimentFramework.Metrics.Exporters;
// Create metrics exporter
var prometheusMetrics = new PrometheusExperimentMetrics();
// Configure experiments
var experiments = ExperimentFrameworkBuilder.Create()
.Trial<IDatabase>(t => t
.UsingFeatureFlag("UseCloudDb")
.AddControl<LocalDb>("false")
.AddCondition<CloudDb>("true")
.OnErrorRedirectAndReplayControl())
.Trial<IPaymentGateway>(t => t
.UsingFeatureFlag("UseNewPaymentGateway")
.AddControl<StableGateway>("false")
.AddCondition<NewGateway>("true")
.OnErrorRedirectAndReplayControl())
.WithMetrics(prometheusMetrics)
.WithTimeout(TimeSpan.FromSeconds(5), TimeoutAction.FallbackToDefault)
.WithCircuitBreaker(options =>
{
options.FailureRatioThreshold = 0.5;
options.MinimumThroughput = 10;
options.OnCircuitOpen = CircuitBreakerAction.FallbackToDefault;
})
.UseDispatchProxy();
builder.Services.AddExperimentFramework(experiments);
var app = builder.Build();
// Metrics endpoint
app.MapGet("/metrics", () =>
{
var output = prometheusMetrics.GeneratePrometheusOutput();
return Results.Text(output, "text/plain; version=0.0.4");
});
// Health check based on metrics
app.MapGet("/health", () =>
{
// Could check if cloud condition error rate is acceptable
return Results.Ok(new { status = "healthy" });
});
app.Run();
# prometheus.rules.yml
groups:
  - name: experiment_framework
    rules:
      - alert: ExperimentHighErrorRate
        expr: |
          sum(rate(experiment_errors_total[5m])) by (service, trial_key)
          /
          sum(rate(experiment_invocations_total[5m])) by (service, trial_key)
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Experiment {{ $labels.service }} trial {{ $labels.trial_key }} has high error rate"
          description: "Error rate is {{ $value | humanizePercentage }}"
      - alert: ExperimentHighLatency
        expr: |
          sum(rate(experiment_duration_seconds_sum[5m])) by (service, trial_key)
          /
          sum(rate(experiment_duration_seconds_count[5m])) by (service, trial_key)
          > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Experiment {{ $labels.service }} trial {{ $labels.trial_key }} has high latency"
          description: "Average latency is {{ $value }}s"
      - alert: ExperimentLowTraffic
        expr: |
          sum(rate(experiment_invocations_total[10m])) by (service) < 1
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "Experiment {{ $labels.service }} has low traffic"
          description: "May not have enough data for statistical significance"
Add metrics from the start of rollout:
var experiments = ExperimentFrameworkBuilder.Create()
.Trial<IService>(t => t
.UsingFeatureFlag("UseNewService")
.AddControl<DefaultService>("false")
.AddCondition<NewService>("true")
.OnErrorRedirectAndReplayControl())
.WithMetrics(prometheusMetrics) // Add immediately
.UseDispatchProxy();
Watch metrics at low traffic levels (5-10%) before increasing:
// Start at 5%
{
"FeatureManagement": {
"UseNewService": {
"EnabledFor": [
{
"Name": "Microsoft.Percentage",
"Parameters": { "Value": 5 }
}
]
}
}
}
Monitor for 24-48 hours. If metrics look good, increase to 25%, then 50%, then 100%.
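The staged rollout described here can be written down as a small decision function. The following is a sketch under stated assumptions: `RolloutPlanner` is a hypothetical helper, not an ExperimentFramework API; "metrics look good" is reduced to a single error-rate check with an assumed default of 5%; and the minimum observation time defaults to the lower bound of 24 hours. When the check fails, the sketch simply holds at the current percentage, which is one possible policy among several.

```csharp
using System;

// Hypothetical helper for illustration only; not an ExperimentFramework API.
public static class RolloutPlanner
{
    // Rollout steps taken from the text: 5% -> 25% -> 50% -> 100%.
    private static readonly int[] Steps = { 5, 25, 50, 100 };

    // Returns the percentage to use next. The current percentage is kept when
    // the observation period is too short or the error rate is too high.
    public static int NextPercentage(
        int currentPercentage,
        double errorRate,
        TimeSpan observed,
        double maxErrorRate = 0.05,          // assumed threshold
        TimeSpan? minObservation = null)     // defaults to 24 hours
    {
        var required = minObservation ?? TimeSpan.FromHours(24);

        if (observed < required || double.IsNaN(errorRate) || errorRate > maxErrorRate)
            return currentPercentage;

        foreach (var step in Steps)
        {
            if (step > currentPercentage)
                return step;
        }

        return currentPercentage; // already at the final step
    }
}
```

For example, `RolloutPlanner.NextPercentage(5, 0.01, TimeSpan.FromHours(30))` moves from 5% to 25%, while the same call with only 2 hours of observation stays at 5%. The returned value would still have to be written to the feature flag configuration by whatever process manages it.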
# Too short - noisy
sum(rate(experiment_invocations_total[30s])) by (service, trial_key)
# Good - balanced
sum(rate(experiment_invocations_total[5m])) by (service, trial_key)
# Long-term trends
sum(rate(experiment_invocations_total[1h])) by (service, trial_key)
Create alerts for:
- Error rate > 5%
- Latency > 2x baseline
- Circuit breaker opening
- No traffic (experiment not running)
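The thresholds in this list can also be evaluated in application code, for instance to back a health endpoint. Below is a minimal sketch under stated assumptions: `ExperimentAlerts` is a hypothetical helper, the input values are assumed to have been computed elsewhere (for example from counter deltas), and the circuit breaker condition is left out because the metrics shown in this document do not include a circuit-state series.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not an ExperimentFramework API.
public static class ExperimentAlerts
{
    public static IReadOnlyList<string> Evaluate(
        double invocationsPerSecond,
        double errorRate,
        double averageLatencySeconds,
        double baselineLatencySeconds)
    {
        var alerts = new List<string>();

        // Without traffic the ratios below are meaningless, so report only that.
        if (invocationsPerSecond <= 0)
        {
            alerts.Add("NoTraffic");
            return alerts;
        }

        // Error rate above 5%.
        if (errorRate > 0.05)
            alerts.Add("HighErrorRate");

        // Latency above twice the baseline (skipped when no baseline is known).
        if (baselineLatencySeconds > 0 && averageLatencySeconds > 2 * baselineLatencySeconds)
            alerts.Add("HighLatency");

        return alerts;
    }
}
```

An empty result means none of the checked conditions fired. A condition with an 8% error rate and 0.5 s average latency against a 0.2 s baseline would return both `HighErrorRate` and `HighLatency`.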
Combine experiment metrics with business metrics:
# Conversion rate by condition
sum(rate(business_conversions_total[5m])) by (service, trial_key)
/
sum(rate(business_pageviews_total[5m])) by (service, trial_key)
Symptom: /metrics endpoint returns empty or missing experiment metrics.
Solutions:
- Verify WithMetrics() is called before UseDispatchProxy()
- Ensure experiments are actually being invoked
- Check that the same metrics instance is used in the endpoint
- Verify experiment services are being called (not bypassed)
Symptom: Metrics lag behind actual usage.
Solutions:
- Check the Prometheus scrape interval (scrape_interval defaults to 1m unless configured; many example configurations use 15s)
- Verify metrics endpoint is accessible
- Use shorter aggregation windows for real-time data
Symptom: Invocation counts don't match expected traffic.
Solutions:
- Remember metrics are per-instance (not aggregated across pods)
- Use Prometheus sum() to aggregate across instances
- Check that the proxy is actually being used (not direct injection)
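To see why per-instance counts can look too low, the sketch below adds up invocation counts reported by several instances, which is roughly what a `sum(...) by (trial_key)` query does on the Prometheus side. `InstanceAggregator` and the instance names are hypothetical and used only for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not an ExperimentFramework API.
public static class InstanceAggregator
{
    // Groups per-instance samples by condition key and sums their counts.
    public static Dictionary<string, double> SumByCondition(
        IEnumerable<(string Instance, string TrialKey, double Invocations)> samples) =>
        samples
            .GroupBy(sample => sample.TrialKey)
            .ToDictionary(group => group.Key, group => group.Sum(sample => sample.Invocations));
}
```

With two instances reporting 500 and 523 invocations for the same condition, each instance on its own shows roughly half of the real traffic, and only the sum of 1023 matches what users actually generated.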
- Telemetry - OpenTelemetry distributed tracing
- Timeout Enforcement - Track timeout rates
- Circuit Breaker - Monitor circuit state