---
title: Tail-Based Sampling with service.criticality
linkTitle: Tail Sampling
---

This example demonstrates how to use the
[`service.criticality`](/docs/specs/semconv/resource/service/#service) resource
attribute to drive tail-based sampling decisions in the OpenTelemetry Collector,
keeping a larger share of traces from the services that matter most.

The demo application assigns a `service.criticality` value to each service,
classifying them by operational importance. The table also lists the sampling
rate that the Collector configuration below applies to each tier:

| 13 | + |
| 14 | +| Criticality | Sampling Rate | Services | |
| 15 | +| ----------- | ------------- | ------------------------------------------------------------------------------------------ | |
| 16 | +| `critical` | 100% | payment, checkout, frontend, frontend-proxy | |
| 17 | +| `high` | 50% | cart, product-catalog, currency, shipping | |
| 18 | +| `medium` | 10% | recommendation, ad, product-reviews, email | |
| 19 | +| `low` | 1% | accounting, fraud-detection, image-provider, load-generator, quote, flagd, flagd-ui, Kafka | |
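
This page does not show how the demo attaches `service.criticality` to each service. One common option is the standard `OTEL_RESOURCE_ATTRIBUTES` environment variable, which OpenTelemetry SDKs read as comma-separated `key=value` pairs. As a sketch under that assumption, the Python below encodes the table as data and builds the variable's value for a given service. `CRITICALITY`, `SERVICE_TIER`, and `resource_attributes_env` are hypothetical names, not part of the demo.

```python
# Hypothetical sketch: the criticality table as data, plus a helper that builds
# an OTEL_RESOURCE_ATTRIBUTES value for one service. Assumes the attribute is set
# through that environment variable, which this page does not state.
CRITICALITY = {
    "critical": ["payment", "checkout", "frontend", "frontend-proxy"],
    "high": ["cart", "product-catalog", "currency", "shipping"],
    "medium": ["recommendation", "ad", "product-reviews", "email"],
    "low": [
        "accounting", "fraud-detection", "image-provider", "load-generator",
        "quote", "flagd", "flagd-ui", "kafka",
    ],
}

# Invert the table: service name -> criticality tier.
SERVICE_TIER = {
    service: tier for tier, services in CRITICALITY.items() for service in services
}


def resource_attributes_env(service: str) -> str:
    """Return an OTEL_RESOURCE_ATTRIBUTES value carrying service.criticality."""
    tier = SERVICE_TIER[service]  # raises KeyError for services not in the table
    return f"service.criticality={tier}"


if __name__ == "__main__":
    for name in ("payment", "cart", "email", "quote"):
        print(f"{name}: OTEL_RESOURCE_ATTRIBUTES={resource_attributes_env(name)}")
```

A service started with `OTEL_RESOURCE_ATTRIBUTES=service.criticality=critical` would report the attribute on its resource, where the Collector's `string_attribute` policies can match it. If the variable already carries other attributes, the pair is appended with a comma.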

## Collector Configuration

To enable tail-based sampling, add the following to your
`otelcol-config-extras.yml`:
| 25 | + |
| 26 | +```yaml |
| 27 | +processors: |
| 28 | + tail_sampling: |
| 29 | + decision_wait: 10s |
| 30 | + num_traces: 100000 |
| 31 | + expected_new_traces_per_sec: 1000 |
| 32 | + policies: |
| 33 | + # Policy 1: Always sample critical services (100%) |
| 34 | + - name: critical-services-always-sample |
| 35 | + type: string_attribute |
| 36 | + string_attribute: |
| 37 | + key: service.criticality |
| 38 | + values: |
| 39 | + - critical |
| 40 | + enabled_regex_matching: false |
| 41 | + invert_match: false |
| 42 | + |
| 43 | + # Policy 2: Sample 50% of high-criticality services |
| 44 | + - name: high-criticality-probabilistic |
| 45 | + type: and |
| 46 | + and: |
| 47 | + and_sub_policy: |
| 48 | + - name: is-high-criticality |
| 49 | + type: string_attribute |
| 50 | + string_attribute: |
| 51 | + key: service.criticality |
| 52 | + values: |
| 53 | + - high |
| 54 | + - name: probabilistic-50 |
| 55 | + type: probabilistic |
| 56 | + probabilistic: |
| 57 | + sampling_percentage: 50 |
| 58 | + |
| 59 | + # Policy 3: Sample 10% of medium-criticality services |
| 60 | + - name: medium-criticality-probabilistic |
| 61 | + type: and |
| 62 | + and: |
| 63 | + and_sub_policy: |
| 64 | + - name: is-medium-criticality |
| 65 | + type: string_attribute |
| 66 | + string_attribute: |
| 67 | + key: service.criticality |
| 68 | + values: |
| 69 | + - medium |
| 70 | + - name: probabilistic-10 |
| 71 | + type: probabilistic |
| 72 | + probabilistic: |
| 73 | + sampling_percentage: 10 |
| 74 | + |
| 75 | + # Policy 4: Sample 1% of low-criticality services |
| 76 | + - name: low-criticality-probabilistic |
| 77 | + type: and |
| 78 | + and: |
| 79 | + and_sub_policy: |
| 80 | + - name: is-low-criticality |
| 81 | + type: string_attribute |
| 82 | + string_attribute: |
| 83 | + key: service.criticality |
| 84 | + values: |
| 85 | + - low |
| 86 | + - name: probabilistic-1 |
| 87 | + type: probabilistic |
| 88 | + probabilistic: |
| 89 | + sampling_percentage: 1 |
| 90 | + |
| 91 | + # Policy 5: Always sample error traces regardless of criticality |
| 92 | + - name: errors-always-sample |
| 93 | + type: status_code |
| 94 | + status_code: |
| 95 | + status_codes: |
| 96 | + - ERROR |
| 97 | + |
| 98 | + # Policy 6: Always sample slow traces from critical/high services |
| 99 | + - name: slow-critical-traces |
| 100 | + type: and |
| 101 | + and: |
| 102 | + and_sub_policy: |
| 103 | + - name: is-critical-or-high |
| 104 | + type: string_attribute |
| 105 | + string_attribute: |
| 106 | + key: service.criticality |
| 107 | + values: |
| 108 | + - critical |
| 109 | + - high |
| 110 | + - name: is-slow |
| 111 | + type: latency |
| 112 | + latency: |
| 113 | + threshold_ms: 5000 |
| 114 | + |
| 115 | +service: |
| 116 | + pipelines: |
| 117 | + traces: |
| 118 | + receivers: [otlp] |
| 119 | + processors: [resourcedetection, memory_limiter, transform, tail_sampling] |
| 120 | + exporters: [otlp, debug, spanmetrics] |
| 121 | +``` |
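
Each `probabilistic` sub-policy decides from a hash of the trace ID rather than from a fresh random draw. The Python below is a simplified model of that idea, not the processor's code. It hashes the trace ID with a salt and keeps the trace when the hash falls at or below a threshold proportional to `sampling_percentage`. SHA-256 and the salt value are stand-ins, and the processor's actual hash function may differ. Because every check in the model shares one salt, a trace kept at 1% is also kept at 10% and at 50%.

```python
import hashlib

MAX_HASH = 2**64 - 1


def trace_hash(trace_id: bytes, salt: bytes = b"example-salt") -> int:
    """Map a trace ID to a 64-bit integer.

    SHA-256 is a stand-in here; the processor's actual hash function may differ.
    """
    digest = hashlib.sha256(salt + trace_id).digest()
    return int.from_bytes(digest[:8], "big")


def probabilistic_keep(trace_id: bytes, sampling_percentage: float) -> bool:
    """Keep a trace when its hash falls at or below a percentage-based threshold."""
    threshold = MAX_HASH * sampling_percentage / 100
    return trace_hash(trace_id) <= threshold


if __name__ == "__main__":
    # Sequential fake trace IDs; hashing spreads them across the 64-bit range.
    trace_ids = [i.to_bytes(16, "big") for i in range(10_000)]
    for pct in (50, 10, 1):
        kept = sum(probabilistic_keep(t, pct) for t in trace_ids)
        print(f"sampling_percentage={pct}: kept {kept / len(trace_ids):.1%}")
```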
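
## How It Works

The tail-sampling processor groups incoming spans by trace ID and waits for
`decision_wait` (10 seconds here) after the first span of a trace arrives. It
then evaluates the configured policies against every span received for that
trace. A trace is sampled if **any** policy matches:

- **Critical services** are always sampled to ensure full visibility into
  payment flows, checkout, and user-facing services.
- **High-criticality services** are sampled at 50%, balancing observability with
  data volume.
- **Medium and low-criticality services** are sampled at progressively lower
  rates (10% and 1%) to reduce data volume from less critical paths.
- **Errors are always captured**: a trace containing any span with status
  `ERROR` is kept regardless of service criticality.
- **Slow traces** (longer than 5 seconds) that involve critical or
  high-criticality services are always sampled to help identify performance
  bottlenecks. Because Policy 1 already keeps every trace that touches a
  critical service, this policy only changes the outcome for traces whose most
  critical service is `high`.

Policies match on whole traces, not on individual spans. A trace that passes
through a critical service such as `frontend` is therefore kept in full,
including its spans from lower-criticality services. In practice, a trace is
kept at least as often as the rate of the most critical service it touches, so
the lower rates mainly affect traces that never reach a higher-criticality
service.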
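
The Python below is a minimal model of the six policies above that makes the "any policy matches" rule concrete. The `Span` dataclass, its fields, and `should_sample` are hypothetical simplifications, not the processor's data model or API. Each span carries its service's `service.criticality` value directly, and the probabilistic policies use a SHA-256 stand-in for the processor's trace-ID hash.

```python
import hashlib
from dataclasses import dataclass

MAX_HASH = 2**64 - 1


@dataclass
class Span:
    """Hypothetical, simplified span; criticality comes from its service's resource."""
    service: str
    criticality: str  # value of the service.criticality resource attribute
    status: str       # "UNSET", "OK", or "ERROR"
    start_ms: int
    end_ms: int


def probabilistic_keep(trace_id: bytes, percentage: float) -> bool:
    # Deterministic stand-in for the processor's hash-based probabilistic policy.
    digest = hashlib.sha256(b"example-salt" + trace_id).digest()
    return int.from_bytes(digest[:8], "big") <= MAX_HASH * percentage / 100


def should_sample(trace_id: bytes, spans: list[Span]) -> bool:
    """Return True if any of the six example policies matches the whole trace."""
    tiers = {s.criticality for s in spans}
    # Trace duration: earliest span start to latest span end.
    duration_ms = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)

    policies = [
        "critical" in tiers,                                         # Policy 1
        "high" in tiers and probabilistic_keep(trace_id, 50),        # Policy 2
        "medium" in tiers and probabilistic_keep(trace_id, 10),      # Policy 3
        "low" in tiers and probabilistic_keep(trace_id, 1),          # Policy 4
        any(s.status == "ERROR" for s in spans),                     # Policy 5
        bool(tiers & {"critical", "high"}) and duration_ms > 5000,   # Policy 6
    ]
    return any(policies)
```

In this model, a fast, error-free trace that touches both `high` and `low` services is kept exactly as often as a `high`-only trace, because the probabilistic checks share one salt. The real processor also hashes the trace ID, but its hash function and any per-policy `hash_salt` setting can differ from this sketch.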