| 1 | +# Scenarios for Tracing semantic conventions for messaging |
| 2 | + |
| 3 | +This document aims to capture scenarios and a road map, both of which will |
| 4 | +serve as a basis for [stabilizing](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md#stable) |
| 5 | +the [existing semantic conventions for messaging](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/messaging.md), |
| 6 | +which are currently in an [experimental](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md#experimental) |
| 7 | +state. The goal is to declare messaging semantic conventions stable before the |
| 8 | +end of 2021. |
| 9 | + |
| 10 | +## Motivation |
| 11 | + |
| 12 | +Many observability scenarios involve messaging systems, event streaming, or |
| 13 | +event-driven architectures. For Distributed Tracing to be useful across the |
| 14 | +entire scenario, having good observability for messaging or eventing operations |
| 15 | +is critical. To achieve this, OpenTelemetry must provide stable conventions and |
| 16 | +guidelines for instrumenting those operations. Popular messaging systems that |
| 17 | +should be supported include Kafka, RabbitMQ, Apache RocketMQ, Azure Event Hubs |
| 18 | +and Service Bus, Amazon SQS, SNS, and Kinesis. |
| 19 | + |
| 20 | +Bringing the existing experimental semantic conventions for messaging to a |
| 21 | +stable state is a crucial step for users and instrumentation authors, as it |
| 22 | +allows them to rely on [stability guarantees](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md#not-defined-semantic-conventions-stability), |
| 23 | +and thus to ship and use stable instrumentation. |
| 24 | + |
| 25 | +## Roadmap |
| 26 | + |
| 27 | +1. This OTEP, consisting of scenarios and a proposed roadmap, is approved and |
| 28 | + merged. |
| 29 | +2. [Stability guarantees](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md#not-defined-semantic-conventions-stability) |
| 30 | + for semantic conventions are approved and merged. This is not strictly related |
| 31 | + to semantic conventions for messaging but is a prerequisite for stabilizing any |
| 32 | + semantic conventions. |
| 33 | +3. OTEPs proposing guidance for general instrumentation problems that also |
| 34 | + pertain to messaging are approved and merged. Those general instrumentation |
| 35 | + problems include retries and instrumentation layers. |
| 36 | +4. An OTEP proposing a set of attributes and conventions covering the scenarios |
| 37 | + in this document is approved and merged. |
| 38 | +5. Proposed specification changes are verified by prototypes for the scenarios |
| 39 | + and examples below. |
| 40 | +6. The [specification for messaging semantic conventions for tracing](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/messaging.md) |
| 41 | + is updated according to the OTEP mentioned above and is declared |
| 42 | + [stable](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md#stable). |
| 43 | + |
| 44 | +The steps in the roadmap don't necessarily need to happen in the given order; |
| 45 | +some steps can be worked on in parallel. |
| 46 | + |
| 47 | +## Terminology |
| 48 | + |
| 49 | +The terminology used in this document is based on the [CloudEvents specification](https://github.com/cloudevents/spec/blob/v1.0.1/spec.md). |
| 50 | +CloudEvents is hosted by the CNCF and provides a specification for describing |
| 51 | +event data in common formats to provide interoperability across services, |
| 52 | +platforms and systems. |
| 53 | + |
| 54 | +### Message |
| 55 | + |
| 56 | +A "message" is a transport envelope for the transfer of information. The |
| 57 | +information is a combination of a payload and metadata. Metadata can be |
| 58 | +directed at consumers or at intermediaries on the message path. Messages are |
| 59 | +transferred via one or more intermediaries. Messages are uniquely |
| 60 | +identifiable. |
| 61 | + |
| 62 | +In the strict sense, a _message_ is a payload that is sent to a specific |
| 63 | +destination, whereas an _event_ is a signal emitted by a component upon |
| 64 | +reaching a given state. This document is agnostic of those differences and uses |
| 65 | +the term "message" in a wider sense to cover both concepts. |
| 66 | + |
| 67 | +### Producer |
| 68 | + |
| 69 | +The "producer" is a specific instance, process or device that creates and |
| 70 | +publishes a message. "Publishing" is the process of sending a message or batch |
| 71 | +to the intermediary or consumer. |
| 72 | + |
| 73 | +### Consumer |
| 74 | + |
| 75 | +A "consumer" receives the message and acts upon it. It uses the context and |
| 76 | +data to execute some logic, which might lead to the occurrence of new events. |
| 77 | + |
| 78 | +The consumer receives, processes, and settles a message. "Receiving" is the |
| 79 | +process of obtaining a message from the intermediary; "processing" is the |
| 80 | +process of acting on the information a message contains; and "settling" is the |
| 81 | +process of notifying an intermediary that a message was processed successfully. |
| 82 | + |
| 83 | +### Intermediary |
| 84 | + |
| 85 | +An "intermediary" receives a message to forward it to the next receiver, which |
| 86 | +might be another intermediary or a consumer. |
| 87 | + |
| 88 | +## Scenarios |
| 89 | + |
| 90 | +Producing and consuming a message involves five stages: |
| 91 | + |
| 92 | +``` |
| 93 | +PRODUCER |
| 94 | + |
| 95 | +Create |
| 96 | + | CONSUMER |
| 97 | + v +--------------+ |
| 98 | +Publish -> | INTERMEDIARY | -> Receive |
| 99 | + +--------------+ | |
| 100 | + ^ v |
| 101 | + . Process |
| 102 | + . | |
| 103 | + . v |
| 104 | + . . . . . . Settle |
| 105 | +``` |
| 106 | + |
| 107 | +1. The producer creates a message. |
| 108 | +2. The producer publishes the message to an intermediary. |
| 109 | +3. The consumer receives the message from an intermediary. |
| 110 | +4. The consumer processes the message. |
| 111 | +5. The consumer settles the message by notifying the intermediary that the |
| 112 | + message was processed. In some cases (fire-and-forget), the settlement stage |
| 113 | + does not exist. |
| 114 | + |
| 115 | +The messaging semantic conventions need to define how to model those stages in |
| 116 | +traces, how to propagate context, and how to enrich traces with attributes. |
| 117 | +Failures and retries need to be handled in all stages that interface with the |
| 118 | +intermediary (publish, receive and settle) and will be covered by general |
| 119 | +instrumentation guidance. |
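The five stages above can be sketched with a small in-memory model. This is only an illustration under stated assumptions: `Message`, `Intermediary`, `run_stages`, and the `trace_id` metadata key are hypothetical names and are not part of any OpenTelemetry API or of the semantic conventions. The sketch shows where context could be attached on the producer side and read back on the consumer side.

```python
import uuid
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str
    # Metadata travels with the payload; the "trace_id" key is a made-up example.
    metadata: dict = field(default_factory=dict)
    message_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class Intermediary:
    """Hypothetical in-memory stand-in for a broker."""

    def __init__(self):
        self._queue = deque()
        self.settled = set()

    def publish(self, message):
        self._queue.append(message)

    def receive(self):
        return self._queue.popleft()

    def settle(self, message_id):
        self.settled.add(message_id)


def run_stages(payload, trace_id):
    """Walk one message through create, publish, receive, process, settle."""
    stages = []
    broker = Intermediary()

    # 1. Create: the producer builds the message and attaches its context.
    message = Message(payload, metadata={"trace_id": trace_id})
    stages.append("create")

    # 2. Publish: the producer hands the message to the intermediary.
    broker.publish(message)
    stages.append("publish")

    # 3. Receive: the consumer obtains the message from the intermediary.
    received = broker.receive()
    stages.append("receive")

    # 4. Process: the consumer reads the propagated context and acts on the payload.
    extracted = received.metadata["trace_id"]
    stages.append("process")

    # 5. Settle: the consumer tells the intermediary that processing succeeded.
    broker.settle(received.message_id)
    stages.append("settle")

    return stages, extracted, received.message_id in broker.settled
```

In a fire-and-forget setup, the final `settle` call would simply be absent.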
| 120 | + |
| 121 | +Based on this model, the following scenarios capture major requirements and |
| 122 | +can be used for prototyping, as examples, and as test cases. |
| 123 | + |
| 124 | +### Individual settlement |
| 125 | + |
| 126 | +Individual settlement systems imply independent logical message flows. A single |
| 127 | +message is created and published in the same context, and it's delivered, |
| 128 | +consumed, and settled as a single entity. Each message needs to be settled |
| 129 | +individually. Usually, settlement information is stored by the intermediary, not |
| 130 | +by the consumer. |
| 131 | + |
| 132 | +Transport batching can be treated as a special case: messages can be |
| 133 | +transported together as an optimization, but are produced and consumed |
| 134 | +individually. |
| 135 | + |
| 136 | +As the diagram below shows, each message can be settled individually, |
| 137 | +regardless of the position of the message in the stream or queue. In contrast |
| 138 | +to checkpoint-based settlement, settlement information is related to individual |
| 139 | +messages and not to the overall message stream. |
| 140 | + |
| 141 | +``` |
| 142 | ++---------+ +---------+ +---------+ +---------+ +---------+ +---------+ |
| 143 | +|Message A| |Message B| |Message C| |Message D| |Message E| |Message F| |
| 144 | ++---------+ +---------+ +---------+ +---------+ +---------+ +---------+ |
| 145 | + Settled Settled Settled |
| 146 | +``` |
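As a minimal sketch of individual settlement, the following hypothetical class keeps one settlement flag per message, as an intermediary might. The class name and the choice of which messages are settled are assumptions made for illustration; they are not taken from any particular messaging system.

```python
class IndividualSettlementQueue:
    """Hypothetical intermediary that stores settlement state per message."""

    def __init__(self, message_ids):
        # One independent flag per message, kept in queue order.
        self._settled = {message_id: False for message_id in message_ids}

    def settle(self, message_id):
        if message_id not in self._settled:
            raise KeyError(f"unknown message: {message_id}")
        self._settled[message_id] = True

    def unsettled(self):
        return [m for m, done in self._settled.items() if not done]


queue = IndividualSettlementQueue(["A", "B", "C", "D", "E", "F"])

# Messages are settled in arbitrary order, regardless of queue position.
for message_id in ("E", "B", "D"):
    queue.settle(message_id)

remaining = queue.unsettled()
```

Settling `E` before `B` has no effect on any other message, which is the property that distinguishes this model from checkpoint-based settlement.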
| 147 | + |
| 148 | +#### Examples |
| 149 | + |
| 150 | +1. The following configurations should be instrumented and tested for RabbitMQ |
| 151 | + or a similar messaging system: |
| 152 | + |
| 153 | + * 1 producer, 1 queue, 2 consumers |
| 154 | + * 1 producer, fanout exchange to 2 queues, 2 consumers |
| 155 | + * 2 producers, fanout exchange to 2 queues, 2 consumers |
| 156 | + |
| 157 | + Each of the producers continuously produces messages. |
| 158 | + |
| 159 | +### Checkpoint-based settlement |
| 160 | + |
| 161 | +Messages are processed as a stream and settled by moving a checkpoint. A |
| 162 | +checkpoint points to a position of the stream up to which messages were |
| 163 | +processed and settled. Messages cannot be settled individually; instead, the |
| 164 | +checkpoint needs to be moved forward. Usually, the consumer is responsible for |
| 165 | +storing checkpointing information, not the intermediary. |
| 166 | + |
| 167 | +Checkpoint-based settlement systems are designed to efficiently receive and |
| 168 | +settle batches of messages. However, it is not possible to settle messages |
| 169 | +independently of their position in the stream (e.g., if message B is located at |
| 170 | +a later position in the stream than message A, then message B cannot be settled |
| 171 | +without also settling message A). |
| 172 | + |
| 173 | +As the diagram below shows, messages cannot be settled individually. Instead, |
| 174 | +settlement information is related to the overall ordered message stream. |
| 175 | + |
| 176 | +``` |
| 177 | + Checkpoint |
| 178 | + | |
| 179 | + v |
| 180 | ++---------+ +---------+ +---------+ +---------+ +---------+ +---------+ |
| 181 | +|Message A| |Message B| |Message C| |Message D| |Message E| |Message F| |
| 182 | ++---------+ +---------+ +---------+ +---------+ +---------+ +---------+ |
| 183 | + <--- Settled |
| 184 | +``` |
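Checkpoint-based settlement can be sketched as a single position that only moves forward. The class below is a hypothetical consumer-side model, not the API of Kafka or any other system; it assumes the checkpoint is stored as the count of settled messages from the start of the stream.

```python
class CheckpointingConsumer:
    """Hypothetical consumer that settles a stream by moving one checkpoint."""

    def __init__(self, message_ids):
        self._messages = list(message_ids)
        # Number of messages, counted from the start, that are settled.
        self.checkpoint = 0

    def settle_up_to(self, message_id):
        position = self._messages.index(message_id) + 1
        # The checkpoint never moves backwards.
        self.checkpoint = max(self.checkpoint, position)

    def settled(self):
        return self._messages[: self.checkpoint]

    def unsettled(self):
        return self._messages[self.checkpoint:]


stream = CheckpointingConsumer(["A", "B", "C", "D", "E", "F"])

# Settling C implicitly settles A and B as well.
stream.settle_up_to("C")
settled_after_c = stream.settled()

# Asking to settle an earlier message changes nothing.
stream.settle_up_to("B")
```

Because only one position is stored, there is no way to express "C is settled but A is not" in this model.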
| 185 | + |
| 186 | +#### Examples |
| 187 | + |
| 188 | +1. The following configurations should be instrumented and tested for Kafka or |
| 189 | + a similar messaging system: |
| 190 | + |
| 191 | + * 1 producer, 2 consumers in the same consumer group |
| 192 | + * 1 producer, 2 consumers in different consumer groups |
| 193 | + * 2 producers, 2 consumers in the same consumer group |
| 194 | + |
| 195 | + Each of the producers produces a continuous stream of messages. |
| 196 | + |
| 197 | +## Open questions |
| 198 | + |
| 199 | +The following areas are considered out of scope for a first stable release of |
| 200 | +semantic conventions for messaging. Although they are not explicitly addressed in |
| 201 | +a first stable release, it is important to ensure that this first stable |
| 202 | +release can serve as a solid foundation for further improvements in these areas. |
| 203 | + |
| 204 | +### Sampling |
| 205 | + |
| 206 | +The current experimental semantic conventions rely heavily on span links as |
| 207 | +a way to correlate spans. This is necessary, as several traces are needed to |
| 208 | +model the complete path that a message takes through the system. With the currently |
| 209 | +available sampling capabilities of OpenTelemetry, it is not possible to ensure |
| 210 | +that a set of linked traces is sampled. As a result, the sampled traces are |
| 211 | +unlikely to cover the complete path a message takes. |
| 212 | + |
| 213 | +Solving this problem requires a solution for sampling based on span links, |
| 214 | +which is not in scope for this OTEP. |
| 215 | + |
| 216 | +However, having too many span links in a single trace or having too |
| 217 | +many traces linked together can make the visualization and analysis of traces |
| 218 | +inefficient. This problem is not related to sampling and needs to be addressed |
| 219 | +by the semantic conventions. |
| 220 | + |
| 221 | +### Instrumenting intermediaries |
| 222 | + |
| 223 | +Instrumenting intermediaries can be valuable for debugging configuration or |
| 224 | +performance issues, or for detecting specific intermediary failures. |
| 225 | + |
| 226 | +Stable semantic conventions for instrumenting intermediaries can be provided at |
| 227 | +a future point in time, but are not in scope for this OTEP. The messaging |
| 228 | +semantic conventions this document refers to need to provide instrumentation |
| 229 | +that works well without the need to have intermediaries instrumented. |
| 230 | + |
| 231 | +### Metrics |
| 232 | + |
| 233 | +Messaging semantic conventions for tracing and for metrics overlap and should |
| 234 | +be as consistent as possible. However, semantic conventions for metrics will be |
| 235 | +handled separately and are not in scope for this OTEP. |
| 236 | + |
| 237 | +### Asynchronous message passing in the wider sense |
| 238 | + |
| 239 | +Asynchronous message passing in the wider sense is a communication method |
| 240 | +wherein the system puts a message in a queue or channel and does not require an |
| 241 | +immediate response to continue processing. This can range from utilizing a |
| 242 | +simple queue implementation to a full-fledged messaging system. |
| 243 | + |
| 244 | +Messaging semantic conventions are intended for systems that fit into one of |
| 245 | +the [scenarios laid out in the previous section](#scenarios), which cover a |
| 246 | +significant part of asynchronous message passing applications. However, there |
| 247 | +are low-level patterns of asynchronous message passing that don't fit in any of |
| 248 | +those scenarios, e.g., channels in Go or message passing in Erlang. Those |
| 249 | +might be covered by a different set of semantic conventions in the future. |
| 250 | + |
| 251 | +There also exist several frameworks for queuing and executing background jobs. |
| 252 | +Often, those frameworks utilize patterns of asynchronous message passing to |
| 253 | +queue jobs. Those frameworks might utilize messaging semantic conventions if |
| 254 | +they fit in any of the [scenarios laid out in the previous section](#scenarios), |
| 255 | +but otherwise targeting those various frameworks is not an explicit goal for |
| 256 | +these conventions. Those frameworks might be covered by [semantic conventions for "jobs"](https://github.com/open-telemetry/opentelemetry-specification/pull/1582) |
| 257 | +in the future. |
| 258 | + |
| 259 | +## Further reading |
| 260 | + |
| 261 | +* [CloudEvents](https://github.com/cloudevents/spec/blob/v1.0.1/spec.md) |
| 262 | +* [Message-Driven (in contrast to Event-Driven)](https://www.reactivemanifesto.org/glossary#Message-Driven) |
| 263 | +* [Asynchronous message passing](https://en.wikipedia.org/wiki/Message_passing#Asynchronous_message_passing) |
| 264 | +* [Existing semantic conventions for messaging](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/messaging.md) |