Commit fad8665

pyohannes authored and carlosalberto committed
Proposed scenarios and roadmap for messaging semantic conventions for tracing (open-telemetry/oteps#173)
[Semantic conventions for messaging systems for tracing](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/messaging.md) are available, but are in an experimental state. A [workgroup focusing on messaging semantic conventions](open-telemetry/community#819) will work on bringing the existing semantic conventions for messaging to a stable state. The workgroup meets on **Thursdays at 8 AM PST**. This document proposes a scope for an initial stable version of messaging semantic conventions, as well as a roadmap. It should serve as a starting point for initial discussions in the workgroup and, once agreed on, define the further agenda of the workgroup.
1 parent 4ce5194 commit fad8665

1 file changed

Lines changed: 264 additions & 0 deletions

@@ -0,0 +1,264 @@
# Scenarios for Tracing semantic conventions for messaging

This document aims to capture scenarios and a roadmap, both of which will
serve as a basis for [stabilizing](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md#stable)
the [existing semantic conventions for messaging](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/messaging.md),
which are currently in an [experimental](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md#experimental)
state. The goal is to declare messaging semantic conventions stable before the
end of 2021.

## Motivation

Many observability scenarios involve messaging systems, event streaming, or
event-driven architectures. For Distributed Tracing to be useful across the
entire scenario, having good observability for messaging or eventing operations
is critical. To achieve this, OpenTelemetry must provide stable conventions and
guidelines for instrumenting those operations. Popular messaging systems that
should be supported include Kafka, RabbitMQ, Apache RocketMQ, Azure Event Hubs
and Service Bus, Amazon SQS, SNS, and Kinesis.

Bringing the existing experimental semantic conventions for messaging to a
stable state is a crucial step for users and instrumentation authors, as it
allows them to rely on [stability guarantees](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md#not-defined-semantic-conventions-stability),
and thus to ship and use stable instrumentation.

## Roadmap

1. This OTEP, consisting of scenarios and a proposed roadmap, is approved and
   merged.
2. [Stability guarantees](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md#not-defined-semantic-conventions-stability)
   for semantic conventions are approved and merged. This is not strictly related
   to semantic conventions for messaging but is a prerequisite for stabilizing any
   semantic conventions.
3. OTEPs proposing guidance for general instrumentation problems that also
   pertain to messaging are approved and merged. Those general instrumentation
   problems include retries and instrumentation layers.
4. An OTEP proposing a set of attributes and conventions covering the scenarios
   in this document is approved and merged.
5. Proposed specification changes are verified by prototypes for the scenarios
   and examples below.
6. The [specification for messaging semantic conventions for tracing](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/messaging.md)
   is updated according to the OTEP mentioned above and is declared
   [stable](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md#stable).

The steps in the roadmap don't necessarily need to happen in the given order;
some steps can be worked on in parallel.

## Terminology

The terminology used in this document is based on the [CloudEvents specification](https://github.com/cloudevents/spec/blob/v1.0.1/spec.md).
CloudEvents is hosted by the CNCF and provides a specification for describing
event data in common formats to provide interoperability across services,
platforms and systems.

### Message

A "message" is a transport envelope for the transfer of information. The
information is a combination of a payload and metadata. Metadata can be
directed at consumers or at intermediaries on the message path. Messages are
transferred via one or more intermediaries. Messages are uniquely
identifiable.

In the strict sense, a _message_ is a payload that is sent to a specific
destination, whereas an _event_ is a signal emitted by a component upon
reaching a given state. This document is agnostic of those differences and uses
the term "message" in a wider sense to cover both concepts.

### Producer

The "producer" is a specific instance, process or device that creates and
publishes a message. "Publishing" is the process of sending a message or batch
to the intermediary or consumer.

### Consumer

A "consumer" receives the message and acts upon it. It uses the context and
data to execute some logic, which might lead to the occurrence of new events.

The consumer receives, processes, and settles a message. "Receiving" is the
process of obtaining a message from the intermediary; "processing" is the
process of acting on the information a message contains; "settling" is the
process of notifying an intermediary that a message was processed successfully.

### Intermediary

An "intermediary" receives a message to forward it to the next receiver, which
might be another intermediary or a consumer.

## Scenarios

Producing and consuming a message involves five stages:

```
PRODUCER

Create
  |                           CONSUMER
  v        +--------------+
Publish -> | INTERMEDIARY | -> Receive
           +--------------+       |
                  ^               v
                  .            Process
                  .               |
                  .               v
                  . . . . . .  Settle
```

1. The producer creates a message.
2. The producer publishes the message to an intermediary.
3. The consumer receives the message from an intermediary.
4. The consumer processes the message.
5. The consumer settles the message by notifying the intermediary that the
   message was processed. In some cases (fire-and-forget), the settlement stage
   does not exist.

The messaging semantic conventions need to define how to model those stages in
traces, how to propagate context, and how to enrich traces with attributes.
Failures and retries need to be handled in all stages that interface with the
intermediary (publish, receive and settle) and will be covered by general
instrumentation guidance.

Based on this model, the following scenarios capture major requirements and
can be used for prototyping, as examples, and as test cases.

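As an illustration of the five stages above, the following minimal sketch walks one message through create, publish, receive, process, and settle. It is not part of the proposed conventions: the `Message` and `Intermediary` classes and the `run_once` function are hypothetical, and the intermediary is an in-memory stand-in rather than a real broker.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List
import uuid


@dataclass
class Message:
    """Transport envelope: a payload plus metadata, uniquely identifiable."""
    payload: str
    metadata: Dict[str, str] = field(default_factory=dict)
    message_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class Intermediary:
    """Hypothetical in-memory stand-in for a broker (an assumption)."""

    def __init__(self) -> None:
        self._queue: Deque[Message] = deque()
        self.settled: List[str] = []

    def publish(self, message: Message) -> None:
        self._queue.append(message)

    def receive(self) -> Message:
        return self._queue.popleft()

    def settle(self, message_id: str) -> None:
        self.settled.append(message_id)


def run_once(intermediary: Intermediary, payload: str) -> List[str]:
    """Walk one message through all five stages and record their order."""
    stages: List[str] = []

    message = Message(payload)                # 1. producer creates
    stages.append("create")

    intermediary.publish(message)             # 2. producer publishes
    stages.append("publish")

    received = intermediary.receive()         # 3. consumer receives
    stages.append("receive")

    _ = received.payload.upper()              # 4. consumer processes (placeholder logic)
    stages.append("process")

    intermediary.settle(received.message_id)  # 5. consumer settles
    stages.append("settle")

    return stages
```

In a fire-and-forget setup, the final `settle` call would simply be absent. Each of the three calls on the intermediary (publish, receive, settle) is a point where failures and retries can occur.
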
### Individual settlement

Individual settlement systems imply independent logical message flows. A single
message is created and published in the same context, and it's delivered,
consumed, and settled as a single entity. Each message needs to be settled
individually. Usually, settlement information is stored by the intermediary, not
by the consumer.

Transport batching can be treated as a special case: messages can be
transported together as an optimization, but are produced and consumed
individually.

As the diagram below shows, each message can be settled individually,
regardless of the position of the message in the stream or queue. In contrast
to checkpoint-based settlement, settlement information is related to individual
messages and not to the overall message stream.

```
+---------+ +---------+ +---------+ +---------+ +---------+ +---------+
|Message A| |Message B| |Message C| |Message D| |Message E| |Message F|
+---------+ +---------+ +---------+ +---------+ +---------+ +---------+
Settled Settled Settled
```

#### Examples

1. The following configurations should be instrumented and tested for RabbitMQ
   or a similar messaging system:

   * 1 producer, 1 queue, 2 consumers
   * 1 producer, fanout exchange to 2 queues, 2 consumers
   * 2 producers, fanout exchange to 2 queues, 2 consumers

   Each of the producers continuously produces messages.

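To make the individual settlement model more concrete, here is a minimal sketch in which the intermediary keeps one settlement flag per message. The class name `IndividualSettlementQueue` and its methods are hypothetical, and the choice of which messages get settled is arbitrary; it is not taken from the diagram above.

```python
from typing import Dict, List


class IndividualSettlementQueue:
    """Hypothetical intermediary that tracks settlement per message."""

    def __init__(self, message_ids: List[str]) -> None:
        # The intermediary, not the consumer, stores the settlement state.
        self._settled: Dict[str, bool] = {mid: False for mid in message_ids}

    def settle(self, message_id: str) -> None:
        """Settle one message, regardless of its position in the queue."""
        self._settled[message_id] = True

    def settled_ids(self) -> List[str]:
        return [mid for mid, done in self._settled.items() if done]

    def unsettled_ids(self) -> List[str]:
        return [mid for mid, done in self._settled.items() if not done]


queue = IndividualSettlementQueue(["A", "B", "C", "D", "E", "F"])

# Messages can be settled in any order and with gaps in between.
for message_id in ("E", "A", "C"):
    queue.settle(message_id)
```

After this run, messages A, C, and E are settled while B, D, and F are still outstanding, which a checkpoint-based system could not express.
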
### Checkpoint-based settlement

Messages are processed as a stream and settled by moving a checkpoint. A
checkpoint points to a position of the stream up to which messages were
processed and settled. Messages cannot be settled individually; instead, the
checkpoint needs to be moved forward. Usually, the consumer is responsible for
storing checkpointing information, not the intermediary.

Checkpoint-based settlement systems are designed to efficiently receive and
settle batches of messages. However, it is not possible to settle messages
independently of their position in the stream (e.g., if message B is located at
a later position in the stream than message A, then message B cannot be settled
without also settling message A).

As the diagram below shows, messages cannot be settled individually. Instead,
settlement information is related to the overall ordered message stream.

```
Checkpoint
|
v
+---------+ +---------+ +---------+ +---------+ +---------+ +---------+
|Message A| |Message B| |Message C| |Message D| |Message E| |Message F|
+---------+ +---------+ +---------+ +---------+ +---------+ +---------+
<--- Settled
```

#### Examples

1. The following configurations should be instrumented and tested for Kafka or
   a similar messaging system:

   * 1 producer, 2 consumers in the same consumer group
   * 1 producer, 2 consumers in different consumer groups
   * 2 producers, 2 consumers in the same consumer group

   Each of the producers produces a continuous stream of messages.

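The checkpoint model can be sketched in a similar way. In the following minimal example, settlement is a single position in an ordered stream, held on the consumer side. The class name `CheckpointingConsumer` and its methods are hypothetical, and the checkpoint position chosen here is arbitrary rather than read off the diagram above.

```python
from typing import List


class CheckpointingConsumer:
    """Hypothetical consumer that settles by advancing a checkpoint."""

    def __init__(self, stream: List[str]) -> None:
        self._stream = list(stream)
        # Number of messages, counted from the start of the stream, that are
        # settled. The consumer stores this, not the intermediary.
        self._checkpoint = 0

    def advance_to(self, message_id: str) -> None:
        """Move the checkpoint just past message_id; it never moves back."""
        position = self._stream.index(message_id) + 1
        self._checkpoint = max(self._checkpoint, position)

    def settled_ids(self) -> List[str]:
        return self._stream[: self._checkpoint]


consumer = CheckpointingConsumer(["A", "B", "C", "D", "E", "F"])

# Settling C implicitly settles everything before it in the stream.
consumer.advance_to("C")
```

Here messages A, B, and C end up settled together. There is no way to settle C alone, which is the key difference from individual settlement.
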
## Open questions

The following areas are considered out of scope for a first stable release of
semantic conventions for messaging. Although they are not explicitly considered
for a first stable release, it is important to ensure that this first stable
release can serve as a solid foundation for further improvements in these areas.

### Sampling

The current experimental semantic conventions rely heavily on span links as
a way to correlate spans. This is necessary, as several traces are needed to
model the complete path that a message takes through the system. With the currently
available sampling capabilities of OpenTelemetry, it is not possible to ensure
that a set of linked traces is sampled. As a result, it is unlikely that the
sampled traces cover the complete path a message takes.

Solving this problem requires a solution for sampling based on span links,
which is not in scope for this OTEP.

However, having too many span links in a single trace or having too
many traces linked together can make the visualization and analysis of traces
inefficient. This problem is not related to sampling and needs to be addressed
by the semantic conventions.

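To illustrate why complete coverage of linked traces is unlikely, the sketch below computes the chance that every trace on a message path is sampled. It rests on a simplifying assumption that is not stated in this document: each trace gets an independent sampling decision with the same ratio. The function name `complete_path_probability` is hypothetical.

```python
def complete_path_probability(sampling_ratio: float, linked_traces: int) -> float:
    """Probability that all linked traces on a message path are sampled.

    Assumes each trace is sampled independently with the same ratio,
    which is a simplification of real sampler behavior.
    """
    if not 0.0 <= sampling_ratio <= 1.0:
        raise ValueError("sampling_ratio must be between 0 and 1")
    if linked_traces < 1:
        raise ValueError("linked_traces must be at least 1")
    return sampling_ratio ** linked_traces


# A message path modeled by three linked traces, each sampled at 10%.
probability = complete_path_probability(0.1, 3)
```

Under these assumptions, a 10% sampling ratio and three linked traces leave roughly a 0.1% chance of capturing the whole path, and the chance shrinks further with every additional linked trace.
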
### Instrumenting intermediaries

Instrumenting intermediaries can be valuable for debugging configuration or
performance issues, or for detecting specific intermediary failures.

Stable semantic conventions for instrumenting intermediaries can be provided at
a future point in time, but are not in scope for this OTEP. The messaging
semantic conventions this document refers to need to provide instrumentation
that works well without the need to have intermediaries instrumented.

### Metrics

Messaging semantic conventions for tracing and for metrics overlap and should
be as consistent as possible. However, semantic conventions for metrics will be
handled separately and are not in scope for this OTEP.

### Asynchronous message passing in the wider sense

Asynchronous message passing in the wider sense is a communication method
wherein the system puts a message in a queue or channel and does not require an
immediate response to continue processing. This can range from utilizing a
simple queue implementation to a full-fledged messaging system.

Messaging semantic conventions are intended for systems that fit into one of
the [scenarios laid out in the previous section](#scenarios), which cover a
significant part of asynchronous message passing applications. However, there
are low-level patterns of asynchronous message passing that don't fit in any of
those scenarios, e.g., channels in Go or message passing in Erlang. Those
might be covered by a different set of semantic conventions in the future.

There also exist several frameworks for queuing and executing background jobs.
Those frameworks often use patterns of asynchronous message passing to
queue jobs. They might utilize messaging semantic conventions if
they fit in any of the [scenarios laid out in the previous section](#scenarios),
but otherwise targeting those various frameworks is not an explicit goal for
these conventions. Those frameworks might be covered by [semantic conventions for "jobs"](https://github.com/open-telemetry/opentelemetry-specification/pull/1582)
in the future.

## Further reading

* [CloudEvents](https://github.com/cloudevents/spec/blob/v1.0.1/spec.md)
* [Message-Driven (in contrast to Event-Driven)](https://www.reactivemanifesto.org/glossary#Message-Driven)
* [Asynchronous message passing](https://en.wikipedia.org/wiki/Message_passing#Asynchronous_message_passing)
* [Existing semantic conventions for messaging](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/messaging.md)