Allow multiple Resource instances per SDK.
OpenTelemetry needs to address two fundamental problems:
- Reporting data against "mutable" or "changing" entities, where currently an
  SDK is allowed a single `Resource`, whose lifetime must match the lifetime of the SDK itself.
- Providing true multi-tenant capabilities, where, e.g. metrics about one tenant will be implicitly separated from metrics about another tenant.
The first problem is well outlined in (not accepted) OTEP 4316.
Fundamentally, while we need an immutable identity, the reality is that Resource
in today's OpenTelemetry usage is not strong enough to support key use cases. For example,
OpenTelemetry JS, in the Node.js environment, cannot guarantee that all identifying
attributes for Resource are discovered prior to SDK startup, leading to an "eventual identity" situation
that must be addressed in the Specification. Additionally, our Client/Browser SIG has been
trying to model the notion of "User Session", which has a much shorter lifespan than the SDK itself, so
requiring a single identity that is both immutable and matches the SDK lifetime rules out any good mechanism for reporting user sessions.
However, OTEP 4316 explores relaxing the immutability restriction rather than providing a new mechanism. During prototyping, this initially seemed easy to accomplish, but it ran into major complications both in interactions with OpAMP (where a stable identity for the SDK is desired) and in designing a Metrics SDK, where changes in Resource mean a dynamic and divergent storage strategy, without a priori knowledge of whether these resource mutations are relevant to the metric or not.
Additionally, today, when reporting data from one "process" about multiple resources, the only recourse available is to instantiate multiple SDKs and define a different resource in each SDK. This absolute separation can be highly problematic with the notion of "built-in" instrumentation, where libraries (e.g. gRPC) come with out-of-the-box OpenTelemetry support and it's unclear how to ensure this instrumentation is used correctly in the presence of multiple SDKs.
We propose these new fundamental concepts in OpenTelemetry:
- `Resource` remains immutable
  - Building on OTEP 264, identifying attributes are clearly outlined in Resource going forward, addressing unclear real world usage of Resource attributes (e.g. identifying attributes in OpAMP).
  - SDK will be given an explicit initialization stage where `Resource` is not in a complete state, addressing OpenTelemetry JS concerns around async resource detection.
- The SDK will be identified by a single `Resource` provided during SDK startup.
  - ResourceDetection will be expanded, as described in OTEP 264.
  - An explicit section about SDK initialization will be created.
- Signal Providers in the SDK will allow "specialization" of the default SDK
  resource. We create new `{Signal}Provider` instances by providing a new `Entity` on the existing provider.
  - This will construct a new `Resource` specific to that provider.
  - The new provider will re-use all configuration (e.g. export pipeline) defined from the base provider.
This proposal is split between an instrumentation-facing API and the required behavior of that API in the SDK.
Previously, every {Signal}Provider API defined a single
Get a {Signal} operation. These will be expanded with a new
For Entity operation, which will construct a new {SignalProvider}
API component for reporting against a specific Entity.
This API MUST accept the following parameters:
- `entities`: Specifies the `Entity` set to associate with emitted telemetry.
Any entities provided which conflict with those entities already provided in
the SDK Resource represent an override of identity. The SDK MUST resolve the
conflict without causing a fatal error.
The set of Entity provided to these operations MUST only include one Entity
per type.
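The two rules above (one Entity per type, and override on conflict) can be sketched as follows. This is a minimal illustration rather than SDK code: the simplified `Entity` shape and the helper names `findDuplicateTypes` and `overrideByType` are hypothetical, and the sketch assumes that an incoming Entity "conflicts" with an SDK Entity when both have the same type.

```typescript
// Hypothetical, simplified Entity shape; attribute values are strings here.
type Entity = { type: string; id: Record<string, string> };

// Returns the types that appear more than once, so the caller can decide how
// to handle an invalid set (the text does not prescribe the failure mode).
function findDuplicateTypes(entities: Entity[]): string[] {
  const seen = new Set<string>();
  const duplicates = new Set<string>();
  for (const entity of entities) {
    if (seen.has(entity.type)) duplicates.add(entity.type);
    seen.add(entity.type);
  }
  return Array.from(duplicates);
}

// Assumption: an incoming Entity conflicts with an SDK Entity of the same type.
// The incoming Entity replaces it, so the override is never a fatal error.
function overrideByType(sdkEntities: Entity[], incoming: Entity[]): Entity[] {
  const byType = new Map<string, Entity>();
  for (const entity of sdkEntities) byType.set(entity.type, entity);
  for (const entity of incoming) byType.set(entity.type, entity);
  return Array.from(byType.values());
}

const sdkEntities: Entity[] = [
  { type: "service", id: { "service.name": "checkout" } },
  { type: "session", id: { "session.id": "a" } },
];
// The incoming session Entity replaces the SDK's session Entity.
const effective = overrideByType(sdkEntities, [
  { type: "session", id: { "session.id": "b" } },
]);
```

Returning the duplicate types instead of throwing keeps the choice of failure handling with the caller, which seems closer to the usual expectation that API calls do not raise at runtime.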
An Entity is a collection of the following values:
- `type`: A string describing the class of the Entity.
- `id`: An attribute collection that identifies an instance of the Entity.
- (optional) `description`: An attribute collection that describes the instance of the Entity.
- (optional) `schema_url`: Specifies the Schema URL that should be recorded for this Entity.
An Entity is uniquely identified by the combination of its type and id.
schema_url defines the version of the schema used to describe an Entity. If
two entities exist with the same identity and different schema_urls, they
MUST be considered in conflict with each other.
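As an illustration of the identity and conflict rules above, the sketch below compares entities by type and id and flags a conflict when the identities match but the schema URLs differ. The names `identityKey` and `inSchemaConflict` are hypothetical, attribute values are simplified to strings, the `schema_url` field is spelled `schemaUrl` to follow TypeScript naming, and the URLs are placeholders.

```typescript
// Hypothetical Entity shape following the field list above.
type Attributes = Record<string, string>;
type Entity = {
  type: string;
  id: Attributes;
  description?: Attributes;
  schemaUrl?: string; // corresponds to schema_url in the text
};

// Builds a comparable identity from type + id. Keys are sorted so that the
// order in which attributes were added does not affect equality.
function identityKey(entity: Entity): string {
  const sortedId = Object.keys(entity.id)
    .sort()
    .map((key) => [key, entity.id[key]]);
  return JSON.stringify([entity.type, sortedId]);
}

// Same identity with different schema URLs counts as a conflict.
function inSchemaConflict(a: Entity, b: Entity): boolean {
  return identityKey(a) === identityKey(b) && a.schemaUrl !== b.schemaUrl;
}

const hostV1: Entity = {
  type: "host",
  id: { "host.id": "h1" },
  schemaUrl: "https://example.com/schemas/1.0.0",
};
const hostV2: Entity = {
  type: "host",
  id: { "host.id": "h1" },
  description: { "host.name": "worker-1" },
  schemaUrl: "https://example.com/schemas/2.0.0",
};
const conflicting = inSchemaConflict(hostV1, hostV2);
```

Note that `description` plays no part in the identity, so the two hosts above are the same Entity as far as identity goes and differ only in schema version.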
When a For Entity operation is received by a provider, a new child
Entity Bound Provider of the same type MUST be created and returned with the
following restrictions:
- The `Entity Bound Provider` MUST be associated with a newly created `Resource` which is the result of the incoming `Entity` set merged into the original `Provider`'s resource following the existing `Resource` merging algorithm. Telemetry created by the parent MUST continue to be associated with the original unmodified resource.
- The `Bound Provider` MUST share an export pipeline with its parent. The export component (`SpanProcessor`, `MetricReader`, `LogRecordProcessor`, etc.) MUST NOT be `Shutdown` by the `Bound Provider`. This MAY be achieved by wrapping the export component in a proxy component which ignores calls to `Shutdown` or translates them into `ForceFlush`.
- The `Bound Provider` MUST be configured exactly the same as its parent. A configuration change on a parent `Provider` MUST be reflected in all of its child `Entity Bound Providers`. This MAY be achieved by directly sharing the configuration object between `Providers`.
- A `Bound Provider` MUST NOT be directly configurable. All configuration comes from its parent.
- If `ForceFlush` or `Shutdown` is called on a `Provider` it MUST also flush all of its child `Entity Bound Providers`.
- If `Shutdown` is called on a `Bound Provider` it MUST be treated as a `ForceFlush`. It MUST NOT shut down its export pipeline.
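The restrictions above can be sketched with a toy provider. Everything here is hypothetical (`Provider`, `ExportPipeline`, `NonClosingPipeline`, `RecordingPipeline`, `mergeEntities`, and the single-flag `Config`), calls are synchronous for brevity, and the merge assumes that entities conflict by type with the incoming one winning. It is meant only to show the shared configuration object and the proxy that turns `Shutdown` into `ForceFlush`.

```typescript
// Hypothetical sketch: every name here is illustrative, and real SDK
// providers flush and shut down asynchronously.
type Entity = { type: string; id: Record<string, string> };
type Resource = { entities: Entity[] };
type Config = { enabled: boolean };

interface ExportPipeline {
  export(resource: Resource, record: string): void;
  forceFlush(): void;
  shutdown(): void;
}

// Test double that records calls so the behaviour is observable.
class RecordingPipeline implements ExportPipeline {
  exported: Array<{ resource: Resource; record: string }> = [];
  flushes = 0;
  isShutdown = false;
  export(resource: Resource, record: string): void { this.exported.push({ resource, record }); }
  forceFlush(): void { this.flushes += 1; }
  shutdown(): void { this.isShutdown = true; }
}

// Proxy given to bound providers: Shutdown is translated into ForceFlush.
class NonClosingPipeline implements ExportPipeline {
  private readonly delegate: ExportPipeline;
  constructor(delegate: ExportPipeline) { this.delegate = delegate; }
  export(resource: Resource, record: string): void { this.delegate.export(resource, record); }
  forceFlush(): void { this.delegate.forceFlush(); }
  shutdown(): void { this.delegate.forceFlush(); }
}

// Assumption: entities merge by type, and the incoming Entity wins.
function mergeEntities(base: Resource, incoming: Entity[]): Resource {
  const kept = base.entities.filter(
    (existing) => !incoming.some((entity) => entity.type === existing.type),
  );
  return { entities: [...kept, ...incoming] };
}

class Provider {
  readonly resource: Resource;
  private readonly config: Config; // same object as the parent's
  private readonly pipeline: ExportPipeline;
  private readonly children: Provider[] = [];

  constructor(resource: Resource, config: Config, pipeline: ExportPipeline) {
    this.resource = resource;
    this.config = config;
    this.pipeline = pipeline;
  }

  // Returns a child bound to the merged Resource; the parent is unchanged.
  forEntity(entities: Entity[]): Provider {
    const merged = mergeEntities(this.resource, entities);
    const child = new Provider(merged, this.config, new NonClosingPipeline(this.pipeline));
    this.children.push(child);
    return child;
  }

  emit(record: string): void {
    if (this.config.enabled) this.pipeline.export(this.resource, record);
  }

  forceFlush(): void {
    this.children.forEach((child) => child.forceFlush());
    this.pipeline.forceFlush();
  }

  // On a bound provider the proxy turns the final call into a flush.
  shutdown(): void {
    this.children.forEach((child) => child.forceFlush());
    this.pipeline.shutdown();
  }
}

const pipeline = new RecordingPipeline();
const sdkConfig: Config = { enabled: true };
const serviceEntity: Entity = { type: "service", id: { "service.name": "shop" } };
const root = new Provider({ entities: [serviceEntity] }, sdkConfig, pipeline);
const bound = root.forEntity([{ type: "session", id: { "session.id": "s1" } }]);
root.emit("from-parent"); // exported with the original Resource
bound.emit("from-child"); // exported with the service and session entities
bound.shutdown(); // flushes, but the shared pipeline stays open
```

In this sketch the parent holds a reference to every child it creates, so short-lived bound providers would accumulate unless something releases them.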
The primary trade-offs here are around "breaking changes" and subtle differences in generated telemetry between code that leverages Entity and code that does not. We need to give consumers of OTLP (vendors, databases, collector) time to support the new semantics of Entity before data shows up that cannot be interpreted correctly without understanding these semantics. As such, primarily:
- `Entity`, as defined in OTLP, is an opt-in construct. `Resource` should be usable as an identity independent of `Entity`.
- Consumers should now expect SDKs reporting multiple resources in the same batch. Theoretically, this SHOULD already be supported due to how OTLP is designed to allow aggregation / batching of data at any point.
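To make the "multiple resources in the same batch" point concrete, here is a rough sketch of grouping data points by Resource before export, loosely following OTLP's layout in which each top-level entry of a request pairs one Resource with its data. The types, the `groupByResource` helper, and the `tenant.id` attribute are hypothetical and exist only for illustration.

```typescript
// Hypothetical shapes, loosely modelled on OTLP's Resource-scoped grouping.
type Attributes = Record<string, string>;
type Resource = { attributes: Attributes };
type Point = { resource: Resource; metric: string; value: number };
type ResourceGroup = {
  resource: Resource;
  points: Array<{ metric: string; value: number }>;
};

// Stable key for a Resource: attribute order must not matter.
function resourceKey(resource: Resource): string {
  const entries = Object.keys(resource.attributes)
    .sort()
    .map((key) => [key, resource.attributes[key]]);
  return JSON.stringify(entries);
}

// Produces one group per distinct Resource, in first-seen order.
function groupByResource(points: Point[]): ResourceGroup[] {
  const groups = new Map<string, ResourceGroup>();
  for (const point of points) {
    const key = resourceKey(point.resource);
    let group = groups.get(key);
    if (group === undefined) {
      group = { resource: point.resource, points: [] };
      groups.set(key, group);
    }
    group.points.push({ metric: point.metric, value: point.value });
  }
  return Array.from(groups.values());
}

// "tenant.id" is a made-up attribute used only for this example.
const tenantA: Resource = { attributes: { "service.name": "api", "tenant.id": "a" } };
const tenantB: Resource = { attributes: { "service.name": "api", "tenant.id": "b" } };
const batch = groupByResource([
  { resource: tenantA, metric: "requests", value: 3 },
  { resource: tenantB, metric: "requests", value: 5 },
  { resource: tenantA, metric: "errors", value: 1 },
]);
```

A consumer that already iterates over every Resource entry in a request would handle such a batch without changes; one that assumes a single Resource per request from a given SDK would not.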
OpenCensus previously allowed contextual tags to be specified dynamically and used everywhere metric measurements were reported. Users were then required to select which of these were useful to them via the definition of "Views". OpenTelemetry has aimed for a simpler solution where every metric has an implicit View definition, and we leverage metric advice to allow sending more attributes than are naturally used when reporting the metric.
As called out in the description, OTEP 4316 proposes making resource fully mutable, which comes with its own set of tradeoffs.
Today, Semantic Conventions already defines Entity and uses it to group and
report Resource attributes cohesively. Additionally, Semantic Conventions only
model "entity associations", that is, requiring a signal (e.g. a metric, event
or span) to be attached to an entity. For example, the system.cpu.time metric
is expected to be associated with a host entity. This association makes no
assumption about whether that is through Resource or some other mechanism,
and can therefore be extended to support InstrumentationScope based entities.
Adding entity in InstrumentationScope has a lot of implications that must be resolved.
As seen in Issue #3062, systems observing multiple tenants need to ensure that tenants which are only observed briefly do not continue to consume resources (particularly memory) for long periods of time. There needs to be some level of control (either direct or implicit) in allowing new "Scope with Entity" to be created.
Should we consider this a failure or a feature?
We currently consider this a feature. Upon conflict, the new Entity would be
used in the resulting Resource reported for a new {SignalProvider}.
The SDK needs some form of stable identity for itself; however, when reporting telemetry, it may be recording data on behalf of some other system.
It's not clear how resource immutability is kept or what is meant by immutable. Will the resource emitted on the first export be the same as the one emitted for the entire lifetime of the process? Are descriptive attributes on entities attached to resource still allowed to change? What about attaching new entities to that resource?
For now:
- The set of entities reported on Resource becomes locked. All identifying attributes are also locked.
- Whether descriptive attributes should be allowed to change can be determined later and may evolve over time. Until the ecosystem around OTLP is leveraging the "identity" attributes of Entity for Resource, we should not allow mutation of descriptive attributes.
There should be no impact on collector components beyond those defined in OTEP 4316.
We will have clear guidance on the For Entity methods
- Java: https://github.com/open-telemetry/opentelemetry-java/compare/main...jsuereth:opentelemetry-java:wip-entity-and-providers
- TypeScript: open-telemetry/opentelemetry-js#5620
This proposal brings strong multi-tenant capabilities to the OpenTelemetry SDK. One possibility
is to improve the interaction between dynamic Context and signals, e.g. allowing
some level of interaction between Context and attributes / entities.
For example, rather than a lexical scope:

    const myMeterProvider = globalMeterProvider.forEntity(getCurrentSession())
    doSomething(myMeterProvider)

We could allow runtime scope:

    const ctx = api.context.active();
    // context.with takes the callback to run with the given context active.
    api.context.with(ctx.setValue(CURRENT_ENTITY_KEY, getCurrentSession()), () => {
      doSomething();
    });