---
title: "Inside Adobe's OpenTelemetry pipeline: simplicity at scale"
linkTitle: "Inside Adobe's OpenTelemetry pipeline: simplicity at scale"
date: 2026-04-08
author: >-
  [Johanna Öjeling](https://github.com/johannaojeling) (Grafana Labs), [Juliano
  Costa](https://github.com/julianocosta89) (Datadog), [Tristan
  Sloughter](https://github.com/tsloughter) (community), [Damien
  Mathieu](https://github.com/dmathieu) (Elastic), [Bogdan
  Stancu](https://github.com/bogdan-st) (Adobe)
sig: Developer Experience SIG
cSpell:ignore: devex Sloughter Öjeling
---

As part of an ongoing series, the Developer Experience SIG interviews
organizations about their real-world OpenTelemetry Collector deployments to
share practical lessons with the broader community. This post features Adobe, a
global software company whose observability team has built an
OpenTelemetry-based telemetry pipeline designed for simplicity at massive scale,
with thousands of collectors running per signal type across the company's
infrastructure.
## Organizational structure

Adobe's central observability team is responsible for providing observability
infrastructure across the company. However, as
[Bogdan Stancu](https://github.com/bogdan-st), Senior Software Engineer,
explained, Adobe's history of acquisitions means the landscape is not fully
consolidated. Some large product groups have their own dedicated observability
teams, while the central team serves as the primary provider.

The OpenTelemetry-based pipeline was introduced as a new option alongside
existing monitoring solutions, designed primarily for new applications and
deployments. Adoption is voluntary, not mandated. Existing applications with
established monitoring have not been migrated.
## OpenTelemetry adoption

The decision to adopt OpenTelemetry was driven by alignment between the
project's capabilities and the team's goals. The observability team needed a
solution that could serve Adobe's diverse technology landscape, support multiple
backends, and remain simple for service teams to adopt.

> "It matched everything that we wanted," Bogdan said.

The [OpenTelemetry Operator](/docs/platforms/kubernetes/operator/), the
Collector's component model, and community Helm charts provided the building
blocks for a platform-level observability offering that could scale without
requiring deep OpenTelemetry expertise from individual service teams.
## Architecture: a three-tier collector pipeline

Adobe's collector architecture follows a three-tier design: a user-facing Helm
chart containing two collectors, a centralized managed namespace with per-signal
collector deployments, and the observability backends.

![Adobe architecture diagram](adobe-architecture.png)
### Tier 1: the user Helm chart

The observability team provides a Helm chart that service teams deploy into
their own namespaces. This chart creates two collectors:

**Sidecar Collector (in the application pod)**: Runs alongside the application
container and is intentionally locked down. Service teams cannot modify its
configuration. It collects all telemetry (metrics, logs, and traces), regardless
of what the team has chosen to export downstream. The configuration is immutable
to prevent application restarts caused by configuration changes.

**Deployment Collector (standalone)**: Receives telemetry from the sidecar over
OTLP and handles routing and export. Unlike the sidecar, this collector _is_
configurable through Helm values. The observability team provides sensible
defaults, but service teams can customize exporters and add new destinations.
When configuration changes, only the deployment collector restarts; the
application pod and its sidecar remain untouched.
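The two-collector split can be sketched in the chart's values file. This is a
hypothetical sketch, not Adobe's actual schema; the key names, endpoint, and
`X-Backend` header are illustrative:

```yaml
# values.yaml (hypothetical schema; names are illustrative)
sidecarCollector: {} # locked down: no user-facing configuration knobs

deploymentCollector:
  # Service teams may override exporters and add destinations here.
  exporters:
    otlphttp:
      endpoint: https://otel-gateway.managed-namespace.svc.cluster.local:4318
      headers:
        # Header read downstream to select the backend (see Tier 2)
        X-Backend: backend-a
```

Because only the deployment collector's values are user-editable, a
`helm upgrade` rolls the standalone collector while the application pods, and
the sidecars inside them, keep running.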
### Tier 2: the managed namespace

The deployment collectors forward telemetry to a centralized namespace managed
entirely by the observability team. A key architectural decision here is
signal-level isolation: the managed namespace runs a separate collector
deployment for each telemetry type, one each for metrics, logs, and traces.

If a backend becomes rate-limited or starts rejecting data for one signal type,
the others continue flowing uninterrupted. Despite handling thousands of
collectors' worth of upstream traffic, these managed deployments have generally
operated at default replica counts without requiring aggressive auto-scaling.

Service teams configure their desired backend through Helm values, which sets an
HTTP header on OTLP exports. The managed namespace collectors use this header
with the
[routing connector](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/6aff35ab5351482a4664f29a7d5428cedcf61a92/connector/routingconnector?from_branch=main)
to direct telemetry to the correct exporter.
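A minimal sketch of header-based routing in a managed namespace collector,
assuming the upstream sets an illustrative `X-Backend` header (backend names,
endpoints, and pipeline names are made up):

```yaml
receivers:
  otlp:
    protocols:
      http:
        include_metadata: true # keep HTTP headers available to the router

exporters:
  otlphttp/backend-a:
    endpoint: https://backend-a.example.com:4318
  otlphttp/backend-b:
    endpoint: https://backend-b.example.com:4318

connectors:
  routing:
    default_pipelines: [metrics/backend-a]
    table:
      # Match on request metadata (the HTTP header set via Helm values)
      - context: request
        condition: request["X-Backend"] == "backend-b"
        pipelines: [metrics/backend-b]

service:
  pipelines:
    metrics/in:
      receivers: [otlp]
      exporters: [routing]
    metrics/backend-a:
      receivers: [routing]
      exporters: [otlphttp/backend-a]
    metrics/backend-b:
      receivers: [routing]
      exporters: [otlphttp/backend-b]
```

Note that `include_metadata: true` on the OTLP receiver is what makes request
headers visible to the routing connector's `request` context.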
### Tier 3: the observability backends

The managed namespace collectors export telemetry to backend destinations
managed by the observability team. Multiple backends are supported, and teams
select their destination through the Helm chart's values file.
## Auto-instrumentation: two lines and it works

Adobe leverages the OpenTelemetry Operator for auto-instrumentation across the
languages supported by OpenTelemetry. The Operator is deployed to every cluster,
and service teams enable instrumentation by adding two annotations to their
Kubernetes deployment manifests:

```yaml
instrumentation.opentelemetry.io/inject-java: 'true'
sidecar.opentelemetry.io/inject: 'true'
```

> "People add two lines in their deployment. And it just works," Bogdan said.

Teams select their language in the Helm values, and the Operator handles the
rest. While teams are free to add manual SDK instrumentation (the sidecar
accepts all OTLP data), the observability team's supported path focuses on the
auto-instrumentation experience. The Operator has handled the scale of managing
sidecars and auto-instrumentation across the deployment fleet without issues.

This design philosophy runs through the entire platform: make the default path
require as little effort as possible, while leaving the door open for advanced
use cases.
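In a deployment manifest, those annotations belong on the pod template, where
the Operator's admission webhook picks them up at pod creation. A sketch with
illustrative names and image:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-java-service # illustrative
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-java-service
  template:
    metadata:
      labels:
        app: my-java-service
      annotations:
        # Inject the Java auto-instrumentation agent
        instrumentation.opentelemetry.io/inject-java: 'true'
        # Inject the sidecar collector into the pod
        sidecar.opentelemetry.io/inject: 'true'
    spec:
      containers:
        - name: app
          image: example/my-java-service:1.0 # illustrative
```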
## Custom distribution and components

Adobe builds its own OpenTelemetry Collector distribution to include only the
components they use, avoiding unnecessary dependencies from Contrib. This custom
distribution is the default in the Helm chart provided to service teams.
However, teams can manually switch to the Contrib distribution if they need
components not included in the custom build.

Adobe also maintains custom components, most notably an extension addressing a
fundamental challenge in their chained collector architecture.
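One common way to build such a trimmed-down distribution is the OpenTelemetry
Collector Builder (`ocb`), which compiles a binary from a manifest listing only
the required components. A sketch of a builder manifest; the component set and
versions are illustrative, not Adobe's actual selection:

```yaml
# builder-config.yaml: input to `ocb --config builder-config.yaml`
dist:
  name: otelcol-custom
  description: Custom Collector with only the components in use
  output_path: ./otelcol-custom

receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.116.0
processors:
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.116.0
exporters:
  - gomod: go.opentelemetry.io/collector/exporter/otlphttpexporter v0.116.0
connectors:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/connector/routingconnector v0.116.0
```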
### The chain collector problem

When collectors are chained, error visibility becomes a problem. The OTLP
transaction between the user's deployment collector and the managed namespace
collector completes with a 200 response _before_ the managed namespace collector
attempts to export to the backend. If the backend rejects the data, the error is
only visible in the managed namespace collector's logs.

> "The user would just see 200s. Metrics exported, all good," Bogdan explained.
> "Which we didn't want."

To address this, Bogdan built a custom extension that acts as a circuit breaker
for backend authentication. The extension runs in the managed namespace
collector's receiver, proactively sending mock authentication requests to the
backend and caching the results. If authentication fails, it returns a 401 to
the upstream collector before the OTLP transaction completes, propagating the
error back to where users can see it.

Building this extension was one of Bogdan's first Go projects. The experience of
trying to contribute it upstream sparked deeper involvement with the
OpenTelemetry community. Looking ahead, Bogdan would welcome a more general
back-pressure mechanism in the Collector, where exporter failures propagate
upstream through chained collectors.
## Deployment and lifecycle management

The observability team upgrades their collector distribution and the
OpenTelemetry Operator on a quarterly cadence. Upgrade issues have been rare.

When the Helm chart is updated, service teams pick up the new collector version
on their next deployment. However, the observability team has encountered a
compatibility challenge between the Operator and older collector versions: when
the Operator is upgraded, it can modify the `OpenTelemetryCollector` custom
resource to align with new configuration expectations. If a service team is
running a significantly older collector version, these changes can be
incompatible, preventing collectors from starting.

The resolution is straightforward (upgrading the collector fixes the issue), but
it has caused confusion for teams whose collectors suddenly break without any
changes on their end.
### Navigating component deprecations

Adobe's deployment has also navigated component deprecations as OpenTelemetry
evolves. The team originally used the routing processor to direct telemetry to
different backends based on HTTP headers, but migrated to the routing connector
when the processor was deprecated.

While the migration required work, the team views this as an expected part of
working with a rapidly evolving project.

> "This is a risk we knew about, the whole OpenTelemetry landscape is changing
> constantly and the benefits outweigh the 'issues' if you can call fast
> development an issue," Bogdan explained.
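The shape of that migration, in an illustrative sketch (header, exporter, and
pipeline names are made up): the deprecated routing processor mapped a context
value straight to exporters, while the routing connector routes between named
pipelines instead.

```yaml
# Before: routing processor (deprecated)
processors:
  routing:
    attribute_source: context
    from_attribute: X-Backend
    table:
      - value: backend-a
        exporters: [otlphttp/backend-a]

# After: routing connector
connectors:
  routing:
    table:
      - context: request
        condition: request["X-Backend"] == "backend-a"
        pipelines: [metrics/backend-a]
```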
## What works well

The overall experience has been positive. The Collector's component model, the
auto-instrumentation experience via the Operator, and the Helm chart-based
deployment model have all worked reliably. The plug-and-play nature of the
platform, where teams go from zero to full observability with minimal
configuration, has been well received by adopting teams.
## Advice for others

Based on Adobe's experience building a platform-level observability pipeline:

- **Treat OpenTelemetry as a platform to build on**: Don't expect it to solve
  all your problems out of the box. It's designed to be extended and customized
  for your specific needs.
- **Don't be afraid to build custom components**: The Collector's architecture
  makes it straightforward to build extensions tailored to your needs.
- **Design for user simplicity**: Make the default path require minimal effort.
  The teams consuming your platform are not observability experts.
- **Plan for error visibility in chained collectors**: OTLP transaction success
  does not guarantee end-to-end delivery. Consider how errors will surface to
  users.
## What's next

Adobe's story illustrates how a central observability team can offer a scalable,
self-service OpenTelemetry pipeline across a large and diverse organization. By
combining the Operator, Helm charts, sidecars, and per-signal collector
deployments, they've created a platform where service teams get observability
with minimal effort, while the observability team retains control over the
centralized infrastructure.

We'll continue sharing stories like this one, highlighting how different
organizations tackle the challenges of running OpenTelemetry in production.

Have your own OpenTelemetry story to share? Join us in the CNCF
[#otel-devex](https://cloud-native.slack.com/archives/C01S42U83B2) Slack
channel. We'd love to hear how you're using OpenTelemetry and how we can keep
improving the developer experience together.