Monitoring needs for Altinn Studio as a product and internal runtime services #2614
Description
More and more apps and services are being deployed in Altinn Studio, and we are still very immature on the operational
aspects of the platform, both internally and in terms of capabilities offered to service owners.
There have been discussions on this topic previously, with mixed results. Hopefully, now that we are formalizing this a bit, we can get:
- Clarity on what monitoring capabilities are/will be offered in the container runtime product
- Timelines for planned work
For the rest of this issue we will try to describe what kind of capabilities we need and how we would like to work.
We have two categories of services and monitoring needs on team Studio for the deployments running on container runtimes:
- Apps - we share the runtime with the service owner developers, and are responsible for a subset of the code and telemetry emitted here
- Runtime services - wholly owned by our team
In general, some telemetry across these services should be centralized to enable:
- Visualization and alerting globally across service owners
- Proactive monitoring: we want to discover issues before service owners do
- Deepened understanding of the reliability and performance of our services (i.e. what gets measured gets improved. Improvements in software design and architecture over time)
- "Data-driven development": better understanding of how the product and its features are used, enabling better product decisions long term
- Onboarding for 24/7 monitoring as part of the wider Altinn 3 platform
TL;DR: from "ship it and forget it" to "ship it and own it". This is an important transition as our roadmap steadily shifts from a feature/migration focus to a quality focus.
Let's also describe the differences between the two categories mentioned above.
Runtime services
Runtime services are applications like
- PDF3
- Studio Gateway (integration layer between the Studio "control plane"/designer and the container runtimes Studio support as a product)
- Operator (controllers that reconcile infra)
For example, the service owner monitoring solution will contain telemetry related to the abstractions exposed in the app library (e.g. IPdfGeneratorClient), but none of the telemetry emitted by the underlying service/implementation (what we call PDF3) should be available in their monitoring solution; that telemetry is only centralized, across all signals.
Important characteristics
- OTel protocol and (push) export to OTel collector or equivalent agent
- Subset of telemetry can be emitted to service owner monitoring solution (simple up/down kind of metrics, or the built-in kube_ stuff etc is fine)
- Flexible/self-service sampling/configuration
- Enrichment based on source environment (serviceowner, environment etc)
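The characteristics above could be sketched as an OTel Collector configuration. This is a minimal, illustrative fragment, not our actual setup: the attribute values, endpoint, and pipeline shape are assumptions, and the enrichment values would in practice be templated per cluster/environment.

```yaml
# Sketch: collector config for runtime services (illustrative values only).
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # apps push via OTLP to a cluster-local address

processors:
  # Enrichment based on source environment, applied before export
  resource:
    attributes:
      - key: serviceowner
        value: "ttd"             # placeholder; would be templated per cluster
        action: upsert
      - key: environment
        value: "tt02"            # placeholder
        action: upsert
  batch: {}

exporters:
  otlphttp:
    endpoint: https://central-collector.internal.example   # hypothetical address

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlphttp]
```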
Our workflow:
- OTel SDK instrumentation (or BCL in the case of .NET) for logs, traces, metrics
- Configure SDK for
- OTLP export to cluster local address
- Configure sampling, filtering, enrichment, MetricReader interval etc. based on application needs
- Define dashboards in Grafana UI (shared/common instance)
- Logs and traces in Azure Monitor datasource (unless something else becomes available), metrics in Prometheus
- Define alerts in Grafana UI (shared/common instance)
- Alerts to slack? Shared slack channels?
- The state, performance and health of our runtime services can be analyzed through a set of dashboards in the shared/common Grafana instance
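The SDK side of the workflow above could be configured declaratively via the standard OTel environment variables (these variable names are spec-defined; the values and collector address are illustrative assumptions). A sketch as a Kubernetes Deployment `env` fragment:

```yaml
# Sketch: configuring the OTel SDK of a runtime service via standard env vars.
# Endpoint and values are placeholders, not our actual configuration.
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector.monitoring.svc:4317"   # cluster-local collector
  - name: OTEL_TRACES_SAMPLER
    value: "parentbased_always_on"                       # sampling per app needs
  - name: OTEL_METRIC_EXPORT_INTERVAL
    value: "30000"                                       # MetricReader interval, ms
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "service.namespace=studio-runtime"            # illustrative enrichment
```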
Apps
Apps are a little more difficult to handle here as the runtime is shared between us and the service owner, and most of the telemetry
is useful/relevant for both us and the service owners.
As previously discussed, it would be prohibitively expensive/wasteful to store the full set of telemetry both centrally and in the service owner monitoring solutions.
We would like to centralize only a subset of the metrics (you have already proposed and started implementing this for the Prometheus scraping).
Important characteristics
- Self-service (for us, not service owners) selection of which metrics to centralize (some configuration we control)
- An agent that speaks OTLP so we can migrate from Azure Monitor Exporter SDK distribution in .NET to standard OTLP export
- AlwaysOn/ParentBased sampling in OTel SDK, "intelligent" sampling in the agent
- Enrichment based on source environment (serviceowner, environment etc), for the metrics that are centralized
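The self-service selection and enrichment described above could look like the following collector processor fragment. This is a sketch under assumptions: the metric names are examples (one is a hypothetical custom metric), and the enrichment value would come from cluster configuration, not be hardcoded.

```yaml
# Sketch: processors on the apps-side collector (illustrative names/values).
processors:
  # Only metrics on this include list are forwarded for centralization;
  # the list is configuration we control (self-service for us).
  filter/centralize:
    metrics:
      include:
        match_type: strict
        metric_names:
          - http.server.request.duration
          - altinn.app.instances.created   # hypothetical custom metric
  # Enrichment based on source environment, for the centralized metrics only
  attributes/enrich:
    actions:
      - key: serviceowner
        value: "ttd"                       # placeholder; templated per cluster
        action: upsert
      - key: environment
        value: "tt02"                      # placeholder
        action: upsert
```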
Our workflow:
- OTel SDK instrumentation and configuration in app backend library
- Switch to OTLP export (as opposed to the Azure Monitor distribution) when the app is configured with `UseOpenTelemetry`
- When OTel SDK/spec for browser is stabilized, we should implement that as well
- Configure and experiment with which metrics to centralize
- Define dashboards in Grafana UI (shared/common instance) based on centralized Prometheus metrics
- Define alerts in Grafana UI (shared/common instance) based on centralized Prometheus metrics
- When deeper analysis is needed (logs, traces), investigate using service owner Grafana/App Insights depending on issue
Service owner workflow:
- Configure OTel SDK as needed based on app needs (enrichment, filtering etc)
- Instrument using BCL abstractions as needed (which are mapped to OTel)
- Most users should have visualization and alerting needs covered out of the box (resources provisioned by Studio as a control plane)
- Grafana UI for advanced needs
Example design
Architecture for a single environment, e.g. tt02 (I don't think we need centralization across environments)
┌───────────────────────────────────┐
┌────────────►│ Service Owner Azure Monitor / │
│ ┌───►│ Prometheus (full telemetry) │
│ │ └───────────────────────────────────┘
│ │
┌─────────┐ ┌────────────┴───┐ │ ┌───────────────────────┐ ┌──────────────────────────┐
│ App │─────►│ OTel Collector │ │ │ OTel Collector │ │ Our centralized Azure │
│ (OTel) │ │ (Apps) │────┼───►│ (Centralization) │───►│ Monitor / Prometheus │
└─────────┘ └────────────────┘ │ └───────────────────────┘ └──────────────────────────┘
│ filtering: select dashboards, alerts
│ metrics for across all service owners
│ centralization ▲
│ │
┌─────────────────┐ ┌───────────────┴────────┐ │
│ Runtime Service │───►│ OTel Collector │────────────────────────────┘
│ (OTel) │ │ (Runtime Services) │
└─────────────────┘ └────────────────────────┘
all signals centralized,
some also to service owner
In this scenario, we would have control of the collectors for centralization and runtime services,
but depending on what we do for centralized storage, credentials would have to come from somewhere else.
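The apps-side collector in the diagram could be expressed as a two-pipeline collector config: full telemetry to the service owner backend, and only the filtered, enriched subset forwarded for centralization. A sketch, where exporter names and endpoints are placeholders and the `filter/centralize` and `attributes/enrich` processors stand in for whatever selection/enrichment we end up controlling:

```yaml
# Sketch of the "OTel Collector (Apps)" box in the diagram (placeholders only).
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}
  filter/centralize: {}      # include list of metrics to centralize (see above)
  attributes/enrich: {}      # serviceowner/environment enrichment

exporters:
  prometheusremotewrite/serviceowner:
    endpoint: https://serviceowner-prometheus.example/api/v1/write   # hypothetical
  otlphttp/central:
    endpoint: https://central-collector.internal.example             # hypothetical

service:
  pipelines:
    # Full metrics to the service owner monitoring solution
    metrics/serviceowner:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite/serviceowner]
    # Filtered + enriched subset onward to the centralization collector
    metrics/central:
      receivers: [otlp]
      processors: [filter/centralize, attributes/enrich, batch]
      exporters: [otlphttp/central]
```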
Alternatives
- Build on top of existing/planned features (just saw add otel collector to apps clusters #1454 being completed - unclear what this brings/enables)
- Build our own infra
- We manage an OTel collector/agent through syncroot or other means
- We provision our own monitoring solution (maybe LGTM or similar) in our own infrastructure
- Mix of both
Edits
- Updated diagram to reflect "Subset of telemetry can be emitted to service owner monitoring solution"