
Commit 1cca704

jmagak authored and GitHub Actions committed
log aggregation and observability for SonataFlow
1 parent cdd11ec commit 1cca704

7 files changed, 106 insertions(+), 82 deletions(-)

modules/extend_orchestrator-in-rhdh/proc-aggregate-logs-using-the-plg-stack.adoc

Lines changed: 27 additions & 20 deletions
@@ -4,17 +4,19 @@
 = Aggregate logs using the Promtail, Loki and Grafana (PLG) stack
 
 [role="_abstract"]
-Deploy and configure a Promtail sidecar to scrape workflow logs and push them to a Loki instance for long-term storage and visualization in Grafana.
+Deploy and configure a Promtail sidecar to scrape workflow logs and push them to a Loki instance for storage and visualization in Grafana.
 
 .Prerequisites
 
-* You have a running Loki and Grafana instance.
+* You have running Loki and Grafana instances in the cluster.
 
 * You have configured workflow for file-based JSON logging.
 
+* You have `cluster-admin` permissions.
+
 .Procedure
 
-. For a quick start, deploy the PLG stack using Helm:
+. Deploy the PLG stack using Helm:
 +
 [source,bash]
 ----
@@ -33,17 +35,18 @@ helm install loki-stack grafana/loki-stack \
   --set promtail.config.logLevel=info \
   --set grafana.enabled=true
 ----
++
+[NOTE]
+====
+For production deployments, use a custom `values.yaml` file with appropriate resource limits and security contexts.
+====
 
-. For production deployment, use the complete Helm values configuration, `../observability/helm-values/` with proper resource limits, security contexts, and OpenShift-specific settings.
-
-. Create a ConfigMap for the Promtail sidecar to parse the JSON logs. You can choose between the following options:
-
-* Scrape container stdout (default)
-* Custom JSON log files
+. Create a ConfigMap for the Promtail sidecar. Select the configuration that matches your logging method:
 
-Scrape Container Stdout (Default)::
 
-This configuration uses Kubernetes service discovery to collect logs from container stdout:
+... Scrape Container Stdout
++
+Use this configuration to collect logs from container stdout using Kubernetes service discovery:
 +
 [source,yaml]
 ----
@@ -96,9 +99,9 @@ data:
     traceId:
 ----
 
-Scrape JSON log files::
-
-When using `[file-based JSON logging](#file-based-json-logging)`, configure Promtail as a sidecar to read from the shared log volume:
+... Scrape JSON log files
++
+If you use `[file-based JSON logging](#file-based-json-logging)`, configure Promtail to read from the shared log volume:
 +
 [source,yaml]
 ----
@@ -194,9 +197,9 @@ spec:
   - name: positions
     emptyDir: {}
 ----
-
-The following are query examples:
-
++
+. Querying logs in Grafana: After deploying the stack, use the following LogQL queries in the Grafana **Explore** view:
++
 .. Filter logs by process instance
 +
 [source,json,subs="+attributes,+quotes"]
@@ -227,9 +230,13 @@ The following are query examples:
 
 .Verification
 
-* In the Grafana *Explore* view, run the following LogQL query to find logs for a specific workflow instance:
+* Access the Grafana **Explore** view.
+
+* Run the following LogQL query, replacing `<instance_id>` with a valid ID:
 +
 [source,json,subs="+attributes,+quotes"]
 ----
-{job="sonataflow-workflows"} | json | processInstanceId="YOUR_INSTANCE_ID"
-----
+{job="sonataflow-workflows"} | json | processInstanceId="<instance_id>"
+----
++
+Expected result: Grafana displays the log entries associated with the specified process instance.
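The LogQL queries in this file depend on Promtail's `json` stage extracting fields such as `processInstanceId` from each log line before the label filter is applied. A minimal Python sketch of that parse-then-filter behavior, using illustrative log lines (the field layout mirrors the `mdc` block mentioned later in the commit, but the sample values are invented):

```python
import json

# Hypothetical JSON log lines, shaped like Quarkus JSON console output.
log_lines = [
    '{"timestamp": "2024-01-01T12:00:00Z", "level": "INFO", '
    '"message": "Workflow started", "mdc": {"processInstanceId": "abc-123"}}',
    '{"timestamp": "2024-01-01T12:00:01Z", "level": "ERROR", '
    '"message": "Task failed", "mdc": {"processInstanceId": "def-456"}}',
]

def filter_by_instance(lines, instance_id):
    """Mimic `| json | processInstanceId="<instance_id>"`: parse every
    line as JSON and keep entries whose extracted field matches."""
    matches = []
    for line in lines:
        entry = json.loads(line)
        if entry.get("mdc", {}).get("processInstanceId") == instance_id:
            matches.append(entry)
    return matches

print(len(filter_by_instance(log_lines, "abc-123")))  # → 1
```

This is only an analogy for how the query narrows results; in Loki the parsing happens server-side at query time.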

modules/extend_orchestrator-in-rhdh/proc-configure-alerts-for-workflow-health.adoc

Lines changed: 13 additions & 9 deletions
@@ -1,20 +1,22 @@
 :_mod-docs-content-type: PROCEDURE
 
 [id="configure-alerts-for-workflow-conditions_{context}"]
-= Configure alerts for critical workflow conditions
+= Configure alerts for workflow conditions
 
 [role="_abstract"]
-Configure alerts to monitor your SonataFlow workflows. These alerts notify you when workflows fail at high rates or when specific process instances are stuck or exceed expected runtimes.
+Configure alerts to monitor SonataFlow workflows. These alerts notify you when workflows fail at high rates, when process instances are stuck, or when runtimes exceed expected thresholds.
 
 .Prerequisites
 
-* You have enabled a structured JSON logging to provide the necessary metadata for LogQL/PromQL queries.
+* You have enabled structured JSON logging to provide metadata for LogQL and PromQL queries.
+
+* You have installed a monitoring stack, such as Prometheus or Loki with Alertmanager, in the cluster.
 
 .Procedure
 
-. Update your configuration with the following rule groups:
+. Create a configuration file containing the following alert rule groups based on your monitoring requirements:
 
-* To monitor failure rates:
+** To monitor failure rates:
 +
 [source,yaml,subs="+quotes,+attributes"]
 ----
@@ -27,7 +29,7 @@ Configure alerts to monitor your SonataFlow workflows. These alerts notify you w
 summary: "High error rate in SonataFlow workflows"
 ----
 
-* To identify stuck process instances:
+** To identify stuck process instances:
 +
 [source,yaml,subs="+quotes,+attributes"]
 ----
@@ -40,7 +42,7 @@ Configure alerts to monitor your SonataFlow workflows. These alerts notify you w
 severity: critical
 ----
 
-* For long-running processes:
+** To identify workflows running for longer durations:
 +
 [source,yaml,subs="+quotes,+attributes"]
 ----
@@ -55,8 +57,10 @@ Configure alerts to monitor your SonataFlow workflows. These alerts notify you w
 summary: "Workflow {{ $labels.process_instance_id }} running longer than 2 hours"
 ----
 
-. Apply the configuration to your cluster.
+. Apply the alert rules to your cluster.
 
 .Verification
 
-* Verify that the alerts appear under the *Alerts* tab.
+* Access the monitoring dashboard, such as the Prometheus console or the OpenShift console.
+
+* Verify that the alerts appear in the list under the *Alerts* tab.
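The failure-rate rule above fires when the ratio of errors to total executions over the evaluation window crosses a threshold. A minimal sketch of that condition, assuming an illustrative 10% threshold (the commit does not show the actual expression or threshold):

```python
# Sketch of the failure-rate alert condition: fire when the observed
# error ratio over the evaluation window exceeds a threshold.
# The 10% default and the sample counts below are illustrative.

def high_error_rate(error_count, total_count, threshold=0.10):
    """Return True when the observed error ratio exceeds the threshold."""
    if total_count == 0:
        # No executions in the window: nothing to alert on.
        return False
    return error_count / total_count > threshold

print(high_error_rate(15, 100))  # → True: 15% exceeds the 10% threshold
print(high_error_rate(5, 100))   # → False: 5% stays below it
```

In a real deployment this comparison is expressed as a PromQL or LogQL `rate()` expression evaluated by the monitoring stack, not application code.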

modules/extend_orchestrator-in-rhdh/proc-configure-file-based-json-logging-and-log-rotation.adoc

Lines changed: 16 additions & 9 deletions
@@ -4,17 +4,22 @@
 = Configure file-based JSON logging and log rotation
 
 [role="_abstract"]
-Configure your workflow to emit JSON logs to a file with automatic rotation to support sidecar log collection and prevent disk space issues.
+Configure your workflow to emit JSON logs to a file with automatic rotation. File-based logging enables sidecar log collection and prevents disk space issues in Kubernetes environments.
+
+[IMPORTANT]
+====
+When using file-based logging in Kubernetes, mount the log directory to a volume to prevent data loss or pod instability.
+====
 
 .Prerequisites
 
-* You have configured a shared Kubernetes volume in your `SonataFlow` custom resource.
+* You have configured a shared Kubernetes volume in the `SonataFlow` custom resource.
 
 * Your workflow image includes the JSON logging extension.
 
 .Procedure
 
-. Add the following properties to your workflow ConfigMap to enable file-based JSON output:
+. Add the following properties to the workflow ConfigMap to enable file-based JSON output:
 +
 [source,bash]
 ----
@@ -46,7 +51,7 @@ This configuration does the following:
 quarkus.log.file.level=INFO
 ----
 
-. Update your `SonataFlow` custom resource to mount a volume at the log path:
+. Update the `SonataFlow` custom resource (CR) to mount the volume at the log path:
 +
 [source,bash]
 ----
@@ -62,11 +67,6 @@ spec:
     sizeLimit: 500Mi
 ----
 
-[IMPORTANT]
-====
-If you use file-based logging in Kubernetes, make sure that you mount the log directory.
-====
-
 . After applying the configuration, restart your workflow pod and check the log output:
 +
 [source,bash]
@@ -85,4 +85,11 @@ oc logs -n sonataflow-infra your-workflow-pod-name | head -5
 [source,bash,subs="+attributes,+quotes"]
 ----
 oc exec <pod_name> -- ls -l /var/log/sonataflow/workflow.log
+----
+
+* Verify that the file contains JSON data:
++
+[source,bash,subs="+attributes,+quotes"]
+----
+oc exec <pod_name> -- head -n 5 /var/log/sonataflow/workflow.log
 ----
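The size-capped rotation that the Quarkus `quarkus.log.file.*` properties configure can be sketched with Python's standard library; the file path, size limit, and backup count below are illustrative, not values from the commit:

```python
import json
import logging
import logging.handlers
import os
import tempfile

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line, similar in spirit to the
    file-based JSON output configured for the workflow."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        })

# Illustrative path; the docs use /var/log/sonataflow/workflow.log.
log_path = os.path.join(tempfile.mkdtemp(), "workflow.log")

# Rotate when the file reaches ~10 KB, keeping 3 rotated backups.
handler = logging.handlers.RotatingFileHandler(
    log_path, maxBytes=10_000, backupCount=3)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("workflow")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Workflow started")
handler.flush()

# Mirror the verification step: the first line should parse as JSON.
with open(log_path) as f:
    first = json.loads(f.readline())
print(first["message"])  # → Workflow started
```

The rotation keeps total disk use bounded at roughly `maxBytes * (backupCount + 1)`, which is the same disk-space concern the `sizeLimit: 500Mi` volume setting addresses at the pod level.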

modules/extend_orchestrator-in-rhdh/proc-correlate-logs-with-opentelemetry-traces.adoc

Lines changed: 8 additions & 6 deletions
@@ -4,11 +4,11 @@
 = Correlate logs with OpenTelemetry traces
 
 [role="_abstract"]
-Integrate OpenTelemetry (OTEL) with your workflow logging to provide end-to-end visibility. This adds `traceId` and `spanId` to your JSON logs, allowing you to navigate from a log entry directly to a distributed trace in your observability tool.
+Integrate OpenTelemetry with your workflow logging to improve diagnostic capabilities. This integration adds `traceId` and `spanId` to your JSON logs, which enables you to navigate from a log entry to a distributed trace in an observability tool.
 
 .Prerequisites
 
-* You have deployed an OpenTelemetry-compliant collector (for example, Jaeger) in your cluster.
+* You have deployed an OpenTelemetry-compliant collector (for example, Jaeger) in the cluster.
 
 * You have set `quarkus.log.console.json.print-details=true` to `true`.
 
@@ -24,22 +24,24 @@ quarkus.otel.exporter.otlp.traces.endpoint=http://jaeger-collector:14268/api/tra
 quarkus.otel.service.name=${workflow.name}
 ----
 
-. Set the resource attributes to help filter traces in your dashboard:
+. Set the resource attributes to filter traces in your observability dashboard:
 +
 [source,yaml,subs="+attributes,+quotes"]
 ----
 quarkus.otel.resource.attributes=service.namespace=sonataflow-infra
 ----
 
-. Restart the workflow pod.
+. Restart the workflow pod to apply the new configuration.
 
 .Verification
 
-* Trigger a workflow execution and check the logs for trace identifiers:
+* Trigger a workflow execution.
+
+* Check the logs for trace identifiers:
 +
 [source,bash,subs="+attributes,+quotes"]
 ----
 oc logs <pod_name> | grep traceId
 ----
 
-* Make sure the `mdc` block in the JSON output now includes `traceId` and `spanId`.
+* Make sure the `mdc` block in the JSON output contains `traceId` and `spanId` fields with non-empty values.
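The verification step above checks that the `mdc` block carries non-empty trace identifiers. A minimal sketch of that check, using an invented log entry (the identifier values and message text are illustrative, not real output):

```python
import json

# A hypothetical JSON log entry after OpenTelemetry integration is enabled.
line = ('{"message": "state transition", '
        '"mdc": {"traceId": "4bf92f3577b34da6a3ce929d0e0e4736", '
        '"spanId": "00f067aa0ba902b7"}}')

def has_trace_context(raw_line):
    """Return True when the mdc block carries non-empty traceId and
    spanId fields, i.e. the log line can be correlated to a trace."""
    try:
        mdc = json.loads(raw_line).get("mdc", {})
    except json.JSONDecodeError:
        return False
    return bool(mdc.get("traceId")) and bool(mdc.get("spanId"))

print(has_trace_context(line))  # → True
```

An empty string or missing field fails the check, which matches the "non-empty values" requirement in the verification.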

modules/extend_orchestrator-in-rhdh/proc-integrate-workflows-with-external-systems.adoc

Lines changed: 7 additions & 5 deletions
@@ -4,15 +4,15 @@
 = Integrate workflows with external systems
 
 [role="_abstract"]
-Integrate SonataFlow workflows with external notification systems like Slack, PagerDuty or email. This ensures that the alerts generated are routed to the correct support teams.
+Integrate SonataFlow workflows with external notification systems, such as Slack, PagerDuty, or email. Routing alerts to external systems ensures that generated notifications reach the appropriate support teams.
 
 .Prerequisites
 
-* A valid webhook URL for your notification service (for example, Slack webhook).
+* You have a valid webhook URL for the notification service (for example, a Slack webhook).
 
 .Procedure
 
-. Edit your configuration to define a receiver and a routing path:
+. Define a receiver and a routing path in your Alertmanager configuration:
 +
 [source,yaml,subs="+quotes,+attributes"]
 ----
@@ -32,8 +32,10 @@ receivers:
 text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
 ----
 
-. Reload the configuration.
+. Reload the Alertmanager configuration to apply the changes.
 
 .Verification
 
-* Trigger a test alert and confirm the notification is received in the Slack channel or notification service.
+* Trigger a test alert in your workflow environment.
+
+* Monitor the external notification service (for example, the Slack channel `#workflow-alerts`). A notification appears in the external service containing the summary and details of the triggered alert.
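The receiver's `text` template, `'{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'`, concatenates the summary annotation of every alert in the webhook payload. A minimal Python sketch of what that template renders, with an illustrative alert payload:

```python
def render_slack_text(alerts):
    """Join the summary annotation of every firing alert, mirroring the
    Go template `{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}`."""
    return "".join(a["annotations"]["summary"] for a in alerts)

# Illustrative payload shaped like the alerts defined earlier in the commit.
alerts = [
    {"annotations": {"summary": "High error rate in SonataFlow workflows"}},
]
print(render_slack_text(alerts))  # → High error rate in SonataFlow workflows
```

Because the template joins summaries with no separator, grouped alerts run together; adding a delimiter inside the range block is a common refinement.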
