
Commit 1cca704

jmagak authored and GitHub Actions committed
log aggregation and observability for SonataFlow
1 parent cdd11ec commit 1cca704

7 files changed, 106 insertions(+), 82 deletions(-)

modules/extend_orchestrator-in-rhdh/proc-aggregate-logs-using-the-plg-stack.adoc

Lines changed: 27 additions & 20 deletions
@@ -4,17 +4,19 @@
 = Aggregate logs using the Promtail, Loki and Grafana (PLG) stack
 
 [role="_abstract"]
-Deploy and configure a Promtail sidecar to scrape workflow logs and push them to a Loki instance for long-term storage and visualization in Grafana.
+Deploy and configure a Promtail sidecar to scrape workflow logs and push them to a Loki instance for storage and visualization in Grafana.
 
 .Prerequisites
 
-* You have a running Loki and Grafana instance.
+* You have running Loki and Grafana instances in the cluster.
 
 * You have configured workflow for file-based JSON logging.
 
+* You have `cluster-admin` permissions.
+
 .Procedure
 
-. For a quick start, deploy the PLG stack using Helm:
+. Deploy the PLG stack using Helm:
 +
 [source,bash]
 ----
@@ -33,17 +35,18 @@ helm install loki-stack grafana/loki-stack \
   --set promtail.config.logLevel=info \
   --set grafana.enabled=true
 ----
++
+[NOTE]
+====
+For production deployments, use a custom `values.yaml` file with appropriate resource limits and security contexts.
+====
 
-. For production deployment, use the complete Helm values configuration, `../observability/helm-values/` with proper resource limits, security contexts, and OpenShift-specific settings.
-
-. Create a ConfigMap for the Promtail sidecar to parse the JSON logs. You can choose between the following options:
-
-* Scrape container stdout (default)
-* Custom JSON log files
+. Create a ConfigMap for the Promtail sidecar. Select the configuration that matches your logging method:
 
-Scrape Container Stdout (Default)::
 
-This configuration uses Kubernetes service discovery to collect logs from container stdout:
+... Scrape Container Stdout
++
+Use this configuration to collect logs from container stdout using Kubernetes service discovery:
 +
 [source,yaml]
 ----
@@ -96,9 +99,9 @@ data:
     traceId:
 ----
 
-Scrape JSON log files::
-
-When using `[file-based JSON logging](#file-based-json-logging)`, configure Promtail as a sidecar to read from the shared log volume:
+... Scrape JSON log files
++
+If you use `[file-based JSON logging](#file-based-json-logging)`, configure Promtail to read from the shared log volume:
 +
 [source,yaml]
 ----
@@ -194,9 +197,9 @@ spec:
   - name: positions
     emptyDir: {}
 ----
-
-The following are query examples:
-
++
+. Querying logs in Grafana: After deploying the stack, use the following LogQL queries in the Grafana **Explore** view:
++
 .. Filter logs by process instance
 +
 [source,json,subs="+attributes,+quotes"]
@@ -227,9 +230,13 @@ The following are query examples:
 
 .Verification
 
-* In the Grafana *Explore* view, run the following LogQL query to find logs for a specific workflow instance:
+* Access the Grafana **Explore** view.
+
+* Run the following LogQL query, replacing `<instance_id>` with a valid ID:
 +
 [source,json,subs="+attributes,+quotes"]
 ----
-{job="sonataflow-workflows"} | json | processInstanceId="YOUR_INSTANCE_ID"
-----
+{job="sonataflow-workflows"} | json | processInstanceId="<instance_id>"
+----
++
+Expected result: Grafana displays the log entries associated with the specified process instance.
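The LogQL queries in this file depend on Promtail's `json` stage extracting fields such as `processInstanceId` from each log line before the label filter is applied. A minimal Python sketch of that parse-then-filter behavior, using illustrative log lines (the field layout mirrors the `mdc` block mentioned later in the commit, but the sample values are invented):

```python
import json

# Hypothetical JSON log lines, shaped like Quarkus JSON console output.
log_lines = [
    '{"timestamp": "2024-01-01T12:00:00Z", "level": "INFO", '
    '"message": "Workflow started", "mdc": {"processInstanceId": "abc-123"}}',
    '{"timestamp": "2024-01-01T12:00:01Z", "level": "ERROR", '
    '"message": "Task failed", "mdc": {"processInstanceId": "def-456"}}',
]

def filter_by_instance(lines, instance_id):
    """Mimic `| json | processInstanceId="<instance_id>"`: parse every
    line as JSON and keep entries whose extracted field matches."""
    matches = []
    for line in lines:
        entry = json.loads(line)
        if entry.get("mdc", {}).get("processInstanceId") == instance_id:
            matches.append(entry)
    return matches

print(len(filter_by_instance(log_lines, "abc-123")))  # → 1
```

This is only an analogy for how the query narrows results; in Loki the parsing happens server-side at query time.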

modules/extend_orchestrator-in-rhdh/proc-configure-alerts-for-workflow-health.adoc

Lines changed: 13 additions & 9 deletions
@@ -1,20 +1,22 @@
 :_mod-docs-content-type: PROCEDURE
 
 [id="configure-alerts-for-workflow-conditions_{context}"]
-= Configure alerts for critical workflow conditions
+= Configure alerts for workflow conditions
 
 [role="_abstract"]
-Configure alerts to monitor your SonataFlow workflows. These alerts notify you when workflows fail at high rates or when specific process instances are stuck or exceed expected runtimes.
+Configure alerts to monitor SonataFlow workflows. These alerts notify you when workflows fail at high rates, when process instances are stuck, or when runtimes exceed expected thresholds.
 
 .Prerequisites
 
-* You have enabled a structured JSON logging to provide the necessary metadata for LogQL/PromQL queries.
+* You have enabled structured JSON logging to provide metadata for LogQL and PromQL queries.
+
+* You have installed a monitoring stack, such as Prometheus or Loki with Alertmanager, in the cluster.
 
 .Procedure
 
-. Update your configuration with the following rule groups:
+. Create a configuration file containing the following alert rule groups based on your monitoring requirements:
 
-* To monitor failure rates:
+** To monitor failure rates:
 +
 [source,yaml,subs="+quotes,+attributes"]
 ----
@@ -27,7 +29,7 @@ Configure alerts to monitor your SonataFlow workflows. These alerts notify you w
 summary: "High error rate in SonataFlow workflows"
 ----
 
-* To identify stuck process instances:
+** To identify stuck process instances:
 +
 [source,yaml,subs="+quotes,+attributes"]
 ----
@@ -40,7 +42,7 @@ Configure alerts to monitor your SonataFlow workflows. These alerts notify you w
 severity: critical
 ----
 
-* For long-running processes:
+** To identify workflows running for longer durations:
 +
 [source,yaml,subs="+quotes,+attributes"]
 ----
@@ -55,8 +57,10 @@ Configure alerts to monitor your SonataFlow workflows. These alerts notify you w
 summary: "Workflow {{ $labels.process_instance_id }} running longer than 2 hours"
 ----
 
-. Apply the configuration to your cluster.
+. Apply the alert rules to your cluster.
 
 .Verification
 
-* Verify that the alerts appear under the *Alerts* tab.
+* Access the monitoring dashboard, such as the Prometheus console or the OpenShift console.
+
+* Verify that the alerts appear in the list under the *Alerts* tab.
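The failure-rate rule above fires when the ratio of errors to total executions over the evaluation window crosses a threshold. A minimal sketch of that condition, assuming an illustrative 10% threshold (the commit does not show the actual expression or threshold):

```python
# Sketch of the failure-rate alert condition: fire when the observed
# error ratio over the evaluation window exceeds a threshold.
# The 10% default and the sample counts below are illustrative.

def high_error_rate(error_count, total_count, threshold=0.10):
    """Return True when the observed error ratio exceeds the threshold."""
    if total_count == 0:
        # No executions in the window: nothing to alert on.
        return False
    return error_count / total_count > threshold

print(high_error_rate(15, 100))  # → True: 15% exceeds the 10% threshold
print(high_error_rate(5, 100))   # → False: 5% stays below it
```

In a real deployment this comparison is expressed as a PromQL or LogQL `rate()` expression evaluated by the monitoring stack, not application code.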

modules/extend_orchestrator-in-rhdh/proc-configure-file-based-json-logging-and-log-rotation.adoc

Lines changed: 16 additions & 9 deletions
@@ -4,17 +4,22 @@
 = Configure file-based JSON logging and log rotation
 
 [role="_abstract"]
-Configure your workflow to emit JSON logs to a file with automatic rotation to support sidecar log collection and prevent disk space issues.
+Configure your workflow to emit JSON logs to a file with automatic rotation. File-based logging enables sidecar log collection and prevents disk space issues in Kubernetes environments.
+
+[IMPORTANT]
+====
+When using file-based logging in Kubernetes, mount the log directory to a volume to prevent data loss or pod instability.
+====
 
 .Prerequisites
 
-* You have configured a shared Kubernetes volume in your `SonataFlow` custom resource.
+* You have configured a shared Kubernetes volume in the `SonataFlow` custom resource.
 
 * Your workflow image includes the JSON logging extension.
 
 .Procedure
 
-. Add the following properties to your workflow ConfigMap to enable file-based JSON output:
+. Add the following properties to the workflow ConfigMap to enable file-based JSON output:
 +
 [source,bash]
 ----
@@ -46,7 +51,7 @@ This configuration does the following:
 quarkus.log.file.level=INFO
 ----
 
-. Update your `SonataFlow` custom resource to mount a volume at the log path:
+. Update the `SonataFlow` custom resource (CR) to mount the volume at the log path:
 +
 [source,bash]
 ----
@@ -62,11 +67,6 @@ spec:
     sizeLimit: 500Mi
 ----
 
-[IMPORTANT]
-====
-If you use file-based logging in Kubernetes, make sure that you mount the log directory.
-====
-
 . After applying the configuration, restart your workflow pod and check the log output:
 +
 [source,bash]
@@ -85,4 +85,11 @@ oc logs -n sonataflow-infra your-workflow-pod-name | head -5
 [source,bash,subs="+attributes,+quotes"]
 ----
 oc exec <pod_name> -- ls -l /var/log/sonataflow/workflow.log
+----
+
+* Verify that the file contains JSON data:
++
+[source,bash,subs="+attributes,+quotes"]
+----
+oc exec <pod_name> -- head -n 5 /var/log/sonataflow/workflow.log
 ----
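The size-capped rotation that the Quarkus `quarkus.log.file.*` properties configure can be sketched with Python's standard library; the file path, size limit, and backup count below are illustrative, not values from the commit:

```python
import json
import logging
import logging.handlers
import os
import tempfile

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line, similar in spirit to the
    file-based JSON output configured for the workflow."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        })

# Illustrative path; the docs use /var/log/sonataflow/workflow.log.
log_path = os.path.join(tempfile.mkdtemp(), "workflow.log")

# Rotate when the file reaches ~10 KB, keeping 3 rotated backups.
handler = logging.handlers.RotatingFileHandler(
    log_path, maxBytes=10_000, backupCount=3)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("workflow")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Workflow started")
handler.flush()

# Mirror the verification step: the first line should parse as JSON.
with open(log_path) as f:
    first = json.loads(f.readline())
print(first["message"])  # → Workflow started
```

The rotation keeps total disk use bounded at roughly `maxBytes * (backupCount + 1)`, which is the same disk-space concern the `sizeLimit: 500Mi` volume setting addresses at the pod level.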

modules/extend_orchestrator-in-rhdh/proc-correlate-logs-with-opentelemetry-traces.adoc

Lines changed: 8 additions & 6 deletions
@@ -4,11 +4,11 @@
 = Correlate logs with OpenTelemetry traces
 
 [role="_abstract"]
-Integrate OpenTelemetry (OTEL) with your workflow logging to provide end-to-end visibility. This adds `traceId` and `spanId` to your JSON logs, allowing you to navigate from a log entry directly to a distributed trace in your observability tool.
+Integrate OpenTelemetry with your workflow logging to improve diagnostic capabilities. This integration adds `traceId` and `spanId` to your JSON logs, which enables you to navigate from a log entry to a distributed trace in an observability tool.
 
 .Prerequisites
 
-* You have deployed an OpenTelemetry-compliant collector (for example, Jaeger) in your cluster.
+* You have deployed an OpenTelemetry-compliant collector (for example, Jaeger) in the cluster.
 
 * You have set `quarkus.log.console.json.print-details=true` to `true`.
 
@@ -24,22 +24,24 @@ quarkus.otel.exporter.otlp.traces.endpoint=http://jaeger-collector:14268/api/tra
 quarkus.otel.service.name=${workflow.name}
 ----
 
-. Set the resource attributes to help filter traces in your dashboard:
+. Set the resource attributes to filter traces in your observability dashboard:
 +
 [source,yaml,subs="+attributes,+quotes"]
 ----
 quarkus.otel.resource.attributes=service.namespace=sonataflow-infra
 ----
 
-. Restart the workflow pod.
+. Restart the workflow pod to apply the new configuration.
 
 .Verification
 
-* Trigger a workflow execution and check the logs for trace identifiers:
+* Trigger a workflow execution.
+
+* Check the logs for trace identifiers:
 +
 [source,bash,subs="+attributes,+quotes"]
 ----
 oc logs <pod_name> | grep traceId
 ----
 
-* Make sure the `mdc` block in the JSON output now includes `traceId` and `spanId`.
+* Make sure the `mdc` block in the JSON output contains `traceId` and `spanId` fields with non-empty values.
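The verification step above checks that the `mdc` block carries non-empty trace identifiers. A minimal sketch of that check, using an invented log entry (the identifier values and message text are illustrative, not real output):

```python
import json

# A hypothetical JSON log entry after OpenTelemetry integration is enabled.
line = ('{"message": "state transition", '
        '"mdc": {"traceId": "4bf92f3577b34da6a3ce929d0e0e4736", '
        '"spanId": "00f067aa0ba902b7"}}')

def has_trace_context(raw_line):
    """Return True when the mdc block carries non-empty traceId and
    spanId fields, i.e. the log line can be correlated to a trace."""
    try:
        mdc = json.loads(raw_line).get("mdc", {})
    except json.JSONDecodeError:
        return False
    return bool(mdc.get("traceId")) and bool(mdc.get("spanId"))

print(has_trace_context(line))  # → True
```

An empty string or missing field fails the check, which matches the "non-empty values" requirement in the verification.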

modules/extend_orchestrator-in-rhdh/proc-integrate-workflows-with-external-systems.adoc

Lines changed: 7 additions & 5 deletions
@@ -4,15 +4,15 @@
 = Integrate workflows with external systems
 
 [role="_abstract"]
-Integrate SonataFlow workflows with external notification systems like Slack, PagerDuty or email. This ensures that the alerts generated are routed to the correct support teams.
+Integrate SonataFlow workflows with external notification systems, such as Slack, PagerDuty, or email. Routing alerts to external systems ensures that generated notifications reach the appropriate support teams.
 
 .Prerequisites
 
-* A valid webhook URL for your notification service (for example, Slack webhook).
+* You have a valid webhook URL for the notification service (for example, a Slack webhook).
 
 .Procedure
 
-. Edit your configuration to define a receiver and a routing path:
+. Define a receiver and a routing path in your Alertmanager configuration:
 +
 [source,yaml,subs="+quotes,+attributes"]
 ----
@@ -32,8 +32,10 @@ receivers:
 text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
 ----
 
-. Reload the configuration.
+. Reload the Alertmanager configuration to apply the changes.
 
 .Verification
 
-* Trigger a test alert and confirm the notification is received in the Slack channel or notification service.
+* Trigger a test alert in your workflow environment.
+
+* Monitor the external notification service (for example, the Slack channel `#workflow-alerts`). A notification appears in the external service containing the summary and details of the triggered alert.
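The receiver's `text` template, `'{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'`, concatenates the summary annotation of every alert in the webhook payload. A minimal Python sketch of what that template renders, with an illustrative alert payload:

```python
def render_slack_text(alerts):
    """Join the summary annotation of every firing alert, mirroring the
    Go template `{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}`."""
    return "".join(a["annotations"]["summary"] for a in alerts)

# Illustrative payload shaped like the alerts defined earlier in the commit.
alerts = [
    {"annotations": {"summary": "High error rate in SonataFlow workflows"}},
]
print(render_slack_text(alerts))  # → High error rate in SonataFlow workflows
```

Because the template joins summaries with no separator, grouped alerts run together; adding a delimiter inside the range block is a common refinement.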
