Status as of ~06:45 UTC, after a full overnight pass on coder-observability + RHAIIS verification.
- `coder-observability` is up. All 16 pods Running. Grafana reachable at https://graf-coder.apps.cluster.rhsummit.coderdemo.io. GitHub OAuth wired with admin-team → Grafana Admin role mapping. Same `demo-rhsummit-users` org gate as Coder (decision #20) and OpenShift (decision #21).
- `rhaiis` was already healthy. vLLM running on the GPU node (ip-10-0-7-41), `ibm-granite/granite-3.1-8b-instruct` loaded, `/v1/chat/completions` returns coherent answers. No changes needed.
- All other Argo apps Synced/Healthy (with two cosmetic OutOfSync flags noted below).
- Decisions doc updated — new §22 covers the full k3s → OCP translation work for observability.
- URL: https://graf-coder.apps.cluster.rhsummit.coderdemo.io
- Login flow: GitHub OAuth → `demo-rhsummit-users` org members admitted; `admin` team members get Grafana Admin role automatically; everyone else lands as Viewer.
- Dashboards present: the Coder bundle (coderd, provisionerd, workspaces, prebuilds, agent-boundaries, AI Bridge, status), plus Loki dashboards and the runbook viewer.
- Datasources: Prometheus (in-cluster), Loki (in-cluster), Infinity (for the AI Bridge dashboard's litellm pricing-data lookup).
gitops/apps/observability/application.yaml — comprehensive k3s→OCP translation
manifests/observability/route.yaml — host=graf-coder, service=grafana, port=service
manifests/observability/certificate.yaml — graf-coder.apps... (was grafana.apps...)
manifests/observability/grafana-agent-scc.yaml — NEW; binds agent SA to hostmount-anyuid
manifests/secrets/grafana-github-oauth.yaml — NEW; sealed OAuth client id/secret
docs/decisions.md — new §22 (full translation rationale)
MORNING_REPORT.md — this file
Git log overnight:
78bff88 fix(observability): bind grafana-agent SA to hostmount-anyuid SCC
bf98080 fix(observability): Route -> grafana svc (not coder-observability-grafana)
ec457de fix(observability): alertmanager container-level SC + minio SC + minio memory
fa954ca fix(observability): mirror memcached + disable memcached-exporter (DH limit)
9380ba9 fix(observability): point loki-gateway DNS resolver at OpenShift CoreDNS
cb2f0f5 fix(observability): pin grafana UID + disable initChownData (chart YAML error)
a29b0f3 fix(observability): pin alertmanager+kube-state-metrics UID (schema rejects null)
43e97c9 feat(observability): translate k3s coder-observability config to OCP
(Followed by a docs commit landing decision §22 — coming next.)
The chart had been OutOfSync/Degraded for ~5 hours. Five distinct issues, each rejecting a different subcomponent:
- Restricted-v2 SCC vs hardcoded UIDs. Every subchart hardcoded a runAsUser outside the namespace's uid-range (`grafana` 472, `prometheus.server` 65534, `prometheus.alertmanager` 65534, `prometheus.kube-state-metrics` 65534, `loki.loki` 10001, `loki.gateway` 101, `loki.minio` 1000). Two patterns to fix: nullify (the SCC mutating admission then injects a valid UID) for the schema-permissive subcharts, or pin to `1000800000` (start of the allowed range) for the schema-strict ones. Detailed per-subchart in `application.yaml` comments and decision §22.
- Grafana init-chown-data init container uses `.Values.securityContext.runAsUser` inline in its `chown` command. With null, the rendered StatefulSet had `chown -R : /var/lib/grafana`, which fails Helm's YAML parse. Fix: pin grafana UID + disable the init container entirely (OCP's CSI driver chgrps with the SCC-injected fsGroup; manual chown is redundant).
- Loki gateway DNS resolver. The nginx config embeds `<dnsService>.<dnsNamespace>.svc.cluster.local`. Defaults are `kube-dns.kube-system` (vanilla k8s); OCP's CoreDNS is `dns-default.openshift-dns`. Override `loki.global.{dnsService,dnsNamespace}`.
- Docker Hub rate limits on three images:
  - `memcached` → `mirror.gcr.io/library/memcached`.
  - `prom/memcached-exporter` → no published mirror, disabled.
  - `burningalchemist/sql_exporter` → no published mirror, disabled.
- grafana-agent hostPath blocked by restricted-v2. The DS ships node logs to Loki via hostPath mounts. New manifest `manifests/observability/grafana-agent-scc.yaml` binds the agent SA to `hostmount-anyuid`. Standard OCP pattern for log shippers: narrower than `privileged`, broader than `restricted-v2`.
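
To see why the chart defaults were rejected, the hardcoded UIDs can be checked against the namespace's allowed range. This is a minimal sketch, assuming the range is `1000800000/10000`: the start matches the pinned value above, but the size of 10000 is an assumption, and the real value should be read from the namespace's `openshift.io/sa.scc.uid-range` annotation. `uid_in_range` is a hypothetical helper, not something that exists in the repo.

```shell
# Hypothetical helper: succeed if UID $2 falls inside range $1 ("<start>/<size>").
uid_in_range() {
  _start=${1%%/*}
  _size=${1##*/}
  [ "$2" -ge "$_start" ] && [ "$2" -lt $((_start + _size)) ]
}

# Assumed range; on the cluster it can be read with:
#   oc get namespace coder-observability \
#     -o jsonpath='{.metadata.annotations.openshift\.io/sa\.scc\.uid-range}'
RANGE="1000800000/10000"

# Chart-default UIDs from the list above, plus the pinned value.
for uid in 472 65534 10001 101 1000 1000800000; do
  if uid_in_range "$RANGE" "$uid"; then
    echo "$uid: allowed"
  else
    echo "$uid: outside $RANGE"
  fi
done
```

Under that assumed range, only the pinned value passes, which is consistent with every subchart default being rejected by restricted-v2.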
Plus two smaller things:
- The Route was pointed at the wrong Service name (`coder-observability-grafana` vs the actual `grafana`) and wrong port name (`http-web` vs `service`).
- minio's chart-default 16Gi memory request would have blocked scheduling on the converged cluster; trimmed to 1Gi for the demo.
These three things are knowingly off and called out in application.yaml + docs/decisions.md §22 → "Tradeoffs":
- `sqlExporter.enabled: false` (Coder-specific Postgres SQL metrics). Image is DH-only with no upstream mirror; either we mirror it ourselves to GHCR/Quay or accept the gap. Coder dashboards still work because Prometheus scrapes Coder pods directly.
- `global.postgres.exporter.enabled: false` (generic Postgres metrics). Chart wants `secret-postgres` with PGUSER/PGPASSWORD-shaped keys. CNPG's `coder-app` Secret has different keys (uri/host/port/dbname/user/password/jdbc-uri). Need a translation Secret or a small Job that materializes one. Postgres-specific Grafana panels will be empty until this is wired.
- `loki.memcachedExporter.enabled: false` (observability of the memcached caches themselves). Image is DH-only.
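
The key mismatch behind the disabled postgres-exporter could be bridged by re-keying the two fields the exporter needs. This is a rough sketch, assuming the chart reads exactly `PGUSER` and `PGPASSWORD` (the note above only says "PGUSER/PGPASSWORD-shaped", so check the chart before relying on it). `render_pg_secret` is a hypothetical helper, and the credentials shown are placeholders.

```shell
# Hypothetical helper: print a Secret manifest named secret-postgres.
# $1 = user, $2 = password, $3 = target namespace
render_pg_secret() {
  # Base64-encode the values so special characters cannot break the YAML.
  _user_b64=$(printf '%s' "$1" | base64 | tr -d '\n')
  _pass_b64=$(printf '%s' "$2" | base64 | tr -d '\n')
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: secret-postgres
  namespace: $3
type: Opaque
data:
  PGUSER: $_user_b64
  PGPASSWORD: $_pass_b64
EOF
}

# Example with placeholder credentials (not real values):
render_pg_secret coder example-password coder-observability
```

On the cluster, the inputs would come from the `user` and `password` keys of CNPG's `coder-app` Secret, and the output could be piped to `oc apply -f -` or sealed like the other secrets in the repo.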
Already running, no changes required.
Pod: vllm-77b6bb9dfd-2mb2l on ip-10-0-7-41.ec2.internal (us-east-1a, GPU worker)
Image: quay.io/modh/vllm:rhoai-2.20-cuda # no subscription needed
Model: ibm-granite/granite-3.1-8b-instruct
Tool parser: granite (--enable-auto-tool-choice)
Endpoint: POST http://vllm.ocp-ai.svc.cluster.local:8000/v1/chat/completions
End-to-end smoke test passed (curl /v1/chat/completions returns coherent answer). The image path uses quay.io/modh/vllm (Red Hat's open-access publication) rather than registry.redhat.io so we don't need the RHOAI subscription SKU on the partner pull secret — that was the call you flagged as "the other requires a sub I dont want to deal with."
NAME                  SYNC STATUS   HEALTH STATUS
cert-manager          Synced        Healthy
cluster-config        Synced        Healthy
coder                 Synced        Healthy
coder-observability   OutOfSync     Progressing     [see "cosmetic OutOfSync" below]
coder-provisioner     Synced        Healthy
coder-routing         OutOfSync     Healthy
gpu-stack             Synced        Healthy
group-sync-operator   Synced        Healthy
platform-secrets      Synced        Healthy
postgres              Synced        Healthy
rhaiis                Synced        Healthy
root                  Synced        Healthy
sealed-secrets        OutOfSync     Healthy
All workloads I touched are functionally correct.
- `sealed-secrets`: Argo's diff-cache shows a phantom diff on the Bitnami Helm Deployment; `argocd app diff` returns empty. Documented in earlier sealed-secrets rollout notes.
- `coder-observability`: three sub-resources flag OutOfSync:
  - `PersistentVolumeClaim/grafana` (Pending): orphan PVC the chart's grafana template renders alongside the StatefulSet's volumeClaimTemplate. The SS's `storage-grafana-0` is the real Bound PVC; this leftover doesn't bind because nothing claims it. I tried deleting it, but Argo recreates it (it's part of the chart's render). Functionally inert.
  - 3 StatefulSets (alertmanager, grafana, loki-storage): `OutOfSync Healthy` from SSA managedFields after our manual delete-and-recreate cycle. They self-heal to Healthy; the OutOfSync label is cosmetic and matches the same pattern we hit on sealed-secrets.
  - kube-state-metrics ClusterRole/ClusterRoleBinding: Argo tracking-id label diff on cluster-scoped objects.
- `coder-routing`: pre-existing, not touched this session.
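
To pick the non-Synced apps out of the overview quickly, the table output can be filtered with awk. This is a small sketch, assuming the three-column layout shown above (NAME, SYNC STATUS, HEALTH STATUS); `out_of_sync` is a hypothetical helper name.

```shell
# Hypothetical helper: read `oc get applications` output on stdin and
# print every app whose sync column is not "Synced", with its health.
out_of_sync() {
  awk 'NR > 1 && $2 != "Synced" { print $1 " (" $3 ")" }'
}

# Sample rows copied from the status table in this report.
out_of_sync <<'EOF'
NAME SYNC STATUS HEALTH STATUS
coder Synced Healthy
coder-observability OutOfSync Progressing
coder-routing OutOfSync Healthy
sealed-secrets OutOfSync Healthy
EOF
```

Against the live cluster the same filter would be used as `oc get applications -n openshift-gitops | out_of_sync`.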
- Mirror sql-exporter and memcached-exporter to GHCR (a small `Job` per image using skopeo, or commit-time CI to a registry we own). Re-enable both. Restores the metrics gap.
- Translate `coder-app` Secret → `secret-postgres` so the postgres-exporter can connect. Either a one-shot init Job that reads CNPG's auto-generated values and materializes a properly-shaped Secret in `coder-observability`, or extend the Coder helm install to add the translation as a Helm hook.
- Switch grafana-agent to OpenShift Logging. Replace the chart's agent + hostmount-anyuid SCC binding with the cluster-logging operator's Vector/LokiStack pipeline. Preserves Loki for queries; uses operator-managed SCCs.
- Cleanup: `git rm docs/sealed-secrets-rollout.md` (the one-shot runbook from earlier today).
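
The mirroring step could be driven by a short script that prints one `skopeo copy` command per image. This is a sketch only: the destination `ghcr.io/example-org` and the `v0.0.0` tags are placeholders (this report does not name a registry path or image versions), and `mirror_ref` and `mirror_cmd` are hypothetical helpers.

```shell
# Hypothetical helper: map "repo/name:tag" to "<dest>/name:tag".
# $1 = destination registry/namespace, $2 = Docker Hub reference
mirror_ref() {
  printf '%s/%s\n' "$1" "${2##*/}"
}

# Print (not run) the skopeo command that would copy the image.
mirror_cmd() {
  printf 'skopeo copy --all docker://docker.io/%s docker://%s\n' \
    "$2" "$(mirror_ref "$1" "$2")"
}

DEST="ghcr.io/example-org"   # placeholder destination
for image in prom/memcached-exporter:v0.0.0 burningalchemist/sql_exporter:v0.0.0; do
  mirror_cmd "$DEST" "$image"
done
```

Printing the commands first keeps the step reviewable; once the real tags and destination are filled in, the output can be piped to `sh` after `skopeo login` to the target registry.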
export KUBECONFIG=/tmp/kubeconfig
# 1. All pods running?
oc get pods -n coder-observability
# expected: 16 pods Running
# 2. Grafana login page
curl -sf -o /dev/null -w "HTTP %{http_code}\n" \
https://graf-coder.apps.cluster.rhsummit.coderdemo.io/api/health
# expected: HTTP 200
# 3. RHAIIS smoke test
oc apply -n ocp-ai -f - <<'YAML'
apiVersion: v1
kind: Pod
metadata: { name: vllm-smoketest, namespace: ocp-ai }
spec:
restartPolicy: Never
containers:
- name: c
image: curlimages/curl:latest
command: ["sleep","60"]
securityContext:
allowPrivilegeEscalation: false
runAsNonRoot: true
capabilities: { drop: [ALL] }
seccompProfile: { type: RuntimeDefault }
YAML
oc wait pod/vllm-smoketest -n ocp-ai --for=condition=Ready --timeout=30s
oc exec -n ocp-ai vllm-smoketest -- curl -sf -X POST \
http://vllm.ocp-ai.svc.cluster.local:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"ibm-granite/granite-3.1-8b-instruct","messages":[{"role":"user","content":"hi"}],"max_tokens":10}'
oc delete pod vllm-smoketest -n ocp-ai --grace-period=1
# expected: a JSON response with content from the model
# 4. Argo apps overview
oc get applications -n openshift-gitops

- The Grafana OAuth client secret you pasted in chat for the script (`f37679...`) is now sealed in `manifests/secrets/grafana-github-oauth.yaml`. The plaintext was wiped from disk. The chat history still contains it, so I recommend rotating it in GitHub post-demo.
- The Coder API token (`vTVuC...`) that you pasted earlier for me to mint the provisioner key: same recommendation.
- The provisioner key plaintext briefly leaked into output when I extracted it (the new Coder CLI prints it on a bare line). Sealed manifest is committed; rotation via `coder provisioner keys delete rhsummit-demo` + re-mint is a one-liner if you want to be tidy.
Sleep well.