AI-Powered On-Call Support Agent built with Embabel on Spring Boot 4 #1491
Puneethkumarck
started this conversation in
Show and tell
Replies: 1 comment
@johnsonr FYI
Hi Embabel community! I wanted to share what we built over the past week — a production incident response agent that automates the first 15–30 minutes of manual investigation when an on-call engineer gets paged.
We fully validated it yesterday against a live microservice stack with real PagerDuty incidents, real Prometheus/Loki/ArgoCD data, and real LLM calls. 10 out of 11 tests passed. Here's how it's built and what we learned.
What it does
When a PagerDuty alert fires, the agent automatically:
The on-call engineer receives a structured report in ~10 seconds instead of spending 15–30 minutes opening dashboards manually.
Agent architecture
We built 8 GOAP agents, each responsible for a distinct incident response workflow:
- ServiceHealthAgent
- IncidentTriageAgent
- SLOMonitorAgent
- LogAnalysisAgent
- TraceAnalysisAgent
- DeployImpactAgent
- AlertFatigueAgent
- PostMortemAgent
- DeployRollbackAgent
The GOAP planning model maps really naturally to incident response: each action gathers one piece of evidence, and the agent plans a path through data collection to reach its goal. Here's what a typical agent looks like:
The typed blackboard records are a great fit here — each data-gathering step declares exactly what it produces and what it needs, and Embabel's planner figures out the execution order automatically. No manually wired pipelines.
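To make the "declares what it produces and what it needs" idea concrete, here is a minimal sketch in plain Java. It is not Embabel code and does not use Embabel's planner: TypedPlanner, ActionSpec, Alert, TriageReport and the action names are hypothetical, and the greedy forward chaining below is a simplification of GOAP that only shows how an execution order can fall out of typed inputs and outputs.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical blackboard types; field shapes are illustrative only.
record Alert(String service) {}
record MetricsData(double errorRate) {}
record LogSnapshot(int errorLines) {}
record TriageReport(String summary) {}

// An action declares the types it needs and the single type it produces.
record ActionSpec(String name, Set<Class<?>> needs, Class<?> produces) {}

class TypedPlanner {
    // Greedy forward chaining: run any action whose inputs are available until
    // the goal type is on the blackboard. A real GOAP planner searches over
    // costs and world states; this only demonstrates dependency ordering.
    static List<String> plan(Set<Class<?>> initial, List<ActionSpec> actions, Class<?> goal) {
        Set<Class<?>> available = new HashSet<>(initial);
        List<String> order = new ArrayList<>();
        boolean progressed = true;
        while (!available.contains(goal) && progressed) {
            progressed = false;
            for (ActionSpec a : actions) {
                if (!available.contains(a.produces()) && available.containsAll(a.needs())) {
                    available.add(a.produces());
                    order.add(a.name());
                    progressed = true;
                }
            }
        }
        if (!available.contains(goal)) {
            throw new IllegalStateException("No plan reaches " + goal.getSimpleName());
        }
        return order;
    }

    // Deliberately declared out of order: the planner works out the sequence.
    static List<ActionSpec> triageActions() {
        return List.of(
            new ActionSpec("writeReport", Set.of(MetricsData.class, LogSnapshot.class), TriageReport.class),
            new ActionSpec("fetchLogs", Set.of(Alert.class), LogSnapshot.class),
            new ActionSpec("fetchMetrics", Set.of(Alert.class), MetricsData.class));
    }
}
```

Calling TypedPlanner.plan with only an Alert on the blackboard and TriageReport as the goal schedules the two fetch actions before writeReport, even though writeReport is declared first.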
Personas
We defined 5 RoleGoalBackstory personas that shape LLM behaviour for different tasks:
This separation of persona by task meaningfully changes the LLM output. The SENIOR_SRE persona cuts to "ONE specific actionable remediation". The INCIDENT_COMMANDER persona produces structured, blameless post-mortems with P1/P2/P3 action items and due dates.
Human-in-the-loop: DeployRollbackAgent (design, not yet validated)
We built a DeployRollbackAgent that uses WaitFor.confirmation() as an approval gate before executing against ArgoCD. We haven't tested it end-to-end yet; we're including it here because the design feels like the right fit for WaitFor and we'd love feedback from anyone who has used it in a similar pattern.
The idea: the agent assesses rollback risk via LLM, then blocks until a human approves before touching ArgoCD.
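As a rough illustration of that gate, the sketch below models the approval in plain Java types. It does not call Embabel or ArgoCD: the Predicate stands in for WaitFor.confirmation(), RollbackPlan and RollbackExecutor are hypothetical, and only the name ApprovedRollbackPlan is taken from this post. The point is that the execute step cannot even be called without an object that exists only after approval.

```java
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical plan shape produced by the risk-assessment step.
record RollbackPlan(String application, String targetRevision, String riskSummary) {}

// Can only be created through request(), i.e. after the human decision.
final class ApprovedRollbackPlan {
    private final RollbackPlan plan;
    private final String approver;

    private ApprovedRollbackPlan(RollbackPlan plan, String approver) {
        this.plan = plan;
        this.approver = approver;
    }

    // humanDecision is a stand-in for the blocking approval prompt.
    static Optional<ApprovedRollbackPlan> request(
            RollbackPlan plan, String approver, Predicate<RollbackPlan> humanDecision) {
        return humanDecision.test(plan)
                ? Optional.of(new ApprovedRollbackPlan(plan, approver))
                : Optional.empty();
    }

    RollbackPlan plan() { return plan; }
    String approver() { return approver; }
}

class RollbackExecutor {
    // Stand-in for the ArgoCD call: it only describes what would be executed.
    static String executeRollback(ApprovedRollbackPlan approved) {
        return "rollback " + approved.plan().application()
                + " to " + approved.plan().targetRevision()
                + " (approved by " + approved.approver() + ")";
    }
}
```

If the reviewer declines, request returns an empty Optional and there is nothing to pass to executeRollback, which mirrors the dependency the planner relies on.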
The GOAP dependency model handles the gate cleanly: executeRollback requires ApprovedRollbackPlan, which only exists after the human approves. No special-case orchestration code needed.
What we haven't validated yet: how WaitFor.confirmation() behaves when the agent is triggered via a PagerDuty webhook (i.e. not from an interactive shell session). If anyone has experience routing the approval prompt through Slack or a webhook callback in this scenario, we'd be interested to hear how you approached it.
The Spring Boot 4 / Spring AI incompatibility
This was our biggest technical challenge, and I want to flag it for others who hit it.
The problem: Embabel 0.3.4 bundles Spring AI 1.x, which calls HttpHeaders.addAll(MultiValueMap), a method removed in Spring Framework 7 (which ships with Spring Boot 4). Both the spring-ai-anthropic and spring-ai-openai starters throw NoSuchMethodError at startup. The Ollama starter is not affected because it takes a different HTTP client path.
Our workaround: A lightweight Node.js proxy that accepts Ollama API format and forwards to the Anthropic API:
The proxy also:
- "response_format": {"type": "json_object"} on every request
We also had to set a custom Netty response timeout and increase the Embabel LLM operation timeout to 300s for complex prompts:
This is a workaround we're carrying until Embabel ships native Spring Boot 4 / Spring AI 2.0 GA support. Watching the repo for that.
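For readers wondering what the format translation in such a proxy involves, here is a hedged sketch of just the request and response mapping, written in Java for consistency with the rest of the project. The real proxy is Node.js and its code is not reproduced here, so OllamaToAnthropic, its method names and the 4096 default are assumptions. The sketch relies on two facts about the public APIs: Ollama chat requests carry system prompts as ordinary messages, while the Anthropic Messages API expects a top-level system field and a required max_tokens.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class OllamaToAnthropic {
    // Anthropic requires max_tokens; this value is an arbitrary illustrative default.
    static final int DEFAULT_MAX_TOKENS = 4096;

    // Maps a parsed Ollama /api/chat body to an Anthropic /v1/messages body.
    static Map<String, Object> toAnthropicRequest(Map<String, Object> ollamaChat, String anthropicModel) {
        @SuppressWarnings("unchecked")
        List<Map<String, String>> incoming =
                (List<Map<String, String>>) ollamaChat.getOrDefault("messages", List.of());
        List<Map<String, String>> messages = new ArrayList<>();
        StringBuilder system = new StringBuilder();
        for (Map<String, String> m : incoming) {
            if ("system".equals(m.get("role"))) {
                // System prompts move out of the message list into one top-level field.
                if (system.length() > 0) system.append("\n");
                system.append(m.get("content"));
            } else {
                messages.add(Map.of("role", m.get("role"), "content", m.get("content")));
            }
        }
        Map<String, Object> body = new LinkedHashMap<>();
        body.put("model", anthropicModel);
        body.put("max_tokens", DEFAULT_MAX_TOKENS);
        if (system.length() > 0) body.put("system", system.toString());
        body.put("messages", messages);
        return body;
    }

    // Wraps the text of an Anthropic reply in a minimal non-streaming Ollama chat response.
    static Map<String, Object> toOllamaResponse(String model, String text) {
        Map<String, Object> out = new LinkedHashMap<>();
        out.put("model", model);
        out.put("message", Map.of("role", "assistant", "content", text));
        out.put("done", true);
        return out;
    }
}
```

HTTP handling, authentication headers, streaming and the JSON-mode handling mentioned above are left out; only the shape conversion is shown.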
LLM JSON reliability
One pattern we hit repeatedly: when asking the LLM to populate Java records with nested types, it sometimes:
- returns a scalar (e.g. false) where a complex object is expected (e.g. DeployCorrelation)
- wraps the result in an extra key (e.g. "postMortemDraft": { ... } instead of the object directly)
- uses a different field name (e.g. affectedUsersEstimate vs affectedUsers)
- writes durations as freeform text (e.g. "~5 minutes") instead of ISO-8601
We solved this with custom Jackson deserializers on the affected records:
The deserializer unwraps the nested object if present, normalises field names, and parses freeform duration strings. For records the LLM populates with 3+ nested fields, a custom deserializer is now our default approach.
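The duration-parsing part of that approach can be sketched without any Jackson dependency. The helper below is an assumption about how such parsing might look, not the project's code: LenientDuration and its unit list are hypothetical. It tries ISO-8601 first and falls back to a small pattern for strings like "~5 minutes"; in a real deserializer this logic would be called from a Jackson JsonDeserializer.

```java
import java.time.Duration;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LenientDuration {
    // Matches a number followed by a seconds/minutes/hours unit,
    // e.g. "~5 minutes", "about 2 hours", "30 sec".
    private static final Pattern FREEFORM = Pattern.compile(
            "(\\d+)\\s*(s|sec|secs|second|seconds|m|min|mins|minute|minutes|h|hr|hrs|hour|hours)\\b");

    static Duration parse(String raw) {
        String text = raw.trim();
        try {
            return Duration.parse(text); // well-formed ISO-8601 such as "PT5M"
        } catch (DateTimeParseException notIso) {
            // fall through to the freeform pattern
        }
        Matcher m = FREEFORM.matcher(text.toLowerCase(Locale.ROOT));
        if (!m.find()) {
            throw new IllegalArgumentException("Unrecognised duration: " + raw);
        }
        long amount = Long.parseLong(m.group(1));
        // The first letter of the unit is enough to tell s, m and h apart.
        return switch (m.group(2).charAt(0)) {
            case 's' -> Duration.ofSeconds(amount);
            case 'm' -> Duration.ofMinutes(amount);
            default -> Duration.ofHours(amount);
        };
    }
}
```

Anything the pattern does not recognise is rejected with an exception, so the caller can decide whether to fail or fall back to a default.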
Validation results
Tested against a live price-alert platform (5 Spring Boot microservices, Kafka, PostgreSQL, Prometheus, Loki, Tempo, ArgoCD, PagerDuty EU):
[Results table not recovered; surviving fragment: "25268 as correlated, specific rollback recommendation"]
10/11 pass on the agents we tested. DeployRollbackAgent is implemented but not yet validated end-to-end. End-to-end response time is 2–10 seconds per agent (data fetch ~0.3s + LLM ~2–9s).
Lessons learned
GOAP is the right model for incident workflows. Evidence gathering maps naturally to actions, each action declares its typed inputs and outputs, and the planner composes them automatically. We didn't write any orchestration code — just actions.
Typed blackboard records are ergonomic. Defining inner records per agent (LogSnapshot, MetricsData, DeployData) keeps the type graph readable and gives Embabel enough information to plan correctly without ambiguity.
Graceful degradation is the hardest part. When Prometheus, ArgoCD, or PagerDuty are unavailable, the agent must return empty defaults rather than crashing. We fixed 7 resilience issues during testing, all of the same shape: an adapter throwing an exception instead of returning empty.
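One way to express the "return empty instead of throwing" rule is a small wrapper around each adapter call. The sketch below is illustrative only: Degrade.orDefault and the field layout of DeployData are assumptions, not the project's code.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical record shape; only the name DeployData appears in this post.
record DeployData(List<String> recentRevisions) {
    static DeployData empty() {
        return new DeployData(List.of());
    }
}

class Degrade {
    // Runs a data fetch and falls back to the given empty value if the
    // adapter throws or returns null, so downstream actions still get a
    // well-typed (if empty) input.
    static <T> T orDefault(Supplier<T> fetch, T emptyDefault) {
        try {
            T value = fetch.get();
            return value != null ? value : emptyDefault;
        } catch (RuntimeException e) {
            // A real adapter would log the failure before degrading.
            return emptyDefault;
        }
    }
}
```

With this in place, a failing fetch such as an unreachable ArgoCD yields DeployData.empty(), and the LLM step can still run with the evidence that was available.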
Personas meaningfully change output quality. The difference between SENIOR_SRE (terse, actionable, MTTR-focused) and INCIDENT_COMMANDER (structured, blameless, time-bound action items) is significant. Worth spending time on persona wording.
Repos
OnCall Support Agent (the Embabel agent — this project):
https://github.com/Puneethkumarck/oncall-support-agent
Price Alert (the target microservice platform used for all validation testing — 5 Spring Boot services, Kafka, PostgreSQL, Prometheus, Loki, Tempo, Grafana, ArgoCD):
https://github.com/Puneethkumarck/price-alert
Stack: Java 25 · Spring Boot 4.0.3 · Embabel 0.3.4 · Claude Haiku 4.5 · Prometheus · Loki · Tempo · ArgoCD · PagerDuty
Happy to answer questions about the GOAP action design, the LLM JSON patterns, or the Spring Boot 4 workaround.