AI-Powered On-Call Support Agent built with Embabel on Spring Boot 4 #1491

Puneethkumarck · 2026-03-13T07:19:11Z

Hi Embabel community! I wanted to share what we built over the past week — a production incident response agent that automates the first 15–30 minutes of manual investigation when an on-call engineer gets paged.

We fully validated it yesterday against a live microservice stack with real PagerDuty incidents, real Prometheus/Loki/ArgoCD data, and real LLM calls. 10 out of 11 tests passed. Here's how it's built and what we learned.

What it does

When a PagerDuty alert fires, the agent automatically:

Fetches current metrics from Prometheus (error rate, latency, throughput, CPU, memory)
Queries recent error logs from Loki
Pulls deploy history from ArgoCD
Retrieves alert history from PagerDuty
Sends everything to Claude Haiku to synthesize a triage report

The on-call engineer receives a structured report in ~10 seconds instead of spending 15–30 minutes opening dashboards manually.

Agent architecture

We built 8 GOAP agents, each responsible for a distinct incident response workflow, plus a ninth (DeployRollbackAgent) that is built but not yet tested:

Agent	Embabel Goal	What it produces
`ServiceHealthAgent`	Assess service health	RED/AMBER/GREEN health card with SLI/SLO data
`IncidentTriageAgent`	Triage incident	Severity, blast radius, root cause, remediation
`SLOMonitorAgent`	Monitor error budget	Burn rate, budget remaining, deploy correlation
`LogAnalysisAgent`	Analyze error patterns	Clustered error patterns from Loki
`TraceAnalysisAgent`	Find bottlenecks	Distributed trace analysis from Tempo
`DeployImpactAgent`	Assess deploy impact	Pre/post metric comparison, rollback recommendation
`AlertFatigueAgent`	Reduce alert noise	Noise %, duplicate groups, tuning recommendations
`PostMortemAgent`	Draft post-mortem	Blameless post-mortem with prioritised action items
`DeployRollbackAgent`	Execute rollback	Human-in-the-loop approval gate → ArgoCD rollback (built, not yet tested)
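For readers unfamiliar with the terms in the SLOMonitorAgent row: burn rate is conventionally the observed error ratio divided by the error ratio the SLO allows, and budget remaining is the unused share of allowed failures in the window. The sketch below uses those standard definitions only. `ErrorBudget` and its methods are hypothetical names and are not taken from the agent's code, which may compute these differently.

```java
// Hypothetical helper illustrating the conventional SRE definitions;
// not taken from the SLOMonitorAgent source.
final class ErrorBudget {
    private ErrorBudget() {}

    /** Burn rate = observed error ratio / allowed error ratio (1 - SLO target). */
    static double burnRate(double observedErrorRatio, double sloTarget) {
        return observedErrorRatio / allowedErrorRatio(sloTarget);
    }

    /** Fraction of the window's error budget still unused, clamped to [0, 1]. */
    static double budgetRemaining(long failedRequests, long totalRequests, double sloTarget) {
        double allowed = allowedErrorRatio(sloTarget);
        if (totalRequests <= 0) {
            return 1.0; // no traffic means no budget has been spent
        }
        double allowedFailures = totalRequests * allowed;
        return Math.max(0.0, 1.0 - failedRequests / allowedFailures);
    }

    private static double allowedErrorRatio(double sloTarget) {
        if (sloTarget <= 0.0 || sloTarget >= 1.0) {
            throw new IllegalArgumentException("SLO target must be between 0 and 1 (exclusive)");
        }
        return 1.0 - sloTarget;
    }
}
```

With a 99.9% target, a 0.5% error ratio gives a burn rate of roughly 5, meaning the budget is being spent about five times faster than the SLO allows.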

The GOAP planning model maps naturally to incident response: each action gathers one piece of evidence, and the agent plans a path through data collection to reach its goal. Here's what a typical agent looks like (abridged; method bodies and the actions that produce AlertHistoryData and AnnotationData are omitted):

@Agent(description = "Auto-triage production incidents by gathering logs, metrics, deploys, "
    + "alert history, and annotations to produce a pre-investigated incident summary")
public class IncidentTriageAgent {

    // Blackboard state — each action produces a typed record
    public record LogSnapshot(List<LogCluster> clusters) {}
    public record MetricsData(MetricsSnapshot snapshot) {}
    public record DeployData(DeploySnapshot snapshot) {}
    public record AlertHistoryData(AlertHistorySnapshot snapshot) {}
    public record AnnotationData(List<String> annotations) {}

    @Action(description = "Parse alert from user input into structured AlertContext")
    public AlertContext parseAlert(UserInput userInput) { ... }

    @Action(description = "Fetch recent error logs from Loki for the alerted service")
    public LogSnapshot fetchRecentLogs(AlertContext alert) { ... }

    @Action(description = "Fetch current service metrics from Prometheus")
    public MetricsData fetchServiceMetrics(AlertContext alert) { ... }

    @Action(description = "Fetch recent deploy history from ArgoCD")
    public DeployData fetchRecentDeploys(AlertContext alert) { ... }

    @Action(description = "Analyze all gathered data and produce incident assessment")
    public IncidentAssessment triageAndAnalyze(
            AlertContext alert, LogSnapshot logs, MetricsData metrics,
            DeployData deploy, AlertHistoryData alertHistory,
            AnnotationData annotations, Ai ai) {
        // `prompt` is built from the gathered evidence; its construction is omitted here
        return ai.withDefaultLlm()
                .withPromptContributor(OnCallPersonas.SENIOR_SRE)
                .creating(IncidentAssessment.class)
                .fromPrompt(prompt);
    }

    @AchievesGoal(description = "Produce formatted incident triage report",
        export = @Export(name = "triageIncident", remote = true,
                         startingInputTypes = {UserInput.class}))
    @Action(description = "Format triage report as markdown and prepare for notification")
    public FormattedTriageReport formatAndNotify(
            IncidentAssessment assessment, AlertContext alert,
            LogSnapshot logs, MetricsData metrics, DeployData deploy) { ... }
}

The typed blackboard records are a great fit here — each data-gathering step declares exactly what it produces and what it needs, and Embabel's planner figures out the execution order automatically. No manually wired pipelines.
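To make the ordering idea concrete, here is a minimal plain-Java sketch of type-driven scheduling. It is not Embabel's planner, which does real GOAP planning and is considerably more capable. `TypePlanner`, `Step`, and `plan` are hypothetical names, and types are represented as strings for brevity. The sketch repeatedly runs the first action whose inputs are already available until the goal type appears, so it may schedule actions a real planner would skip.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Simplified illustration only; Embabel's actual planner works differently.
final class TypePlanner {
    private TypePlanner() {}

    /** An action that needs every type in `inputs` and produces `output`. */
    record Step(String name, Set<String> inputs, String output) {}

    /** Returns action names in an order where each action's inputs already exist. */
    static List<String> plan(List<Step> steps, Set<String> initialTypes, String goalType) {
        Set<String> available = new LinkedHashSet<>(initialTypes);
        List<Step> remaining = new ArrayList<>(steps);
        List<String> order = new ArrayList<>();

        while (!available.contains(goalType)) {
            Step next = null;
            for (Step candidate : remaining) {
                if (available.containsAll(candidate.inputs())) {
                    next = candidate;
                    break;
                }
            }
            if (next == null) {
                // Nothing can run: some required type is never produced.
                throw new IllegalStateException("Goal unreachable: " + goalType);
            }
            remaining.remove(next);
            available.add(next.output());
            order.add(next.name());
        }
        return order;
    }
}
```

Declaring the actions in any order and starting from UserInput, `plan` puts parseAlert first and formatAndNotify last, because each later action's input types only become available once the earlier ones have run.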

Personas

We defined 5 RoleGoalBackstory personas that shape LLM behaviour for different tasks. Two of them are shown here:

public static final RoleGoalBackstory SENIOR_SRE = new RoleGoalBackstory(
    "Senior Site Reliability Engineer",
    "Identify the single root cause of production incidents and provide one specific actionable remediation",
    "SRE with 12 years experience operating large-scale distributed systems. Expert at correlating "
        + "Prometheus metrics, log patterns, and ArgoCD deployment history to determine if issues are "
        + "deploy-related, infrastructure-related, or dependency-related. Prioritizes MTTR over perfection."
);

public static final RoleGoalBackstory INCIDENT_COMMANDER = new RoleGoalBackstory(
    "Incident Commander",
    "Produce a clear, blameless post-mortem with specific action items",
    "Incident management lead with 10 years experience. Writes blameless post-mortems focused on "
        + "systemic improvements. Ensures every action item is specific, assigned, and time-bound."
);

This separation of persona by task meaningfully changes the LLM output. The SENIOR_SRE persona cuts to "ONE specific actionable remediation". The INCIDENT_COMMANDER persona produces structured, blameless post-mortems with P1/P2/P3 action items and due dates.

Human-in-the-loop: DeployRollbackAgent (design, not yet validated)

We built a DeployRollbackAgent that uses WaitFor.confirmation() as an approval gate before executing against ArgoCD. We haven't tested it end-to-end yet, but we're including it here because the design feels like the right fit for WaitFor, and we'd love feedback from anyone who has used it in a similar pattern.

The idea: the agent assesses rollback risk via LLM, then blocks until a human approves before touching ArgoCD.

@Action(description = "Request human approval before executing rollback")
public ApprovedRollbackPlan requestHumanApproval(RollbackPlan plan) {
    String message = String.format("""
        ROLLBACK APPROVAL REQUIRED

        Service: %s
        Current Revision: %s
        Target Revision: %s

        Risk: %s
        Expected Impact: %s

        Do you approve this rollback?""",
        plan.service(), plan.currentRevision(), plan.targetRevision(),
        plan.riskSummary(), plan.expectedImpact());

    RollbackPlan confirmed = WaitFor.confirmation(plan, message);
    return new ApprovedRollbackPlan(confirmed);
}

@Action(description = "Execute rollback via ArgoCD after human approval")
public RollbackExecutionResult executeRollback(ApprovedRollbackPlan approved) {
    return new RollbackExecutionResult(
        deployRollbackProvider.executeRollback(
            approved.plan().service(), approved.plan().targetRevision()));
}

The GOAP dependency model handles the gate cleanly: executeRollback requires ApprovedRollbackPlan, which only exists after the human approves. No special-case orchestration code needed.

What we haven't validated yet: how WaitFor.confirmation() behaves when the agent is triggered via a PagerDuty webhook (i.e. not from an interactive shell session). If anyone has experience routing the approval prompt through Slack or a webhook callback in this scenario, we'd be interested to hear how you approached it.

The Spring Boot 4 / Spring AI incompatibility

This was our biggest technical challenge, and I want to flag it for others who hit it.

The problem: Embabel 0.3.4 bundles Spring AI 1.x, which calls HttpHeaders.addAll(MultiValueMap) — a method removed in Spring Framework 7 (which ships with Spring Boot 4). Both the spring-ai-anthropic and spring-ai-openai starters throw NoSuchMethodError at startup. The Ollama starter is not affected because it takes a different HTTP client path.

Our workaround: A lightweight Node.js proxy that accepts Ollama API format and forwards to the Anthropic API:

Spring Boot App → Ollama API format → :11435 (Node proxy) → Anthropic API → Claude Haiku 4.5

The proxy also:

Forces "response_format": {"type": "json_object"} on every request
Strips markdown code fences from responses before returning them (Claude wraps JSON in ```json blocks by default)
Uses progressive streaming to keep the Netty connection alive during long LLM calls (we hit a 10s Netty read timeout on PostMortem prompts before switching to streaming)
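The fence-stripping step is simple enough to sketch. The proxy itself is Node.js; what follows is a Java rendering of the same idea rather than the proxy's code. `FenceStripper` and `strip` are hypothetical names, and the sketch assumes the whole response is either bare JSON or a single fenced block.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: removes one surrounding Markdown code fence, if present.
final class FenceStripper {
    private FenceStripper() {}

    // Three backticks, the Markdown fence delimiter.
    private static final String FENCE = "`".repeat(3);

    // Opening fence with an optional language tag, the body, then the closing fence.
    private static final Pattern FENCED = Pattern.compile(
        "^\\s*" + FENCE + "[A-Za-z0-9_-]*[ \\t]*\\r?\\n(.*?)\\s*" + FENCE + "\\s*$",
        Pattern.DOTALL);

    /** Returns the fenced body when the response is fenced, else the trimmed response. */
    static String strip(String response) {
        if (response == null) {
            return "";
        }
        Matcher matcher = FENCED.matcher(response);
        return matcher.matches() ? matcher.group(1).trim() : response.trim();
    }
}
```

Unfenced responses pass through unchanged apart from trimming, so the step is safe to apply to every response.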

We also had to set a custom Netty response timeout and increase the Embabel LLM operation timeout to 300s for complex prompts:

embabel:
  agent:
    platform:
      llm-operations:
        prompts:
          default-timeout: 300s

This is a workaround we're carrying until Embabel ships native Spring Boot 4 / Spring AI 2.0 GA support. Watching the repo for that.

LLM JSON reliability

One pattern we hit repeatedly: when asking the LLM to populate Java records with nested types, it sometimes:

Returns a primitive (false) where a complex object is expected (e.g. DeployCorrelation)
Wraps the response in an extra object ("postMortemDraft": { ... } instead of the object directly)
Uses alternative field names (affectedUsersEstimate vs affectedUsers)
Returns duration as a string ("~5 minutes") instead of ISO-8601

We solved this with custom Jackson deserializers on the affected records:

@JsonDeserialize(using = PostMortemDraftDeserializer.class)
public record PostMortemDraft(...) {}

The deserializer unwraps the nested object if present, normalises field names, and parses freeform duration strings. For records the LLM populates with 3+ nested fields, a custom deserializer is now our default approach.
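As an illustration of the duration-parsing part, here is a small standalone sketch that tries ISO-8601 first and falls back to a loose pattern such as "~5 minutes". It uses only the JDK and is not the project's deserializer; `LenientDurations` and `parse` are hypothetical names, and the set of accepted unit spellings is an assumption. In a Jackson deserializer, logic like this would sit behind the field that holds the duration.

```java
import java.time.Duration;
import java.time.format.DateTimeParseException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: accepts ISO-8601 ("PT5M") or loose text ("~5 minutes").
final class LenientDurations {
    private LenientDurations() {}

    private static final Pattern FREEFORM = Pattern.compile(
        "^~?\\s*(?:about\\s+|approx\\.?\\s+)?(\\d{1,9})\\s*"
            + "(seconds?|secs?|s|minutes?|mins?|m|hours?|hrs?|h)$",
        Pattern.CASE_INSENSITIVE);

    /** Returns the parsed duration, or empty when the text is not understood. */
    static Optional<Duration> parse(String raw) {
        if (raw == null || raw.isBlank()) {
            return Optional.empty();
        }
        String text = raw.trim();
        try {
            return Optional.of(Duration.parse(text)); // strict ISO-8601 first
        } catch (DateTimeParseException ignored) {
            // fall through to the freeform pattern
        }
        Matcher matcher = FREEFORM.matcher(text);
        if (!matcher.matches()) {
            return Optional.empty();
        }
        long amount = Long.parseLong(matcher.group(1));
        char unit = Character.toLowerCase(matcher.group(2).charAt(0));
        if (unit == 's') {
            return Optional.of(Duration.ofSeconds(amount));
        } else if (unit == 'm') {
            return Optional.of(Duration.ofMinutes(amount));
        }
        return Optional.of(Duration.ofHours(amount));
    }
}
```

Returning an empty Optional for unrecognised text leaves the caller to decide whether to fail the record or substitute a default.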

Validation results

Tested against a live price-alert platform (5 Spring Boot microservices, Kafka, PostgreSQL, Prometheus, Loki, Tempo, ArgoCD, PagerDuty EU):

Test	Result
ServiceHealthAgent (4 services)	PASS — real Prometheus SLI/SLO data
SLOMonitorAgent	PASS — identified deploy commit `25268` as correlated, specific rollback recommendation
LogAnalysisAgent	PASS — 0 error clusters (healthy services, valid result)
TraceAnalysisAgent	FAIL — Tempo has no trace data (infrastructure gap, not agent code)
IncidentTriageAgent	PASS — real PagerDuty incident, 6 evidence points, deploy-to-outage timeline
DeployImpactAgent	PASS — detected memory +2890% after deploy commit `25268`
AlertFatigueAgent	PASS — 6 real incidents, 33% noise, 4 duplicates identified
PostMortemAgent	PASS — 9KB post-mortem: 6-entry PagerDuty timeline, root cause, 5 prioritised action items
High load (20 concurrent alerts)	PASS — no timeouts
Service down (container stopped)	PASS — RED/SEV1 correctly detected
Prometheus down	PASS — graceful degradation, full recovery after restart
DeployRollbackAgent	NOT TESTED — built but end-to-end validation pending

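10 of the 11 executed tests passed (8 agent tests and 3 load/failure scenarios); the one failure, TraceAnalysisAgent, was caused by missing Tempo trace data rather than agent code. DeployRollbackAgent is implemented but not yet validated end-to-end. End-to-end response time is 2–10 seconds per agent (data fetch ~0.3s + LLM ~2–9s).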

Lessons learned

GOAP is the right model for incident workflows. Evidence gathering maps naturally to actions, each action declares its typed inputs and outputs, and the planner composes them automatically. We didn't write any orchestration code — just actions.

Typed blackboard records are ergonomic. Defining inner records per agent (LogSnapshot, MetricsData, DeployData) keeps the type graph readable and gives Embabel enough information to plan correctly without ambiguity.

Graceful degradation is the hardest part. When Prometheus, ArgoCD, or PagerDuty is unavailable, the agent must return empty defaults rather than crash. We fixed 7 resilience issues during testing, all of the same shape: an adapter throwing an exception instead of returning an empty result.
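The "return empty instead of throwing" shape can be captured in one small wrapper. This is a sketch of the pattern, not the project's adapter code: `Degrading` and `fetchOrDefault` are hypothetical names, and logging to stderr stands in for whatever logger the adapters actually use.

```java
import java.util.function.Supplier;

// Hypothetical wrapper: turns adapter failures into an empty default value.
final class Degrading {
    private Degrading() {}

    /**
     * Runs the fetch and returns its result, or `fallback` when the fetch
     * throws or returns null, so a missing data source cannot crash the caller.
     */
    static <T> T fetchOrDefault(String source, Supplier<T> fetch, T fallback) {
        try {
            T value = fetch.get();
            return value != null ? value : fallback;
        } catch (RuntimeException e) {
            System.err.println("WARN " + source + " unavailable, using empty default: " + e.getMessage());
            return fallback;
        }
    }
}
```

An adapter call would then look like `Degrading.fetchOrDefault("prometheus", () -> client.query(q), List.of())`, where `client` and `q` are placeholders. The trade-off is that downstream prompts need to distinguish "no data" from "healthy", so it is worth logging every fallback.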

Personas meaningfully change output quality. The difference between SENIOR_SRE (terse, actionable, MTTR-focused) and INCIDENT_COMMANDER (structured, blameless, time-bound action items) is significant. Worth spending time on persona wording.

Repos

OnCall Support Agent (the Embabel agent — this project):
https://github.com/Puneethkumarck/oncall-support-agent

Price Alert (the target microservice platform used for all validation testing — 5 Spring Boot services, Kafka, PostgreSQL, Prometheus, Loki, Tempo, Grafana, ArgoCD):
https://github.com/Puneethkumarck/price-alert

Stack: Java 25 · Spring Boot 4.0.3 · Embabel 0.3.4 · Claude Haiku 4.5 · Prometheus · Loki · Tempo · ArgoCD · PagerDuty

Happy to answer questions about the GOAP action design, the LLM JSON patterns, or the Spring Boot 4 workaround.

Puneethkumarck · 2026-03-13T07:20:15Z

@johnsonr FYI
