OpenA2A: CLI · HackMyAgent · Secretless · AIM · Browser Guard · DVAA
Note: OASB controls are also available in HackMyAgent v0.8.0+ via
`opena2a benchmark`. This repository is the canonical source for the full 222-test evaluation suite and is actively maintained. ARP (the reference adapter) is now part of HackMyAgent — install via `npm install arp-guard`.
MITRE ATT&CK Evaluations, but for AI agent security products.
222 standardized attack scenarios that evaluate whether a runtime security product can detect and respond to threats against AI agents. Each test is mapped to MITRE ATLAS and OWASP Agentic Top 10. Plug in your product, run the suite, get a detection coverage scorecard.
OASB Website | MITRE ATLAS Coverage
| Date | Change |
|---|---|
| 2026-04-02 | Scanner Benchmark v2: 4,245-sample corpus, 3 HMA adapter tiers (static/TME/pipeline), DVAA ground-truth comparison. TME v0.5.0 achieves 89.2% F1. Comparison with Holzbauer et al. (arXiv:2603.16572). |
| 2026-03-23 | arp-guard v0.3.0 — ARP now re-exports from HackMyAgent. Updated OASB to v0.3.0. All 222 tests pass. Updated Quick Start (no standalone ARP clone). |
| 2026-02-19 | Added 40 AI-layer test scenarios (AT-AI-001 through AT-AI-005) for prompt, MCP, and A2A scanning via ARP v0.2.0. Total tests: 222. |
| 2026-02-18 | Added integration tests for DVAA v0.4.0 MCP JSON-RPC and A2A endpoints. |
| 2026-02-09 | Initial release -- 182 attack scenarios across 10 MITRE ATLAS techniques. |
OASB evaluates security products, not agents. It answers: "does your runtime protection actually catch these attacks?"
| | OASB | HackMyAgent |
|---|---|---|
| Purpose | Evaluate security products | Pentest AI agents |
| Tests | "Does your EDR catch this exfiltration?" | "Is your agent leaking credentials?" |
| Audience | Security product vendors, evaluators | Agent developers, red teams |
| Analogous to | MITRE ATT&CK Evaluations | OWASP ZAP / Burp Suite |
| Method | Controlled lab — inject attacks, measure detection | Active scanning + adversarial payloads against live targets |
| Output | Detection coverage scorecard | Vulnerability report + auto-fix |
Use both together: HackMyAgent finds vulnerabilities in your agent, OASB proves your security product catches real attacks.
- Quick Start
- Usage via OpenA2A CLI
- What Gets Tested
- Test Categories
- Atomic Tests — 65 discrete detection tests (OS-level + AI-layer)
- Integration Tests — 8 multi-step attack chains
- Baseline Tests — 3 false positive validations
- E2E Tests — 6 real OS-level detection tests
- MITRE ATLAS Coverage
- Test Harness
- Skills Security Benchmark
- Known Detection Gaps
- License
Ships with ARP (arp-guard) as the reference adapter. To evaluate your own security product, implement the SecurityProductAdapter interface in src/harness/adapter.ts and run the same 222 tests.
git clone https://github.com/opena2a-org/oasb.git
cd oasb && npm install
`arp-guard` is an optional peer dependency. It is installed automatically for running the reference ARP evaluation; if you are implementing your own adapter, you do not need it.
npm test # Full evaluation (222 tests)
npm run test:atomic # 65 atomic tests (no external deps)
npm run test:integration # 8 integration scenarios
npm run test:baseline # 3 baseline tests
npx vitest run src/e2e/ # 6 E2E tests (real OS detection)

OASB is available as a built-in adapter in the OpenA2A CLI via the `benchmark` command. The CLI delegates to the `oasb` package using an import adapter, so no separate installation is needed if you already have the CLI installed.
`opena2a benchmark run`
Executes all 222 test scenarios (atomic, integration, baseline, and E2E) and produces a detection coverage scorecard.

`opena2a benchmark run --technique T0015`
Filters the benchmark to a single MITRE ATLAS technique ID (e.g., T0015 for Evasion). Useful for targeted evaluation of a specific detection capability.

`opena2a benchmark run --format json`
Outputs the compliance score and per-technique detection rates as JSON. Integrate this into CI pipelines to enforce minimum detection thresholds on every build.

`opena2a benchmark run --technique T0057 --format json`
Flags can be combined to run a single technique and produce JSON output for automated processing.
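Wired into CI, the JSON output can gate a build on a minimum score. A minimal sketch — the scorecard shape used here (`complianceScore`, `techniques`) is a hypothetical example, not the CLI's documented schema; adapt the field names to your actual output:

```typescript
// Sketch: enforce a minimum detection threshold on a scorecard.
// The Scorecard shape is an illustrative assumption, not the
// documented `opena2a benchmark` JSON schema.
interface Scorecard {
  complianceScore: number;            // overall detection rate, 0-1
  techniques: Record<string, number>; // per-technique detection rate
}

function meetsThreshold(card: Scorecard, min: number): boolean {
  // Fail the build if the overall score or any single technique falls below `min`.
  if (card.complianceScore < min) return false;
  return Object.values(card.techniques).every((rate) => rate >= min);
}

// Example: output captured from `opena2a benchmark run --format json`.
const card: Scorecard = {
  complianceScore: 0.91,
  techniques: { "AML.T0057": 0.95, "AML.T0015": 0.72 },
};

console.log(meetsThreshold(card, 0.8)); // false — AML.T0015 is below 0.8
```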
Each test simulates a specific attack technique and checks whether the security product under evaluation detects it, classifies it correctly, and responds appropriately.
| Category | Tests | What It Evaluates |
|---|---|---|
| Process detection | 25 | Child process spawns, suspicious binaries, privilege escalation, CPU anomalies |
| Network detection | 20 | Outbound connections, suspicious hosts, exfiltration, subdomain bypass |
| Filesystem detection | 28 | Sensitive path access, credential files, dotfile persistence, mass file DoS |
| Intelligence layers | 21 | Rule matching, anomaly scoring, LLM escalation, budget exhaustion |
| Enforcement actions | 18 | Logging, alerting, process pause (SIGSTOP), kill (SIGTERM/SIGKILL), resume |
| Multi-step attacks | 33 | Data exfiltration chains, MCP tool abuse, prompt injection, A2A trust exploitation |
| Baseline behavior | 13 | False positive rates, anomaly injection, baseline persistence |
| Real OS detection | 14 | Live filesystem watches, process polling, network monitoring |
| Application-level hooks | 14 | Pre-execution interception of spawn, connect, read/write |
| AI-layer scanning | 40 | Prompt injection/output, MCP tool call validation, A2A message scanning, pattern coverage |
| Total | 222 | 10 MITRE ATLAS techniques |
Discrete tests that exercise individual detection capabilities. Each test injects a single attack event and verifies the product detects it with the correct classification and severity.
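The inject-and-wait pattern looks roughly like this. A standalone sketch — not the harness's actual `event-collector.ts`, though it mirrors the `waitForEvent(predicate, timeout)` helper the harness provides:

```typescript
// Minimal sketch of the inject-and-wait pattern used by atomic tests.
// Standalone illustration only; not the harness implementation.
type SecurityEvent = { type: string; severity: string; detail: string };

class EventCollector {
  private events: SecurityEvent[] = [];
  private waiters: Array<{
    pred: (e: SecurityEvent) => boolean;
    resolve: (e: SecurityEvent) => void;
  }> = [];

  emit(event: SecurityEvent): void {
    this.events.push(event);
    // Wake any pending waiter whose predicate matches this event.
    this.waiters = this.waiters.filter((w) => {
      if (w.pred(event)) { w.resolve(event); return false; }
      return true;
    });
  }

  waitForEvent(pred: (e: SecurityEvent) => boolean, timeoutMs: number): Promise<SecurityEvent> {
    const hit = this.events.find(pred);
    if (hit) return Promise.resolve(hit);
    return new Promise((resolve, reject) => {
      const timer = setTimeout(() => reject(new Error("timeout: no matching event")), timeoutMs);
      this.waiters.push({ pred, resolve: (e) => { clearTimeout(timer); resolve(e); } });
    });
  }
}

// Atomic-test shape: inject one attack event, assert correct classification.
async function demo(): Promise<void> {
  const collector = new EventCollector();
  const detection = collector.waitForEvent((e) => e.type === "process.spawn", 1000);
  collector.emit({ type: "process.spawn", severity: "high", detail: "curl" }); // simulated detection
  const event = await detection;
  console.log(event.severity); // "high"
}
demo();
```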
AI-Layer Scanning — 5 files (40 tests)
| Test | What the Product Should Detect |
|---|---|
| AT-AI-001 | Prompt input scanning — PI, JB, DE, CM pattern detection (11 tests) |
| AT-AI-002 | Prompt output scanning — OL pattern detection, data leak prevention (6 tests) |
| AT-AI-003 | MCP tool call scanning — path traversal, command injection, SSRF, allowlist (11 tests) |
| AT-AI-004 | A2A message scanning — identity spoofing, delegation abuse, trust validation (7 tests) |
| AT-AI-005 | Pattern coverage — all 19 patterns detect known payloads, no false positives (5 tests) |
Process Detection — 5 files
| Test | ATLAS | What the Product Should Detect |
|---|---|---|
| AT-PROC-001 | AML.T0046 | Child process spawn |
| AT-PROC-002 | AML.T0046 | Suspicious binary execution (curl, wget, nc) |
| AT-PROC-003 | AML.T0029 | High CPU anomaly |
| AT-PROC-004 | AML.T0046 | Privilege escalation (root user) |
| AT-PROC-005 | AML.TA0006 | Process termination |
Network Detection — 5 files
| Test | ATLAS | What the Product Should Detect |
|---|---|---|
| AT-NET-001 | AML.T0024 | New outbound connection |
| AT-NET-002 | AML.T0057 | Connection to suspicious host (webhook.site, ngrok) |
| AT-NET-003 | AML.T0029 | Connection burst |
| AT-NET-004 | AML.T0024 | Subdomain bypass of allowlist |
| AT-NET-005 | AML.T0057 | Exfiltration destination |
Filesystem Detection — 5 files
| Test | ATLAS | What the Product Should Detect |
|---|---|---|
| AT-FS-001 | AML.T0057 | Sensitive path access (.ssh, .aws, .gnupg) |
| AT-FS-002 | AML.T0046 | Access outside allowed paths |
| AT-FS-003 | AML.T0057 | Credential file access (.npmrc, .pypirc, .netrc) |
| AT-FS-004 | AML.T0029 | Mass file creation (DoS) |
| AT-FS-005 | AML.T0018 | Shell config modification (.bashrc, .zshrc) |
Intelligence — 5 files
| Test | ATLAS | What the Product Should Do |
|---|---|---|
| AT-INT-001 | AML.T0054 | Match rules and trigger enforcement |
| AT-INT-002 | AML.T0015 | Score statistical anomalies (z-score) |
| AT-INT-003 | AML.T0054 | Escalate to LLM-assisted assessment |
| AT-INT-004 | AML.T0029 | Handle budget exhaustion gracefully |
| AT-INT-005 | AML.T0015 | Learn and reset behavioral baselines |
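The z-score anomaly idea behind AT-INT-002 can be sketched in a few lines (illustrative only; ARP's actual scorer may differ in windowing and thresholds):

```typescript
// Sketch of z-score anomaly scoring: how far does a new observation
// deviate from the learned baseline, in standard deviations?
function zScore(value: number, baseline: number[]): number {
  const mean = baseline.reduce((a, b) => a + b, 0) / baseline.length;
  const variance = baseline.reduce((a, b) => a + (b - mean) ** 2, 0) / baseline.length;
  const std = Math.sqrt(variance);
  if (std === 0) return value === mean ? 0 : Infinity;
  return Math.abs(value - mean) / std;
}

// Baseline: ~2 outbound connections per minute. A burst of 40 is anomalous.
const connBaseline = [1, 2, 2, 3, 2, 1, 2, 3];
console.log(zScore(2, connBaseline) < 3);  // true — normal traffic
console.log(zScore(40, connBaseline) > 3); // true — flagged as a burst
```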
Enforcement — 5 files
| Test | ATLAS | What the Product Should Do |
|---|---|---|
| AT-ENF-001 | AML.TA0006 | Execute log action |
| AT-ENF-002 | AML.TA0006 | Fire alert callback |
| AT-ENF-003 | AML.TA0006 | Pause process (SIGSTOP) |
| AT-ENF-004 | AML.TA0006 | Kill process (SIGTERM/SIGKILL) |
| AT-ENF-005 | AML.TA0006 | Resume paused process (SIGCONT) |
Multi-step attack chains that combine multiple techniques. Tests whether the product can detect coordinated attacks, not just isolated events. Optionally validates against live DVAA agents.
| Test | ATLAS | Attack Chain |
|---|---|---|
| INT-001 | AML.T0057 | Data exfiltration: internal contact lookup → credential harvest → webhook.site POST |
| INT-002 | AML.T0056 | MCP tool abuse: path traversal + command injection via tool arguments |
| INT-003 | AML.T0051 | Prompt injection: establish baseline → inject malicious prompt → measure detection |
| INT-004 | AML.T0024 | A2A trust exploitation: spoofed agent identity → unauthorized data access |
| INT-005 | AML.T0015 | Evasion: 5 minutes normal traffic → sudden attack burst → verify anomaly detection |
| INT-006 | AML.T0046 | Multi-monitor correlation: single attack triggers process + network + filesystem events |
| INT-007 | AML.T0029 | Budget exhaustion: noise flood drains LLM budget → real attack goes unanalyzed |
| INT-008 | AML.TA0006 | Kill switch: critical threat → product kills agent → verify death → recovery |
Every security product must avoid false positives. These tests verify the product stays quiet during normal operations.
| Test | What It Proves |
|---|---|
| BL-001 | Zero false positives from normal agent activity |
| BL-002 | Controlled anomaly injection triggers detection (not silent) |
| BL-003 | Baseline persistence across product restarts |
Real OS-level detection — no mocks, no event injection. These tests spawn real processes, open real connections, and write real files, then verify the product detects them.
Live Monitors — OS-level polling
| Test | Latency | What the Product Should Detect |
|---|---|---|
| E2E-001 | ~200ms | fs.watch detects .env, .ssh, .bashrc, .npmrc writes |
| E2E-002 | ~1000ms | ps polling detects child processes, suspicious binaries |
| E2E-003 | ~1000ms | lsof detects outbound TCP (skips if unavailable) |
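The live-monitor approach (E2E-001) can be sketched with plain `fs.watch`. Illustrative only — `fs.watch` semantics vary by platform, and this is not ARP's monitor implementation:

```typescript
// Sketch: watch a directory and flag writes to sensitive filenames
// (the E2E-001 pattern). fs.watch behavior is platform-dependent;
// the `filename` argument may be unavailable on some systems.
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

const SENSITIVE = new Set([".env", ".npmrc", ".bashrc"]);

function watchForSensitiveWrites(dir: string, onHit: (file: string) => void): fs.FSWatcher {
  return fs.watch(dir, (_event, filename) => {
    if (filename && SENSITIVE.has(filename.toString())) onHit(filename.toString());
  });
}

const dir = fs.mkdtempSync(path.join(os.tmpdir(), "oasb-demo-"));
const hits: string[] = [];
const watcher = watchForSensitiveWrites(dir, (f) => hits.push(f));

fs.writeFileSync(path.join(dir, ".env"), "SECRET=1");

// fs.watch delivers events asynchronously; give the OS a moment.
setTimeout(() => {
  watcher.close();
  console.log(hits.includes(".env")); // expected true on most platforms
}, 300);
```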
Interceptors — application-level hooks
| Test | Latency | What the Product Should Intercept |
|---|---|---|
| E2E-004 | <1ms | child_process.spawn/exec intercepted before execution |
| E2E-005 | <1ms | net.Socket.connect intercepted before connection |
| E2E-006 | <1ms | fs.writeFileSync/readFileSync intercepted before I/O |
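The pre-execution hook pattern (E2E-004) amounts to patching the Node API before agent code runs. A minimal sketch, not ARP's actual interceptor:

```typescript
// Sketch of an application-level spawn hook. Illustration only;
// ARP's real interceptors are more involved.
import cp from "node:child_process";

const originalSpawn = cp.spawn;
const SUSPICIOUS = new Set(["curl", "wget", "nc"]);
const blocked: string[] = [];

// Replace spawn so every call is inspected before a process is created.
cp.spawn = ((command: string, ...rest: any[]) => {
  if (SUSPICIOUS.has(command)) {
    blocked.push(command);
    throw new Error(`blocked suspicious binary: ${command}`);
  }
  return (originalSpawn as any)(command, ...rest);
}) as typeof cp.spawn;

try {
  cp.spawn("curl", ["https://webhook.site/x"]);
} catch {
  // Interception fired before any child process existed.
}
console.log(blocked); // [ 'curl' ]
```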
10 unique techniques across 47 test files:
| Technique | ID | Tests |
|---|---|---|
| Unsafe ML Inference | AML.T0046 | AT-PROC-001/002/004, AT-FS-002, INT-006, E2E-002/004 |
| Data Leakage | AML.T0057 | AT-NET-002/005, AT-FS-001/003, INT-001, E2E-001/006 |
| Exfiltration | AML.T0024 | AT-NET-001/004, INT-004, E2E-003/005 |
| Persistence | AML.T0018 | AT-FS-005, E2E-001/006 |
| Denial of Service | AML.T0029 | AT-PROC-003, AT-NET-003, AT-INT-004, INT-007 |
| Evasion | AML.T0015 | AT-INT-002/005, INT-005, BL-002/003 |
| Jailbreak | AML.T0054 | AT-INT-001/003 |
| MCP Compromise | AML.T0056 | INT-002 |
| Prompt Injection | AML.T0051 | INT-003 |
| Defense Response | AML.TA0006 | AT-ENF-001-005, AT-PROC-005, INT-008 |
The harness wraps a security product via an adapter interface and provides event collection, injection, and metrics.
| File | Purpose |
|---|---|
| `adapter.ts` | Product-agnostic adapter interface — implement `SecurityProductAdapter` for your product |
| `arp-wrapper.ts` | Reference adapter — wraps ARP (`arp-guard`) with event collection, injection helpers |
| `event-collector.ts` | Captures events with async `waitForEvent(predicate, timeout)` |
| `mock-llm-adapter.ts` | Deterministic LLM for intelligence layer testing (pattern-based responses) |
| `dvaa-client.ts` | HTTP client for DVAA vulnerable agent endpoints |
| `dvaa-manager.ts` | DVAA process lifecycle (spawn, health check, teardown) |
| `metrics.ts` | Detection rate, false positive rate, P95 latency computation |
To evaluate your own product: implement SecurityProductAdapter from src/harness/adapter.ts, swap it into the test harness, and run the full suite. The interface defines event types, scanner interfaces, and enforcement contracts — no dependency on any specific product.
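For orientation, a hypothetical adapter might look like the sketch below. The method and field names here are assumptions, not the real `SecurityProductAdapter` definition — consult `src/harness/adapter.ts` for the actual interface:

```typescript
// Hypothetical adapter sketch. Names are illustrative only; see
// src/harness/adapter.ts for the real SecurityProductAdapter.
type Severity = "low" | "medium" | "high" | "critical";

interface DetectedEvent {
  type: string;      // e.g. "process.spawn", "net.connect", "fs.read"
  severity: Severity;
  timestamp: number;
}

interface MyAdapter {
  start(): Promise<void>;                            // boot the product under test
  stop(): Promise<void>;                             // tear it down between tests
  onDetection(cb: (e: DetectedEvent) => void): void; // stream detections to the harness
}

// Minimal in-memory stand-in, enough to exercise the harness shape.
class NoopAdapter implements MyAdapter {
  private cb: ((e: DetectedEvent) => void) | null = null;
  async start(): Promise<void> {}
  async stop(): Promise<void> {}
  onDetection(cb: (e: DetectedEvent) => void): void { this.cb = cb; }
  // Test helper: simulate the product reporting a detection.
  report(e: DetectedEvent): void { this.cb?.(e); }
}

const adapter = new NoopAdapter();
const seen: DetectedEvent[] = [];
adapter.onDetection((e) => seen.push(e));
adapter.report({ type: "fs.read", severity: "critical", timestamp: Date.now() });
console.log(seen.length); // 1
```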
A dedicated scoring engine for evaluating the security posture of AI agent skills (tool-use capabilities). Covers 9 attack categories targeting skill invocation, parameter validation, output handling, and inter-skill trust boundaries.
| Category | Focus |
|---|---|
| Parameter injection | Malicious input via skill arguments |
| Output manipulation | Tampered or poisoned skill outputs |
| Privilege escalation | Skills accessing resources beyond their scope |
| Cross-skill trust abuse | One skill exploiting trust granted to another |
| Data exfiltration via skills | Skills used as exfiltration channels |
| Denial of service | Resource exhaustion through skill invocation |
| Skill impersonation | Spoofed skill identity in multi-agent flows |
| Configuration tampering | Modified skill manifests or permissions |
| Supply chain compromise | Malicious skill packages or dependencies |
| Control | Requirement |
|---|---|
| SS-01 | Skill argument validation and sanitization |
| SS-02 | Output integrity verification |
| SS-03 | Least-privilege scope enforcement |
| SS-04 | Inter-skill authentication |
| SS-05 | Invocation rate limiting |
| SS-06 | Skill manifest integrity (signed, versioned) |
| SS-07 | Runtime permission boundary enforcement |
| SS-08 | Audit logging of all skill invocations |
| SS-09 | Dependency provenance verification |
| SS-10 | Graceful degradation on skill failure |
| Level | Name | Requirements |
|---|---|---|
| L1 | Basic | SS-01 through SS-04 pass |
| L2 | Standard | L1 + SS-05 through SS-08 pass |
| L3 | Advanced | L2 + SS-09 and SS-10 pass, all 9 attack categories covered |
Products achieving full coverage receive a tier designation:
| Tier | Criteria |
|---|---|
| Platinum | L3 compliance, all 9 attack categories detected, zero false positives in baseline |
| Gold | L2 compliance, 7+ attack categories detected |
| Silver | L1 compliance, 4+ attack categories detected |
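The tier criteria above transcribe directly into a function (a sketch; the real scoring engine may compute levels and category counts differently):

```typescript
// Sketch: map compliance level + detected attack categories to a tier,
// following the criteria table. Not the scoring engine's actual code.
type Level = "L1" | "L2" | "L3" | "none";
type Tier = "Platinum" | "Gold" | "Silver" | "none";

function tierFor(level: Level, categoriesDetected: number, baselineFalsePositives: number): Tier {
  if (level === "L3" && categoriesDetected === 9 && baselineFalsePositives === 0) return "Platinum";
  if ((level === "L2" || level === "L3") && categoriesDetected >= 7) return "Gold";
  if (level !== "none" && categoriesDetected >= 4) return "Silver";
  return "none";
}

console.log(tierFor("L3", 9, 0)); // "Platinum"
console.log(tierFor("L2", 8, 1)); // "Gold"
console.log(tierFor("L1", 5, 0)); // "Silver"
```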
4,245 ground-truth labeled samples for scanner evaluation:
| | Count | Description |
|---|---|---|
| Malicious | 270 | 30 per attack category (9 categories) from DVAA, ARIA, HMA payloads, expert review |
| Benign | 3,881 | Real skills from registry, open-source repos, well-governed configs |
| Edge cases | 94 | Security tools, defensive governance, broad-permission configs |
npx tsx scripts/run-benchmark-v2.ts --categorized-only # Full corpus, all adapters
npx tsx scripts/run-benchmark-v2.ts --categorized-only --limit=100 # Quick test with 100 samples
npx tsx scripts/run-dvaa-benchmark.ts # DVAA ground-truth comparison (70 scenarios)

Comparison of HMA scanner tiers on 4,245 labeled samples:
| Scanner | F1 | Precision | Recall | FPR | Flag Rate |
|---|---|---|---|---|---|
| HMA Static (regex only) | 67.5% | 99.3% | 51.1% | 0.03% | 3.6% |
| NanoMind TME v0.5.0 (model only) | 89.2% | 88.4% | 90.0% | 0.82% | 6.9% |
| HMA Full Pipeline (AST + NanoMind) | 81.3% | 68.5% | 100.0% | 3.20% | 10.3% |
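F1 is the harmonic mean of precision and recall, so the table's numbers can be sanity-checked directly:

```typescript
// F1 = harmonic mean of precision and recall (values as fractions).
function f1(precision: number, recall: number): number {
  return (2 * precision * recall) / (precision + recall);
}

console.log((f1(0.993, 0.511) * 100).toFixed(1)); // "67.5" — HMA Static
console.log((f1(0.884, 0.900) * 100).toFixed(1)); // "89.2" — TME v0.5.0
console.log((f1(0.685, 1.000) * 100).toFixed(1)); // "81.3" — HMA Full Pipeline
```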
DVAA controlled comparison: 61/70 scenarios detected (87.1%).
See BENCHMARK-RESULTS.md for full per-category breakdown and comparison with Holzbauer et al. (arXiv:2603.16572).
OASB documents what the reference product (ARP) does and doesn't catch. Other products may have different gap profiles — that's the point of running the benchmark.
| Gap | Severity | Test | Notes |
|---|---|---|---|
| Anomaly baselines not persisted across restarts | Medium | BL-003 | In-memory only; restarts lose learned behavior |
| No connection rate anomaly detection | Medium | AT-NET-003 | Network monitor tracks hosts, not burst rates |
| No HTTP response body monitoring | Low | INT-003 | AI-layer output scanning (PromptInterceptor.scanOutput) covers LLM responses; raw HTTP responses not inspected |
| No cross-monitor event correlation | Architectural | INT-006 | EventEngine is a flat bus; no attack-chain aggregation |
Apache-2.0
| Project | Description | Install |
|---|---|---|
| AIM | Agent Identity Management -- identity and access control for AI agents | npm install @opena2a/aim-core |
| HackMyAgent | Security scanner -- 204 checks, attack mode, auto-fix | npx hackmyagent secure |
| ARP | Agent Runtime Protection -- process, network, filesystem, AI-layer monitoring | npm install arp-guard |
| Secretless AI | Keep credentials out of AI context windows | npx secretless-ai init |
| DVAA | Damn Vulnerable AI Agent -- security training and red-teaming | docker pull opena2a/dvaa |
