Run `mcp-compliance benchmark` against a fast stdio fixture: the whole 88-test suite completes in ~3 seconds. Rough breakdown:
| Phase | Cost | Notes |
|---|---|---|
| Transport spawn / TCP handshake | 50–150 ms | Once per run |
| Lifecycle handshake (initialize + notifications/initialized) | ~50 ms | Sequential — can't parallelize; must happen first |
| Main test loop (82 independent tests × ~25 ms each) | ~2000 ms | This is the bulk of the runtime |
| Cleanup / close | 50–200 ms | Sequential for correctness |
Takeaway: the ~2 s of main-loop work is what parallel execution would address. Cut it to ~500 ms if we ran 4 tests in flight simultaneously. For a CI job that already takes minutes, the savings are noise. For a human dev loop (watch mode), it's the difference between "snappy" and "there's noticeable latency."
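The arithmetic above can be sketched as a back-of-the-envelope model. The helper below is hypothetical (not part of the tool); it just combines the table's numbers, using midpoints for the ranged costs:

```typescript
// Hypothetical model of total benchmark runtime under a pool of N workers.
// Only the 82 independent main-loop tests parallelize; spawn, handshake,
// and cleanup are sequential by construction.
function estimateRuntimeMs(concurrency: number): number {
  const spawnMs = 100; // transport spawn / TCP handshake (midpoint of 50–150 ms)
  const handshakeMs = 50; // initialize + notifications/initialized
  const cleanupMs = 125; // close (midpoint of 50–200 ms)
  const tests = 82;
  const perTestMs = 25;
  // Main loop shrinks with concurrency; everything else is fixed cost.
  const mainLoopMs = Math.ceil(tests / concurrency) * perTestMs;
  return spawnMs + handshakeMs + mainLoopMs + cleanupMs;
}

console.log(estimateRuntimeMs(1)); // serial: 2325 ms
console.log(estimateRuntimeMs(4)); // 4 in flight: 800 ms
```

Note the fixed ~275 ms floor: past 4-way concurrency, the sequential phases dominate and further workers buy little.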
Four design hazards, any one of which can silently corrupt results:

- **Session state mutation.** `lifecycle-init` stores the `Mcp-Session-Id` returned by the server, and subsequent tests inject it. If init hasn't finished when `tools/list` fires, `tools/list` goes out with a null session and fails. The test runner currently avoids this by serializing — you can't race something that doesn't start until the prior await resolves. Parallel execution needs an explicit phase barrier.
- **Capability detection.** `hasTools`/`hasResources`/`hasPrompts` are read by later tests to decide whether to skip. They're set during init, but a naive `Promise.all` over "all post-init tests" might capture the closure before init's side effects land. TDZ-looking errors would result. We already fixed this once (hoisting the `const` declarations before the tests that read them); parallelization reintroduces similar ordering hazards in ways that only surface under load.
- **Caches.** `cachedToolsList` is populated by `tools-list` and reused by `tools-call`/`tools-schema`/`security-command-injection` (etc.). Under parallel execution, two of those tests could race before the cache fills, each making its own `tools/list` round-trip. Correct, but it defeats the optimization. Worse: if the server returns slightly different `tools` arrays on consecutive calls (some do — they sort by most-recently-used), the tests see inconsistent data.
- **Stdio server assumptions.** The MCP spec allows concurrent in-flight requests on a single session, but not every server implements it correctly. We've seen reference servers (and several third-party ones) that assume single-threaded access and deadlock when multiple `tools/call` requests fire simultaneously. A compliance tool that triggers server bugs to run faster is a compliance tool that doesn't get used.
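A minimal sketch of mitigations for the first and third hazards — all names here (`SessionState`, `completeInit`, `getCachedToolsList`) are illustrative, not the runner's actual internals. Tests park on an explicit init barrier instead of relying on implicit serialization, and cache reads fail fast rather than silently re-fetching:

```typescript
// Sketch: explicit phase barrier + fail-fast cache accessor (names assumed).
class SessionState {
  private initDone!: () => void;
  // Tests await this promise instead of assuming init ran earlier.
  readonly initialized = new Promise<void>((res) => (this.initDone = res));
  sessionId: string | null = null;
  cachedToolsList: unknown[] | null = null;

  completeInit(sessionId: string, tools: unknown[]): void {
    this.sessionId = sessionId;
    this.cachedToolsList = tools;
    this.initDone(); // releases every test parked on the barrier
  }

  // Fail fast with an ordering-violation error instead of silently
  // issuing a fresh tools/list round-trip.
  getCachedToolsList(): unknown[] {
    if (this.cachedToolsList === null) {
      throw new Error("ordering violation: tools/list cache read before population");
    }
    return this.cachedToolsList;
  }
}

async function runTest(state: SessionState, body: (s: SessionState) => void) {
  await state.initialized; // phase barrier: no test starts before init lands
  body(state);
}
```

The barrier makes the ordering dependency explicit, so it survives any reshuffling of test start order.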
When we take this on, the shape is:

- Add a `parallelSafe: boolean` flag to `TestDefinition`. Default `false`. Tests that demonstrably don't touch cached state or require ordering get `true`. Audit per-test.
- Split the runner into three phases:
  - Setup (sequential): preflight, initialize, notifications/initialized, cache population
  - Parallel (`Promise.all` with a pool of N): all `parallelSafe: true` tests
  - Sequential (in order): everything else, same as today
- Add a `--concurrency N` flag (default 1, matching today's behavior). Raising it opts into the parallel phase.
- Instrument with tracing: if a test relies on a cache that wasn't populated, it should fail fast with a clear "ordering violation" message — not silently make its own round-trip.
- Document known-incompatible servers as a table in `docs/PERFORMANCE.md`. Expect ~10% of servers to fail under concurrency because they don't handle it; users who hit that keep `--concurrency 1`.
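The three-phase split could be sketched like this — the `TestDefinition` shape and pool helper are assumptions for illustration, not the runner's real signatures:

```typescript
// Sketch of the three-phase runner. parallelSafe defaults to false,
// so unaudited tests stay in the sequential phase.
interface TestDefinition {
  name: string;
  parallelSafe?: boolean; // opt-in, per the per-test audit
  run: () => Promise<void>;
}

// Bounded pool: at most `concurrency` tests in flight at once.
async function runPool(tests: TestDefinition[], concurrency: number): Promise<void> {
  const queue = [...tests];
  const worker = async () => {
    for (let t = queue.shift(); t; t = queue.shift()) await t.run();
  };
  await Promise.all(Array.from({ length: concurrency }, worker));
}

async function runSuite(tests: TestDefinition[], concurrency: number): Promise<void> {
  // Phase 1 (setup) — preflight, initialize, cache population — is assumed
  // to have run sequentially before this function is called.
  const parallel = tests.filter((t) => t.parallelSafe === true);
  const sequential = tests.filter((t) => t.parallelSafe !== true);

  await runPool(parallel, concurrency); // Phase 2: parallelSafe tests
  for (const t of sequential) await t.run(); // Phase 3: in order, same as today
}
```

With `--concurrency 1` the pool degenerates to a single worker, so the default preserves today's fully serial behavior.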
- Use `--watch` for dev-loop speed — re-runs are fast because the child process stays warm between edits.
- Use `--only <category>` to test just the section you're working on.
- `mcp-compliance benchmark --concurrency 4` gives you pure pressure/throughput numbers without worrying about compliance semantics.
- For CI, the full suite at 3 s is already faster than most test suites — probably not worth optimizing first.
Tracking issue: YawLabs/mcp-compliance#TBD.