M5 Bounded Subprocess Dry-Run — turn M1's static suspicions into confirmed runtime failures, or retire them as false positives. Python stdlib only. Zero deps.
Static analyzers over-report. A div-zero flag on sum(xs) / len(xs) is technically correct and operationally noise: is xs ever empty at this call site? Six existing reviewers will tell you "maybe." None of them will run the function with xs=[] and show you the ZeroDivisionError.
Dynamic confirmation usually requires root, a Docker daemon, a cloud sandbox, or a fixture corpus the team has to maintain. Those costs are why nobody ships confirmation at plugin weight.
M5 is the middle ground. A ~60-line resource-capped subprocess fence runs the flagged function with one synthesized boundary witness. Either the bug surfaces with a real traceback, or it doesn't. No mock. No fixture corpus. The developer's repo already has the code; M5 only adds the fence.
Confirmed bugs are facts, not probabilities. A confirmed ZeroDivisionError at flag_ref.function with witness {"args":[[]]} is a hard FAIL in the verdict composer (CLAUDE.md § Behavioral contract 5) — never averaged with rubric scores.
M1 flag -> witness synth -> RLIMIT + alarm subprocess -> traceback match -> run-log record
(static) (boundary) (6 caps, scrubbed env) (flag_class (confirmed-bug /
correspondence) timeout / no-bug)
Invariants:
- One subprocess per witness. No shared state across runs.
- Caps installed in
preexec_fnbetweenfork()andexec(). - Per-run
tempfile.mkdtemp()cwd, deleted on exit. - Scrubbed env: no proxies, pinned
PATH=/usr/bin:/bin, UTF-8 locale only. - 1 MB per-stream cap on child stdout/stderr; truncation, not fail.
- Parent fences the child with
timeout=SIGNAL_ALARM_SEC + 5as backstop.
Values are literal from scripts/limits.py. These are the ACE-risk mitigation for executing developer code on every PR.
| Cap | Value | Prevents |
|---|---|---|
RLIMIT_CPU |
5 s | CPU-bound infinite loops (BFS explosion, cryptographic accident) |
RLIMIT_AS |
512 MB | OOM from unbounded list-build, x * 10**9 accidental allocation |
RLIMIT_NOFILE |
16 | File-descriptor exhaustion (loop opening files, leaked sockets) |
RLIMIT_FSIZE |
10 MB | Disk-fill DoS (log spam, accidental dd-to-file) |
RLIMIT_NPROC |
0 | Fork bombs — subprocess.Popen loop, multiprocessing.Pool |
signal.alarm |
10 s | Blocked I/O (socket read on dead connection) RLIMIT_CPU cannot see |
Relaxing any value converts the sandbox into arbitrary-code-execution-on-every-PR. The contract requires a documented security review before the number changes. See CLAUDE.md § Behavioral contract 2.
Soft and hard limits are set identical — no headroom for escalation inside the child.
Every run records exactly one of six statuses to state/run-log.jsonl. No binary success/fail collapse.
| Status | Meaning |
|---|---|
confirmed-bug |
Witness reproduced the exact flag_class-matching exception. Hard FAIL trigger. |
timeout-without-confirmation |
SIGALRM fired before the bug surfaced. HOLD trigger — could be unreachable, or a separate hang. |
no-bug-found |
Child exited 0, or exited with an unrelated exception. Static flag stands; no dynamic confirmation. |
input-synthesis-failed |
No type-valid witness could be constructed. Reported honestly; not a pass. |
sandbox-error |
Infra failure (SIGKILL from AS/CPU cap, SIGXFSZ, runner exception). Not a review finding. |
platform-unsupported |
Host lacks resource and no WSL bridge available. Verdict notes M5 did not run. |
The flag_class -> expected exception correspondence is narrow on purpose: a ZeroDivisionError only confirms a div-zero flag, not an index-oob one. If the child raises something unrelated, v1 returns no-bug-found and lets M1 reflag on the next pass under the correct rule.
Buggy target (repo/stats.py):
def average(nums):
return sum(nums) / len(nums)M1 flags line 2 with flag_class="div-zero" and witness_hints={"divisor_name": "nums"}. Witness synthesis produces {"args": [[]]} as the boundary input. M5 runs the child:
$ python plugins/lich-sandbox/scripts/sandbox.py
{"confirmed": 1, "timeout": 0, "sandbox_error": 0, "no_bug": 0, "synth_failed": 0, "unsupported": 0}The run-log record:
{
"ts": "2026-04-20T14:23:11.482+00:00",
"flag_ref": {"file": "repo/stats.py", "function": "average", "flag_class": "div-zero", "line": 2},
"witness": {"args": [[]]},
"status": "confirmed-bug",
"exit_code": 1,
"signal": null,
"error_class": "ZeroDivisionError",
"traceback_head": "Traceback (most recent call last):\n File \"_sandbox_child.py\", ...\nZeroDivisionError: division by zero",
"duration_ms": 47,
"backend": "posix"
}That record is a fact, not a score. The verdict composer reads it and emits FAIL. The developer sees the exact input that breaks the function, not a ruling about severity.
Seed flags from M1:
python plugins/lich-core/scripts/__main__.py <target_file_or_dir>
# writes plugins/lich-core/state/review-flags.jsonlConfirm with M5:
python plugins/lich-sandbox/scripts/sandbox.py
# reads the default input path, writes plugins/lich-sandbox/state/run-log.jsonl
# prints JSON summary: {"confirmed": N, "timeout": N, ...}Custom paths:
python plugins/lich-sandbox/scripts/sandbox.py my-flags.jsonl my-log.jsonlInstall via the marketplace:
/plugin install lich-sandbox@lich
# or the bundle: /plugin install full@lichlich-core must produce the flags file; install both, or use full.
- Linux / macOS / native POSIX: full sandbox. All 6 caps +
signal.alarmactive. - Windows with WSL: bridged via
scripts/bridge/wsl.py. Child runs underwsl.exe -e env -i python3; Windows paths translate to/mnt/<letter>/...; same 6 caps apply inside WSL. Signal detection is stderr-marker based (Alarm clock,Killed,File size limit exceeded) rather than negative returncode. - Windows without WSL:
platform-unsupportedemitted per flag. Not silent; the verdict notes M5 did not run and falls back to M1-only judgment with reduced confidence. - Node / TypeScript runner (
scripts/runners/node.py): weaker sandbox. Heap cap via--max-old-space-size=512and a 10 s parent timeout, but no equivalents forRLIMIT_NOFILE,RLIMIT_FSIZE,RLIMIT_NPROC. Documented honestly on the tin..ts/.tsxtargets requiretsxorts-nodein the environment; absent that, the runner returnsinput-synthesis-failedrather than pretending it transpiled.
No existing reviewer ships static-suspicion -> sandboxed-confirmation at zero-dep plugin weight. Semgrep and ESLint are static-only. CodeQL needs a build extraction step. Infer needs Clang and a compile database. Pyre is type-checking, not runtime. Every alternative either stops at the flag (static tools) or demands infrastructure the reviewer does not own (Docker, cloud sandbox, CI runner). M5 runs in a preexec_fn between fork and exec, on the developer's machine, with six stdlib caps and no config — and that contract, "confirmed bugs are facts, not probabilities," comes from actually running the code, not from scoring it.
M5 does not:
- Scan for CWEs. That is Hydra's R3 lane — 98 CWEs across 2,011 patterns. M5 boosts M6 attention weight on Hydra-flagged files but never reclassifies.
- Classify changes. That is Crow's V1/V2 lane. M5 consumes Crow's trust score into the M6 prior.
- Emit verdicts. That is
lich-verdict's lane. M5 writesrun-log.jsonlrecords; the verdict composer reads them. - Persist witness outputs. The per-run tempdir is ephemeral. Only the run-log record survives.
Breaking the split fractures severity source-of-truth across plugins.
| File | Schema keys | Purpose |
|---|---|---|
state/run-log.jsonl |
ts, flag_ref, witness, status, exit_code, signal, error_class, traceback_head, duration_ms, backend |
Append-only record, one line per witness run. |
traceback_head is bounded at 1000 chars. backend is posix / wsl / unsupported. The record shown in the div-zero example above is the canonical shape.
This plugin executes developer code. The six resource caps are the only thing between a PR's payload and the host. Any cap relaxation requires a documented security review (CLAUDE.md § Behavioral contract 2).
- Sandbox demo walkthrough
- lich-core README — M1 static pass that feeds M5
- lich-verdict README — composes M1 + M5 + M6 + M7 into DEPLOY / HOLD / FAIL