
lich-sandbox

M5 Bounded Subprocess Dry-Run — turn M1's static suspicions into confirmed runtime failures, or retire them as false positives. Python stdlib only. Zero deps.

Why this exists

Static analyzers over-report. A div-zero flag on sum(xs) / len(xs) is technically correct and operationally noise: is xs ever empty at this call site? Six existing reviewers will tell you "maybe." None of them will run the function with xs=[] and show you the ZeroDivisionError.

Dynamic confirmation usually requires root, a Docker daemon, a cloud sandbox, or a fixture corpus the team has to maintain. Those costs are why nobody ships confirmation at plugin weight.

M5 is the middle ground. A ~60-line resource-capped subprocess fence runs the flagged function with one synthesized boundary witness. Either the bug surfaces with a real traceback, or it doesn't. No mock. No fixture corpus. The developer's repo already has the code; M5 only adds the fence.

Confirmed bugs are facts, not probabilities. A confirmed ZeroDivisionError at flag_ref.function with witness {"args":[[]]} is a hard FAIL in the verdict composer (CLAUDE.md § Behavioral contract 5) — never averaged with rubric scores.

The 60-second mental model

  M1 flag  ->  witness synth  ->  RLIMIT + alarm subprocess  ->  traceback match  ->  run-log record
 (static)       (boundary)        (6 caps, scrubbed env)        (flag_class        (confirmed-bug /
                                                                  correspondence)    timeout / no-bug)

Invariants:

One subprocess per witness. No shared state across runs.
Caps installed in preexec_fn between fork() and exec().
Per-run tempfile.mkdtemp() cwd, deleted on exit.
Scrubbed env: no proxies, pinned PATH=/usr/bin:/bin, UTF-8 locale only.
1 MB per-stream cap on child stdout/stderr; output past the cap is truncated, not treated as a failure.
Parent fences the child with timeout=SIGNAL_ALARM_SEC + 5 as backstop.
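The invariants above can be sketched with `subprocess.run`. This is a minimal illustration, not the contents of scripts/sandbox.py: `run_fenced`, `scrubbed_env`, and `STREAM_CAP_BYTES` are hypothetical names, and the locale variables are an assumption about what "UTF-8 locale only" means in practice.

```python
import os
import shutil
import subprocess
import sys
import tempfile

SIGNAL_ALARM_SEC = 10            # alarm value from the caps table
STREAM_CAP_BYTES = 1024 * 1024   # 1 MB kept per stream (hypothetical constant name)


def scrubbed_env():
    """Build the child's environment from scratch so nothing is inherited."""
    return {"PATH": "/usr/bin:/bin", "LANG": "C.UTF-8", "LC_ALL": "C.UTF-8"}


def run_fenced(argv, preexec_fn=None):
    """Run one child in a throwaway cwd; return (returncode, stdout, stderr)."""
    workdir = tempfile.mkdtemp(prefix="sandbox-")
    try:
        proc = subprocess.run(
            argv,
            cwd=workdir,
            env=scrubbed_env(),
            preexec_fn=preexec_fn,          # POSIX only: runs between fork() and exec()
            stdin=subprocess.DEVNULL,
            capture_output=True,
            timeout=SIGNAL_ALARM_SEC + 5,   # parent-side backstop
        )
        # Truncate rather than fail when a stream exceeds the cap.
        return (
            proc.returncode,
            proc.stdout[:STREAM_CAP_BYTES],
            proc.stderr[:STREAM_CAP_BYTES],
        )
    finally:
        shutil.rmtree(workdir, ignore_errors=True)
```

Slicing after capture bounds what is kept, not what the parent buffers; a stricter version would read the pipes incrementally. If the backstop fires, `subprocess.run` raises `subprocess.TimeoutExpired`, which a caller would need to map to a status.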

The six caps (what each one prevents)

Values are copied verbatim from scripts/limits.py. These caps are the mitigation for the arbitrary-code-execution (ACE) risk of running developer code on every PR.

Cap	Value	Prevents
`RLIMIT_CPU`	5 s	CPU-bound infinite loops (BFS explosion, cryptographic accident)
`RLIMIT_AS`	512 MB	OOM from unbounded list-build, `x * 10**9` accidental allocation
`RLIMIT_NOFILE`	16	File-descriptor exhaustion (loop opening files, leaked sockets)
`RLIMIT_FSIZE`	10 MB	Disk-fill DoS (log spam, accidental `dd`-to-file)
`RLIMIT_NPROC`	0	Fork bombs — `subprocess.Popen` loop, `multiprocessing.Pool`
`signal.alarm`	10 s	Blocked I/O (socket read on dead connection) `RLIMIT_CPU` cannot see

Relaxing any value converts the sandbox into arbitrary-code-execution-on-every-PR. The contract requires a documented security review before the number changes. See CLAUDE.md § Behavioral contract 2.

Soft and hard limits are set identical — no headroom for escalation inside the child.
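As a rough sketch of how such caps could be installed with the stdlib `resource` and `signal` modules: the values come from the table above, while the names `RLIMITS` and `install_caps` are hypothetical and may not match scripts/limits.py.

```python
import resource
import signal

MB = 1024 * 1024

# Values from the caps table. Names here are illustrative, not the real module's.
RLIMITS = {
    resource.RLIMIT_CPU: 5,           # seconds of CPU time
    resource.RLIMIT_AS: 512 * MB,     # bytes of address space
    resource.RLIMIT_NOFILE: 16,       # open file descriptors
    resource.RLIMIT_FSIZE: 10 * MB,   # bytes per written file
    resource.RLIMIT_NPROC: 0,         # no new processes
}
SIGNAL_ALARM_SEC = 10


def install_caps():
    """Meant to be passed as preexec_fn: runs in the child after fork(), before exec()."""
    for limit, value in RLIMITS.items():
        # soft == hard, so the child cannot raise its own soft limit later
        resource.setrlimit(limit, (value, value))
    # A pending alarm survives exec(), so wall-clock time is bounded as well.
    signal.alarm(SIGNAL_ALARM_SEC)
```

A caller would pass it as `subprocess.run(argv, preexec_fn=install_caps)`. Calling `install_caps()` directly in the parent would cap the parent itself, so it should only ever run in the forked child.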

Outcome taxonomy

Every run records exactly one of six statuses to state/run-log.jsonl. No binary success/fail collapse.

Status	Meaning
`confirmed-bug`	Witness reproduced the exact `flag_class`-matching exception. Hard FAIL trigger.
`timeout-without-confirmation`	`SIGALRM` fired before the bug surfaced. HOLD trigger — could be unreachable, or a separate hang.
`no-bug-found`	Child exited 0, or exited with an unrelated exception. Static flag stands; no dynamic confirmation.
`input-synthesis-failed`	No type-valid witness could be constructed. Reported honestly; not a pass.
`sandbox-error`	Infra failure (SIGKILL from AS/CPU cap, SIGXFSZ, runner exception). Not a review finding.
`platform-unsupported`	Host lacks `resource` and no WSL bridge available. Verdict notes M5 did not run.

The flag_class -> expected exception correspondence is narrow on purpose: a ZeroDivisionError only confirms a div-zero flag, not an index-oob one. If the child raises something unrelated, v1 returns no-bug-found and lets M1 reflag on the next pass under the correct rule.
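The narrow correspondence might look like the sketch below. Only the div-zero row is stated in this README; the `index-oob` to `IndexError` row, the `EXPECTED` table, and the `classify` function are assumptions for illustration. The two statuses decided before a child ever runs (input-synthesis-failed, platform-unsupported) are left out.

```python
import signal

# Hypothetical table: only the div-zero entry is confirmed by the text above.
EXPECTED = {
    "div-zero": "ZeroDivisionError",
    "index-oob": "IndexError",  # assumed mapping
}


def classify(flag_class, returncode, error_class):
    """Map one finished POSIX child to a run-log status."""
    if returncode < 0:
        # subprocess reports death-by-signal as a negative return code
        if -returncode == signal.SIGALRM:
            return "timeout-without-confirmation"
        return "sandbox-error"
    if returncode != 0 and error_class is not None:
        if error_class == EXPECTED.get(flag_class):
            return "confirmed-bug"
    # Clean exit, or an exception that does not match the flag's class.
    return "no-bug-found"
```

Under these assumptions, `classify("index-oob", 1, "ZeroDivisionError")` returns `"no-bug-found"`: the child crashed, but not in the way the flag predicted.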

Example: the div-zero confirmation flow

Buggy target (repo/stats.py):

def average(nums):
    return sum(nums) / len(nums)

M1 flags line 2 with flag_class="div-zero" and witness_hints={"divisor_name": "nums"}. Witness synthesis produces {"args": [[]]} as the boundary input. M5 runs the child:

$ python plugins/lich-sandbox/scripts/sandbox.py
{"confirmed": 1, "timeout": 0, "sandbox_error": 0, "no_bug": 0, "synth_failed": 0, "unsupported": 0}

The run-log record:

{
  "ts": "2026-04-20T14:23:11.482+00:00",
  "flag_ref": {"file": "repo/stats.py", "function": "average", "flag_class": "div-zero", "line": 2},
  "witness": {"args": [[]]},
  "status": "confirmed-bug",
  "exit_code": 1,
  "signal": null,
  "error_class": "ZeroDivisionError",
  "traceback_head": "Traceback (most recent call last):\n  File \"_sandbox_child.py\", ...\nZeroDivisionError: division by zero",
  "duration_ms": 47,
  "backend": "posix"
}

That record is a fact, not a score. The verdict composer reads it and emits FAIL. The developer sees the exact input that breaks the function, not a ruling about severity.
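To see the same failure outside the plugin, the witness can be replayed in a plain child process. This sketch applies no caps and no scrubbed environment, so it only demonstrates the traceback match; `CHILD` and `last_exception_class` are hypothetical names, not part of M5.

```python
import json
import subprocess
import sys
import textwrap

# The buggy target, inlined so the example is self-contained.
CHILD = textwrap.dedent("""
    import json, sys

    def average(nums):
        return sum(nums) / len(nums)

    witness = json.loads(sys.argv[1])
    average(*witness["args"])
""")


def last_exception_class(stderr_text):
    """Return the exception class named on the final traceback line, if any."""
    lines = [ln for ln in stderr_text.strip().splitlines() if ln.strip()]
    if not lines:
        return None
    return lines[-1].split(":", 1)[0].strip()


witness = {"args": [[]]}
proc = subprocess.run(
    [sys.executable, "-c", CHILD, json.dumps(witness)],
    capture_output=True,
    text=True,
)
error_class = last_exception_class(proc.stderr)
print(proc.returncode, error_class)
```

An uncaught exception makes the interpreter exit with code 1, which matches the `exit_code` in the record above.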

Running it

Seed flags from M1:

python plugins/lich-core/scripts/__main__.py <target_file_or_dir>
# writes plugins/lich-core/state/review-flags.jsonl

Confirm with M5:

python plugins/lich-sandbox/scripts/sandbox.py
# reads the default input path, writes plugins/lich-sandbox/state/run-log.jsonl
# prints JSON summary: {"confirmed": N, "timeout": N, ...}

Custom paths:

python plugins/lich-sandbox/scripts/sandbox.py my-flags.jsonl my-log.jsonl
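The printed summary can be recomputed from a run-log file. The status-to-key mapping below is inferred from the summary keys and the six statuses shown in this README, so treat it as an assumption; `SUMMARY_KEYS` and `summarize` are hypothetical names.

```python
import json

# Inferred mapping from run-log status to summary key (an assumption).
SUMMARY_KEYS = {
    "confirmed-bug": "confirmed",
    "timeout-without-confirmation": "timeout",
    "sandbox-error": "sandbox_error",
    "no-bug-found": "no_bug",
    "input-synthesis-failed": "synth_failed",
    "platform-unsupported": "unsupported",
}


def summarize(lines):
    """Tally statuses from an iterable of JSONL lines."""
    counts = dict.fromkeys(SUMMARY_KEYS.values(), 0)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        status = json.loads(line)["status"]
        counts[SUMMARY_KEYS[status]] += 1
    return counts
```

Usage would be along the lines of `with open("my-log.jsonl", encoding="utf-8") as fh: print(json.dumps(summarize(fh)))`.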

Install via the marketplace:

/plugin install lich-sandbox@lich
# or the bundle: /plugin install full@lich

lich-core must produce the flags file; install both, or use full.

Platform reality

Linux / macOS / native POSIX: full sandbox. All six caps active (the five RLIMITs plus signal.alarm).
Windows with WSL: bridged via scripts/bridge/wsl.py. Child runs under wsl.exe -e env -i python3; Windows paths translate to /mnt/<letter>/...; same 6 caps apply inside WSL. Signal detection is stderr-marker based (Alarm clock, Killed, File size limit exceeded) rather than negative returncode.
Windows without WSL: platform-unsupported emitted per flag. Not silent; the verdict notes M5 did not run and falls back to M1-only judgment with reduced confidence.
Node / TypeScript runner (scripts/runners/node.py): weaker sandbox. The heap is capped via --max-old-space-size=512 and the parent enforces a 10 s timeout, but there are no equivalents for RLIMIT_NOFILE, RLIMIT_FSIZE, or RLIMIT_NPROC. That gap is documented, not hidden. .ts / .tsx targets require tsx or ts-node in the environment; absent that, the runner returns input-synthesis-failed rather than pretending it transpiled.
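The two WSL-specific details, path translation and stderr-marker signal detection, could be sketched as follows. This is not the code in scripts/bridge/wsl.py; `to_wsl_path`, `detect_signal`, and `STDERR_MARKERS` are hypothetical names, and the lowercase drive letter assumes the default WSL mount layout.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Markers the shell prints when a child dies from a signal (from the text above).
STDERR_MARKERS = {
    "Alarm clock": "SIGALRM",
    "File size limit exceeded": "SIGXFSZ",
    "Killed": "SIGKILL",
}


def to_wsl_path(win_path):
    """Translate an absolute Windows path such as C:\\repo\\x.py to /mnt/c/repo/x.py."""
    p = PureWindowsPath(win_path)
    drive = p.drive.rstrip(":").lower()
    if len(drive) != 1:
        raise ValueError(f"expected a drive-letter path, got {win_path!r}")
    return str(PurePosixPath("/mnt", drive, *p.parts[1:]))


def detect_signal(stderr_text):
    """Infer the terminating signal from stderr, since the return code is not negative here."""
    for marker, name in STDERR_MARKERS.items():
        if marker in stderr_text:
            return name
    return None
```

Substring matching is deliberately crude: a child that legitimately prints "Killed" would be misread, so a real bridge would likely anchor the match more tightly.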

Why this is the moat

No existing reviewer ships static-suspicion -> sandboxed-confirmation at zero-dep plugin weight. Semgrep and ESLint are static-only. CodeQL needs a build extraction step. Infer needs Clang and a compile database. Pyre is type-checking, not runtime. Every alternative either stops at the flag (static tools) or demands infrastructure the reviewer does not own (Docker, cloud sandbox, CI runner). M5 installs its caps in a preexec_fn between fork and exec, on the developer's machine, with six stdlib caps and no config. The contract "confirmed bugs are facts, not probabilities" holds because M5 actually runs the code instead of scoring it.

Non-duplication contract

M5 does not:

Scan for CWEs. That is Hydra's R3 lane — 98 CWEs across 2,011 patterns. M5 boosts M6 attention weight on Hydra-flagged files but never reclassifies.
Classify changes. That is Crow's V1/V2 lane. M5 consumes Crow's trust score into the M6 prior.
Emit verdicts. That is lich-verdict's lane. M5 writes run-log.jsonl records; the verdict composer reads them.
Persist witness outputs. The per-run tempdir is ephemeral. Only the run-log record survives.

Breaking the split fractures severity source-of-truth across plugins.

State files

File	Schema keys	Purpose
`state/run-log.jsonl`	`ts`, `flag_ref`, `witness`, `status`, `exit_code`, `signal`, `error_class`, `traceback_head`, `duration_ms`, `backend`	Append-only record, one line per witness run.

traceback_head is bounded at 1000 chars. backend is posix / wsl / unsupported. The record shown in the div-zero example above is the canonical shape.
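A record with this shape could be assembled as below. The key order and the 1000-character bound come from this README; `build_record`, `append_record`, and `TRACEBACK_HEAD_MAX` are hypothetical names, and the timestamp format is chosen to match the example record.

```python
import json
from datetime import datetime, timezone

TRACEBACK_HEAD_MAX = 1000  # bound stated above
RECORD_KEYS = (
    "ts", "flag_ref", "witness", "status", "exit_code",
    "signal", "error_class", "traceback_head", "duration_ms", "backend",
)


def build_record(flag_ref, witness, status, exit_code, signal_name,
                 error_class, traceback_text, duration_ms, backend):
    """Assemble one run-log record with the schema keys in table order."""
    return {
        "ts": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
        "flag_ref": flag_ref,
        "witness": witness,
        "status": status,
        "exit_code": exit_code,
        "signal": signal_name,
        "error_class": error_class,
        "traceback_head": (traceback_text or "")[:TRACEBACK_HEAD_MAX],
        "duration_ms": duration_ms,
        "backend": backend,
    }


def append_record(path, record):
    """Append-only write: one JSON object per line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```

Opening the file in `"a"` mode keeps earlier records intact, which is what the append-only description implies.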

Security

This plugin executes developer code. The six resource caps are the only thing between a PR's payload and the host. Any cap relaxation requires a documented security review (CLAUDE.md § Behavioral contract 2).

Next

Sandbox demo walkthrough
lich-core README — M1 static pass that feeds M5
lich-verdict README — composes M1 + M5 + M6 + M7 into DEPLOY / HOLD / FAIL
