Skip to content

Latest commit

 

History

History
596 lines (464 loc) · 49.3 KB

File metadata and controls

596 lines (464 loc) · 49.3 KB

Lich — Product Architecture

Phase 3 · Plugin #6 · enchanter-ai · answers the developer's sixth question: "Is this code good?"

Name. Lich. After the Lich Lords of Hollow Knight — gate-reviewers who judge worthiness through trial before letting you pass. Every PR is a supplicant at the gate; every engine is a test the code must survive. Lich joins Crow (change comprehension) and Sylph (git flow) in the Hollow Knight cluster — three HK entities for three related dev-surface plugins is intentional brand signal. This slot previously carried the placeholder name "Athena"; that name is retired because Athena is a pre-existing mythological figure Supergiant borrowed, not a game-native entity. Lich Lords are game-native and pass the naming convention.

Engine prefix. M (single letter, unique across F/A/V/S/W/L).

Trigger model. Hybrid. PostToolUse hook subscribes to Crow's change-classification signal (Phase 1: file-tail of crow/plugins/change-tracker/state/audit.jsonl; Phase 2: crow.change.classified event via enchanted-mcp) and auto-reviews affected hunks. Skill-invoked commands (/lich-review, /lich-explain) are the manual handle for ad-hoc deep reviews. Silent on DEPLOY, surfacing on HOLD/FAIL.

Panel composition. This document synthesizes four expert lenses: static-analysis researcher, dynamic-analysis & fuzzing engineer, code-review & developer-preference expert, and enchanter-ai architect. Disagreements are surfaced before resolution.


Layer 1: Language Substrate & Parser Strategy

Prior art. Python's stdlib ast module is the canonical zero-dep AST source; it parses Python 3.8+ source into a typed tree, supports ast.walk, ast.NodeVisitor, and ast.unparse since 3.9. For TypeScript, three realistic options: (1) tsc --noEmit --generateTrace <dir> subprocess-call and parse the JSON trace output, (2) ship a tree-sitter Wasm binary (tree-sitter/tree-sitter-typescript, ~3MB) and invoke via a tiny Python loader, (3) use esprima-python (pure-Python, but only JS not TS). Semgrep uses tree-sitter in production for 20+ languages; biome uses a hand-rolled Rust parser.

Options comparison.

Option Ship weight Dep class Precision on TS generics Pitfall
stdlib ast for Python 0MB stdlib N/A Python-only
tsc --generateTrace subprocess 0MB shipped runtime-optional (tsc must exist in repo) Highest (compiler-grade) Fails offline / on repos without TypeScript devDep installed
tree-sitter Wasm ~3MB bundled Medium (no type-resolution) Ship-weight burden; Wasm runtime (wasmtime-py) would add another 10MB+ — disqualifying
esprima-python stdlib-ish 0MB pure-Python wheel Low (JS-only, no TS syntax) Fails on interface, generics, decorators

Recommendation. Python substrate uses stdlib ast (zero dispute). TypeScript substrate uses tsc --noEmit --generateTrace subprocess with a graceful fallback: if tsc is not resolvable in the repo's node_modules/.bin/ or PATH, Lich's TypeScript adapter emits a one-line status ts-parse-unavailable and skips M1/M2 for .ts/.tsx files — Lich still runs M6/M7 on the diff text. Never silently pretend TS analysis ran. Tree-sitter Wasm deferred to Phase 2 as an optional "bundled-parser" plugin for offline users.

Pitfall. Static-analysis researcher warned that tsc --generateTrace is undocumented for AST extraction; its output format is a profiling trace, not a stable AST. Accepted risk: if the format changes across tsc versions, Lich's adapter pins a tested range (>=5.0,<6.0) and emits a compatibility warning outside it. This is the kind of precision/portability tradeoff the prior-art's three candidates all make somewhere.


Layer 2: M1 Cousot Interval Propagation — Abstract Domains & Widening

Prior art. Cousot & Cousot POPL'77 formalized abstract interpretation as a Galois connection between concrete and abstract domains. Every sound analyzer (Astrée, Polyspace, Mopsa, Facebook Infer Pulse, Microsoft pyright's type-narrowing) descends from this framework. The core tradeoff the paper names is between precision (how tight the abstract approximation is) and termination (widening must fire at some lattice height to guarantee fixpoint convergence on loops). Intervals are the simplest non-trivial abstract domain: ⊥ ≤ [lo, hi] ≤ ⊤ with empty and ⊤ = (-∞, +∞). Nullability adds {Null, NotNull, MaybeNull, ⊤}. Shape tracking for containers adds {Empty, NonEmpty, Unknown}. pyright ships exactly this trio plus type-narrowing for control-flow contexts.

Options comparison.

Widening strategy Terminates Precision Pitfall
No widening Never on unbounded loops Perfect Hangs — disqualified
Jump to after N iterations (N=3) Always after 3 iters Low (loses all range info on loops) Loses div-zero evidence on loop-carried divisors
Threshold widening (N=3, then jump to nearest threshold in {0, 1, 255, 65535, MAXINT}) Always after 3 iters Medium Thresholds must be language-aware (e.g. u8 vs. i32)
Narrowing after widening (1 pass) Always Medium-High Extra pass cost; still loses precision on non-obvious invariants

Recommendation. Threshold widening with N=3 and a language-aware threshold set: {Null, 0, 1, -1, sys.maxsize, -sys.maxsize} for Python integers + nullability; {null, undefined, 0, 1, -1, Number.MAX_SAFE_INTEGER} for TS. One narrowing pass after widening. Lattice height per variable is bounded at 8 (4 bits) — the analyzer stops refining a variable after 8 lattice visits and marks it . Dynamic-analysis engineer pushed for no widening ("the whole point is precision"); static-analysis researcher overrode with termination being non-negotiable — Lich must never hang on a user's loop.

Pitfall. Abstract interpretation is sound on the properties it models and silent on everything else. M1 will miss bugs it was never told to look for — e.g., a dict.get(k) returning None won't be caught unless the nullability domain explicitly models Optional[T] unpacking. Document this limit in the sub-plugin README; don't sell soundness as correctness. Source: Cousot & Cousot POPL'77.


Layer 3: M2 Falleri Structural Diff — GumTree Parameter Defaults

Prior art. Falleri et al. ASE 2014 "Fine-grained and Accurate Source Code Differencing" introduced GumTree: a two-phase algorithm that (1) greedily matches isomorphic subtrees top-down by hash (phase 1), then (2) matches remaining nodes bottom-up by Dice coefficient on descendant-match ratios (phase 2). Key parameters: min_height (don't match single-leaf subtrees — too noisy), min_dice (similarity threshold for bottom-up containment match), min_similarity (accept partial matches only above this threshold). Paper defaults: min_height=2, min_dice=0.5, min_similarity=0.5. These are tuned for fine-grained differencing across arbitrary code; code-review cares more about semantic edits than perfect granularity.

Options comparison.

Parameter set Move/rename recovery False-positive noise Runtime on 500-LOC diff Pitfall
Paper defaults (2, 0.5, 0.5) High Medium (many leaf-level matches) 250ms Reviewer sees too many "trivial rename" edits
Conservative (3, 0.6, 0.7) Medium-high Low 180ms Misses some cross-function moves
Aggressive (1, 0.3, 0.3) Very high High (false moves on similar subtrees) 400ms Match pollution — spurious move edits confuse reviewer

Recommendation. Conservative defaults: min_height=3, min_dice=0.6, min_similarity=0.7. Code-review use prefers fewer, higher-confidence semantic edits over exhaustive fine-grained differencing. Reviewer's attention is the scarce resource; M2's job is to surface the 3-5 semantic edits that matter, not the 40-node leaf-match noise. When M2 is uncertain (partial match below threshold), it falls back to the hunk's unified diff and flags the fallback — honest-numbers contract.

Pitfall. GumTree's bottom-up phase is O(n·m) in worst case on dense subtree similarity. On a 10k-LOC file the analyzer can spike to several seconds. M2 enforces a 2-second per-file budget; if exceeded, it times out and emits structural-diff-timeout, falling back to unified diff. Never block the reviewer on a pathological case. Source: Falleri et al. ASE'14 preprint (HAL).


Layer 4: M5 Bounded Subprocess Dry-Run — Sandbox Policy

Prior art. Python stdlib resource module exposes setrlimit(RLIMIT_CPU, seconds), setrlimit(RLIMIT_AS, bytes) (address space), setrlimit(RLIMIT_NOFILE, count) (open file descriptors), setrlimit(RLIMIT_FSIZE, bytes) (max file size). Available on POSIX only — absent from Windows stdlib. signal.alarm(seconds) fires SIGALRM after wall-clock time; subprocess.run(..., timeout=) is the portable fallback. Docker/Firecracker/gVisor give stronger isolation but are explicitly out-of-scope (zero-dep invariant). Pyodide ships Python-in-Wasm — 10MB+ wheel, disqualified. CrossHair (pschanely/CrossHair) is the closest pure-Python precedent but requires z3-solver wheel. The minimum sandbox that catches the developer's canonical 1/0 example is: fork a subprocess, set rlimits in a preexec_fn, set signal.alarm to a shorter wall-clock cap, feed the synthesized input to a runpy-wrapped function, catch stderr + exit code, parse.

Options comparison.

Isolation level Ship weight Windows support ACE risk Pitfall
resource.setrlimit + signal.alarm + subprocess + env scrub 0MB No Low (capped CPU/RSS/FD, no network by env) Unix-only; Windows falls back to timeout-only, weaker
Docker rootless 10s of MB on host + daemon Yes via Docker Desktop Very low Ship-weight; assumes Docker installed
Firecracker microVM 100+ MB No (Linux KVM only) Negligible Disqualified — ship-weight + kernel requirement
No sandbox (just subprocess + timeout) 0MB Yes Medium-High Network calls, filesystem writes, fork bombs

Recommendation. Stdlib resource.setrlimit + signal.alarm on Unix, exposed as the lich-sandbox sub-plugin. Exact caps at launch:

  • RLIMIT_CPU = 5 (5 CPU-seconds; infinite loop ≈ timeout at 5s of CPU)
  • RLIMIT_AS = 512 * 1024 * 1024 (512 MB address space cap)
  • RLIMIT_NOFILE = 16 (16 open FDs; enough for stdlib + 3-4 temp files)
  • RLIMIT_FSIZE = 10 * 1024 * 1024 (10 MB per-file write cap)
  • signal.alarm(10) (10s wall-clock; kicks in ahead of CPU on pathological I/O)

Network isolation: scrub HTTP_PROXY, HTTPS_PROXY, and set no_proxy=*; refuse to ship sockets in the harness; document that developers running malicious code through Lich should also use OS-level network blocks (the plugin cannot fully guarantee network denial from Python alone). Filesystem isolation: write-target limited to a per-run tempfile.mkdtemp() that's deleted on exit; reads are unrestricted (analysis-only, no containment of reads). Windows: skip M5 entirely at launch; emit platform-unsupported in the verdict with an honest note. Phase 2 adds a Job Objects backend for Windows.

Pitfall. Sandbox without caps is arbitrary-code-execution on every PR — the plugin becomes an attack vector. The resource.setrlimit caps are load-bearing; any relaxation requires a documented security review. Static-analysis researcher and plugin-brand architect aligned strongly here; fuzzing engineer noted that sandbox-escape via untrusted C extensions loaded by Python is still possible (via ctypes with a forged shared library). Mitigation: sandbox runs under a non-privileged user whenever possible, and the Lich README flags this limit honestly.


Layer 5: M5 Input Synthesis — Boundary Values at Launch, M4 Phase 2

Prior art. Hypothesis (HypothesisWorks/hypothesis) ghostwriter inspects function signatures + type annotations and synthesizes @given test stubs in seconds using inspect.signature + typing.get_type_hints. Property-based testing's "boundary values" tradition: for each input type, try the 5-10 values most likely to break generic functions (0, -1, None, "", [], {}, sys.maxsize). For M1-flagged variables specifically, the flag itself names the suspect value (n ∈ [-∞, +∞] suspected at x / n → try n = 0).

Options comparison.

Synthesis strategy Coverage on 10 failure classes Dep weight Pitfall
Boundary values from flag Div-zero ✓, null ✓, OOB partial, overflow ✓ 0MB Misses structure-sensitive bugs (need valid parse, invalid content)
Hypothesis ghostwriter (external) 9/10 classes hypothesis wheel (~2MB) Violates zero-dep if bundled; optional-install acceptable
CrossHair symbolic execution 10/10 but slow z3-solver wheel (~8MB) Out-of-dep scope; invoke only if developer has it installed
Random + coverage-guided 8/10 0MB Slow convergence; no guarantee on bug-triggering inputs within budget

Recommendation. MVP (Phase 1) uses boundary-value synthesis driven directly by M1 flags: for each flagged variable, Lich tries a per-type default set ({0, -1, None, "", [], sys.maxsize} for Python int/str/list), runs the containing function in M5's sandbox, catches ZeroDivisionError, TypeError, IndexError, OverflowError. If Hypothesis is installed in the developer's environment, Lich's adapter detects it and upgrades to ghostwriter-synthesized stubs (Phase 2 engine M4 — gated behind Hypothesis availability). CrossHair is the Phase 3 moonshot, behind the developer's explicit opt-in.

Failure class Boundary synthesis Hypothesis (P2) CrossHair (P3)
Divide-by-zero ✓ (from M1 n=0)
Null/None deref
Array OOB partial
Integer overflow ✓ (sys.maxsize + 1)
Unhandled exception
Infinite loop ✓ (via signal.alarm) partial

Pitfall. Boundary-value synthesis is weakest on structure-sensitive inputs — a function that parses JSON won't fail on "0" but will fail on '{"malformed": }'. M1+M5 explicitly does not claim to find those; that's Phase 2 M4 territory. Document the coverage honestly.


Layer 6: M6 Bayesian Preference Accumulation — Priors, Updates, Floor

Prior art. Thompson 1933 introduced Thompson sampling for the two-armed bandit. Russo & Van Roy (Stanford 2018) "A Tutorial on Thompson Sampling" is the modern reference. For binary preference (accept/reject a rule's finding), Beta-Binomial is the conjugate prior: Beta(α, β) parameterizes the posterior after α-1 accepts and β-1 rejects (starting from Beta(1,1) uniform). Sampling surfaces a rule with probability proportional to a draw from its Beta posterior — rules with high mean and low variance surface often; rules with high variance still get exploration. No shipped reviewer does this today: Copilot uses markdown copilot-instructions.md; Cursor uses "memories" (auto-generated notes, not a posterior); Qodo Merge accumulates team-level rules, not per-developer. JetBrains Mellum uses ML-ranked completion ordering, not rule-level preference. Lich M6 is genuinely novel territory.

Options comparison.

Algorithm Cold-start Over-fit resistance Uncertainty-native Pitfall
Beta-Binomial Thompson Uniform Beta(1,1) Strong (two rejections moves posterior by ~0.1 mean) Yes (sample from posterior) Cold-start means high variance for new developer; paired with Phase 2 Cohort Similarity
Logistic regression on features Zero-weight coefficients Weak (drifts on few samples) No native Needs feature engineering; not Bayesian
Elo Each rule at 1500 Medium No Competitive-pairs vocabulary; doesn't encode individual accept-rate cleanly
Contextual bandit (LinUCB) Uniform Strong with regularization Yes (confidence bound) Needs context vector per rule; over-engineered for MVP

Recommendation. Beta-Binomial Thompson sampling. Initial prior Beta(α=1, β=1) uniform for all (developer, rule) pairs. Update rule: accept → α++; reject → β++; overrode → treat as 0.5 accept + 0.5 reject (ambiguous signal, update both). Surface probability on each review pass is a Thompson sample from the posterior. Minimum sampling floor: 5% (if the Thompson sample would surface below 5%, upgrade it to 5%). This prevents a rule from being "dead" after a few rejections; the developer can permanently suppress via /lich-disable <rule> which writes to an overrides file. Overrides auto-expire with a quarterly "still disabled?" prompt — anti-drift protocol.

Persistence: plugins/lich-preference/state/learnings.json — one file per developer ID, containing a map {rule_id → {alpha, beta, surface_count, accept_count, reject_count, last_update}}. Atomic writes via os.rename (brand standard A4 Atomic State Serialization pattern).

Pitfall. Preference expert warned of the cold-start problem: for a new developer, all posteriors are Beta(1,1), so all rules surface at 50% probability on PR #1. Developer sees ~2× more surfaces than a tuned system, potentially causing churn-rejection before the posterior informs. Mitigation: ship Phase 2's Cohort Similarity Borrowing (inherit priors from similar developers on the same repo). MVP accepts the 2-3-week cold-start noise as an honest tradeoff. Source: Thompson sampling tutorial.


Layer 7: M7 Zheng Pairwise Rubric Judgment — Rubric & Debiasing

Prior art. Zheng et al. 2023 "MT-Bench" demonstrated LLM-as-judge with pairwise comparisons, identifying position bias (the first-presented answer scores higher on average) and self-preference bias (a judge prefers outputs from its own model family). Mitigations: swap order and average, use a different model family for judge vs. judged, and report inter-judge reliability via Cohen's Kappa. Recursive Rubric Decomposition (2025) showed that breaking a monolithic quality metric into 5-8 independent sub-axes and judging each axis separately reduces variance vs. a single holistic score. Code-review-specific rubric work (CodeScore 2024, GPTScore extensions) narrows the axes to: Clarity, Correctness, Idiom-fit, Testability, Simplicity, Maintainability, Performance-awareness.

Options comparison.

Rubric design Variance Reviewer alignment Cost (tokens/review) Pitfall
1 holistic axis High Low ~500 High judge-to-judge disagreement
3 axes (Clarity / Correctness / Idiom) Medium Medium ~1500 Misses testability and simplicity signals
5 axes (above + Testability + Simplicity) Low High ~2500 Slightly expensive; may over-surface rubric noise
8 axes Low High but with overlap (Simplicity ⊆ Clarity?) ~4000 Axis correlation; axes should be orthogonal

Recommendation. 5 orthogonal axes at launch:

  1. Clarity — Can a reviewer summarize this function's purpose in one sentence after 30 seconds?
  2. Correctness-at-glance — Are the guards (null checks, boundary checks, error handling) visible without tracing?
  3. Idiom-fit — Does the code match this language's and this repo's conventions?
  4. Testability — Are there seams for unit tests without mocking the universe?
  5. Simplicity — Is the solution the simplest that solves the stated problem?

Scoring scale 1-5 per axis (not 1-10; 1-5 is easier to justify across judges). Debiasing protocol: for every diff, judge twice with (before, after) then (after, before); if per-axis delta ≥ 1.5, escalate to Opus adjudicator. Inter-judge reliability: Cohen's Kappa computed per axis across the two runs; Kappa < 0.4 flags the axis as "unstable" in the PDF report — honest-numbers signal, never hidden behind an average.

Model tier: Sonnet default judge. Haiku when Pech's pech.budget.threshold.crossed fires at 80% (cost-aware downshift, brand-standard cost contract). Opus reserved for disagreement adjudication only — cost-contract respected.

Pitfall. Position bias is real even with a single swap — a 2-sample average is still noisy. Lich's Kappa reporting is the honest-numbers floor: if Kappa is unstable, the axis output is explicitly flagged rather than averaged silently. Preference expert pushed for 8 axes; static-analysis researcher pushed for 3. Compromise: 5 axes at launch, Phase 2 survey data from real review flows informs whether to expand or consolidate. Source: MT-Bench (Zheng et al. 2023).


Layer 8: Verdict Contract — DEPLOY / HOLD / FAIL Thresholds

Prior art. Wixie's CLAUDE.md defines the three-verdict vocabulary: DEPLOY (σ < 0.45 AND overall ≥ 9.0 AND all 5 axes ≥ 7.0 AND 8/8 SAT assertions pass), HOLD (any axis < 7.0 or σ ≥ 0.45), FAIL (reviewer flags a structural issue). Lich adopts the same three-verdict shape but defines its own threshold math because the axes and assertions differ.

Options comparison.

Verdict math Interpretability False-positive rate (est) Pitfall
Any HIGH/CRITICAL M1 finding → FAIL High Low Brittle on false-positive prone rules
Composite score blending M1/M5/M7 with weights Medium Medium Weights are vibes unless tuned
Hard floor per engine with explicit AND High Medium No composite signal across engines
Ensemble: HARD floors per engine AND composite tie-breaker High Low More logic to document

Recommendation. Ensemble with hard floors plus a composite tie-breaker:

Verdict M1 condition M5 condition M6 condition M7 condition Action
DEPLOY All flagged sites severity < HIGH No confirmed runtime failure ≥ 80% surfaced findings have posterior mean > 0.5 All 5 axes ≥ 3.5/5 AND Kappa ≥ 0.4 Silent pass; write DEPLOY to state/verdict.jsonl
HOLD 1-2 HIGH flags AND no CRITICAL Any timeout-without-confirmation ≥ 50% posterior > 0.3 Any axis < 3.5 OR Kappa < 0.4 Surface to reviewer with per-engine breakdown; Sylph's pre-commit gate warns
FAIL Any CRITICAL flag OR ≥ 3 HIGH flags Any confirmed runtime failure (N/A — posterior doesn't downgrade from HOLD to FAIL) Any axis ≤ 2/5 OR > 2 axes < 3 Block; Sylph's pre-commit gate refuses auto-commit; developer acknowledgment required

Pitfall. A single M5-confirmed runtime failure (e.g., div-by-zero with a concrete witness) is load-bearing evidence — it's not a probability, it's a fact. The verdict math treats confirmed failures as hard FAIL triggers, not soft scores. Inversely, M7 rubric scores are genuinely subjective; they inform HOLD but can never single-handedly fail a PR. This asymmetry is intentional and documented. Source: Wixie CLAUDE.md § DEPLOY bar.


Layer 9: Hydra, Crow, Emu, Pech, Sylph Integration Contract

Prior art. The enchanter-ai ecosystem uses per-plugin audit.jsonl files as the Phase 1 source of truth (enchanted-mcp event bus is Phase 2). Hydra's plugins/vuln-detector/state/audit.jsonl records are shape {event:"vuln_detected", ts, file, line, vuln_id, cwe, severity, description, language, tool} (verified via direct file inspection). Crow's plugins/change-tracker/state/audit.jsonl records classify changes. Emu publishes token/cost metrics to plugins/*/state/metrics.jsonl. Pech (building) will publish budget-threshold events. Sylph's pre-commit gate already subscribes to pech.budget.threshold.crossed and will subscribe to lich.review.completed per brand-standard event-envelope convention.

Options comparison.

Consumption mechanism Coupling Phase 1 ready Pitfall
Tail-read sibling audit.jsonl Tight (file path + record shape) Yes Breaks if sibling renames files or changes schema
MCP event subscription Loose (envelope named, sibling implements) No (Phase 2) Forward-looking only
Shared SQLite Loose Would need new plugin Over-engineered for MVP
stdin/stdout pipe Tight, runtime-coupled No (Claude Code doesn't pipe plugins) Not a supported mechanism

Recommendation. Phase 1 uses file-tailing of sibling audit.jsonl files with a pinned record-shape contract in Lich's hooks/record-shapes.json. Phase 2 migrates to MCP event subscription with the same payload shape.

Record shapes Lich consumes (Phase 1):

// hydra/plugins/vuln-detector/state/audit.jsonl → used as M6 attention weight
{"event":"vuln_detected","ts":"2026-04-19T12:34:56Z","file":"src/api.ts","line":42,
 "vuln_id":"sql-injection-template-literal","cwe":"CWE-89","severity":"critical",
 "description":"...","language":"typescript","tool":"Write"}

// crow/plugins/change-tracker/state/audit.jsonl → used to detect what Lich should review
{"event":"change.classified","ts":"2026-04-19T12:34:57Z","file":"src/api.ts",
 "classification":"behavior-change","trust_score":0.62,"hunks":[...]}

Record shape Lich emits (plugins/verdict/state/verdict.jsonl):

{"event":"review.completed","ts":"2026-04-19T12:35:02Z","pr_id":"local-20260419-1234",
 "verdict":"HOLD","engines":{"M1":{...},"M2":{...},"M5":{...},"M6":{...},"M7":{...}},
 "rubric_scores":{"clarity":4,"correctness":3,"idiom":4,"testability":3,"simplicity":4},
 "kappa":{"clarity":0.72,"correctness":0.68,...}}

Non-duplication invariants.

  • Lich NEVER re-scans for CWE-89 SQL injection, CWE-79 XSS, CWE-918 SSRF, or any other CWE-tagged security finding. These are Hydra R3's lane. If Hydra's audit.jsonl already has a CRITICAL CWE on the file, Lich increases review-attention weight for M6 and adds a "Security context: Hydra flagged {cwe} {severity}" note to M7's rubric input, but does not re-classify the finding.
  • Lich NEVER re-classifies a change. Crow's V1 Semantic Diff + V2 Bayesian Trust output is authoritative. Lich is a consumer, not a peer classifier.
  • Lich consumes but does not mutate Emu's token metrics. M7 judge-tier downshift to Haiku under Pech budget pressure is the only cross-plugin control flow Lich exercises.

Event-Bus Contract

Publishes:

Event Trigger Payload Consumer examples
lich.review.completed End of PostToolUse Lich review {pr_id, verdict, engines{M1..M7}, rubric_scores, kappa} Sylph pre-commit gate, Pech cost attribution, audit dashboard
lich.rule.disabled Developer runs /lich-disable <rule> {developer_id, rule_id, ts, expiry_ts} Preference-accumulator archival; quarterly re-prompt scheduler
lich.sandbox.failed M5 sandbox errored (not a bug — infra failure) {file, function, error_class, retry_budget} Observability; never inflates a verdict

Subscribes:

Event Source Effect
crow.change.classified Crow Triggers Lich review on hunks above a configurable trust threshold
hydra.vuln.detected Hydra Boosts M6 attention weight for the affected file; adds security context note to M7 input
pech.budget.threshold.crossed Pech Downshifts M7 judge from Sonnet → Haiku at 80% budget
emu.runway.threshold.crossed Emu Pauses M5 sandbox runs (CPU budget conservation) until runway recovers

Layer 10: Sub-Plugin Breakdown & Developer Query Path

Prior art. Emu-style marketplace: one sub-plugin per engine OR one per orthogonal concern (language adapter, cross-plugin router, developer-UX surface). Hydra ships 5 sub-plugins (secret-scanner, vuln-detector, action-guard, config-shield, audit-trail); Sylph ships 5 similarly organized. The meta full plugin pulls all siblings via dependency resolution.

Options comparison.

Sub-plugin count Cognitive load Meta-plugin complexity Pitfall
3 (core + sandbox + verdict) Low Simple Buries language adapters inside core; user can't disable one
5 (core + sandbox + preference + rubric + verdict) Medium Simple Language adapters still bundled
7 (above + lich-python + lich-typescript) Medium-high Simple Adapters are the right surface but add install ceremony
10+ High Complex Over-sliced; developers confused

Recommendation. 7 sub-plugins at launch + full meta:

Sub-plugin Owns Category
lich-core M1 Cousot Interval Propagation + M2 Falleri Structural Diff + verdict synthesizer analyzer
lich-sandbox M5 Bounded Subprocess Dry-Run (Unix-only at launch) analyzer
lich-preference M6 Bayesian Preference Accumulation learning
lich-rubric M7 Zheng Pairwise Rubric Judgment judgment
lich-python Python language adapter (ruff rule-ID mapping, Python-specific idioms) adapter
lich-typescript TypeScript language adapter (biome rule-ID mapping, TS-specific idioms) adapter
lich-verdict Cross-sub-plugin verdict composition + event emission router
full Meta-plugin pulling all 7 via dependency resolution meta

Developer surfaces Day 1:

  • Skill: /lich-review <hunk|file|PR> — on-demand deep review
  • Skill: /lich-explain <finding_id> — walks through why M1/M5/M7 flagged something
  • Skill: /lich-disable <rule_id> — permanent suppression with auto-reprompt quarterly
  • Hook (PostToolUse, Write|Edit): passive review on every change; silent on DEPLOY, loud on HOLD/FAIL
  • Status-line badge: "M: 2 HOLDs, 1 FAIL" — ambient awareness
  • PDF report: dark-themed, per-session, showing M1 findings heatmap + M7 radar chart + M6 posterior histogram — post-session audit

Deferred: web dashboard (Phase 4 with enchanted-mcp), IDE hover-tips (VSCode extension, Phase 3), Slack bot (Phase 4).

Pitfall. Every sub-plugin is a maintenance surface. 7 is the upper bound of what a two-week MVP can credibly scaffold and keep healthy. If any sub-plugin ships without tests or a clear owner-module, it drags the whole plugin's reliability. Plugin-brand architect overrode preference expert's push for 9 sub-plugins (adding separate rubric-judge and rubric-aggregator); collapsed into lich-rubric to stay within scope.


Recommended Full Stack Summary

Layer Choice Why Pitfall avoided
Language substrate — Python stdlib ast Zero-dep; canonical N/A
Language substrate — TypeScript tsc --generateTrace subprocess, graceful fallback Compiler-grade precision when available, honest skip when not Hidden silent failure
M1 abstract domains Intervals + nullability + container-shape Catches 4 of the 10 runtime-failure classes statically Over-engineering heap analysis for MVP
M1 widening Threshold widening, N=3, language-aware thresholds Guarantees termination without collapsing to ⊤ Infinite loop on user's loop
M2 GumTree defaults Conservative (3, 0.6, 0.7) + 2s budget Reviewer attention is the scarce resource Noise from leaf-level matches
M5 sandbox resource.setrlimit + signal.alarm + env scrub Stdlib-only, catches 80% of runtime failures ACE risk from uncapped execution
M5 platform Unix-only at launch; Windows skips with honest note resource module absent on Windows Silent pretense that M5 ran
M5 input synthesis Boundary values from M1 flags; Hypothesis upgrade if installed Zero-dep MVP with graceful upgrade Over-commit to Hypothesis as hard dep
M6 algorithm Beta-Binomial Thompson sampling with 5% floor Bayesian, uncertainty-aware, matches Gauss Accumulation brand Rule death from single rejection
M7 rubric 5 orthogonal axes, position-swap + Cohen's Kappa Honest-numbers on subjective scoring Hidden judge disagreement
M7 judge tiers Sonnet default, Haiku under Pech budget, Opus adjudication Cost contract with Pech Uncontrolled Opus spend
Verdict DEPLOY/HOLD/FAIL with hard floors + composite Interpretable + catches both kinds of bugs Weighted-score vibes
Hydra integration File-read of vuln-detector/state/audit.jsonl (P1), event (P2) Non-duplication of R3 Double-reporting CWEs
Crow integration Subscribe to crow.change.classified Consumer, not re-classifier Peer-level drift
Sub-plugin count 7 + full meta Emu-style sliceable granularity Over-slicing

Plugin Package Layout

lich/
├── .claude-plugin/
│   └── marketplace.json
├── CLAUDE.md
├── CONTRIBUTING.md
├── README.md
├── install.sh
├── LICENSE
├── shared/
│   ├── conduct/              (10 behavioral modules — unchanged from schematic)
│   ├── constants.sh
│   ├── metrics.sh
│   ├── sanitize.sh
│   └── scripts/
├── plugins/
│   ├── lich-core/
│   │   ├── .claude-plugin/plugin.json
│   │   ├── agents/lich-analyzer.md
│   │   ├── commands/lich-review.md
│   │   ├── hooks/hooks.json
│   │   ├── skills/lich-review/SKILL.md
│   │   ├── state/{learnings.json,precedent-log.md}
│   │   └── README.md
│   ├── lich-sandbox/
│   │   ├── .claude-plugin/plugin.json
│   │   ├── agents/lich-sandbox-runner.md
│   │   ├── hooks/hooks.json
│   │   ├── skills/lich-sandbox/SKILL.md
│   │   ├── state/
│   │   └── README.md
│   ├── lich-preference/
│   │   ├── .claude-plugin/plugin.json
│   │   ├── agents/lich-preference-updater.md
│   │   ├── hooks/hooks.json
│   │   ├── skills/lich-disable/SKILL.md
│   │   ├── state/{learnings.json,overrides.json}
│   │   └── README.md
│   ├── lich-rubric/
│   │   ├── .claude-plugin/plugin.json
│   │   ├── agents/lich-judge.md
│   │   ├── config/rubric-v1.json
│   │   ├── hooks/hooks.json
│   │   ├── skills/lich-explain/SKILL.md
│   │   ├── state/kappa-log.jsonl
│   │   └── README.md
│   ├── lich-python/
│   │   ├── .claude-plugin/plugin.json
│   │   ├── hooks/hooks.json
│   │   ├── skills/lich-python/SKILL.md
│   │   ├── config/ruff-rule-map.json
│   │   └── README.md
│   ├── lich-typescript/
│   │   ├── .claude-plugin/plugin.json
│   │   ├── hooks/hooks.json
│   │   ├── skills/lich-typescript/SKILL.md
│   │   ├── config/biome-rule-map.json
│   │   └── README.md
│   ├── lich-verdict/
│   │   ├── .claude-plugin/plugin.json
│   │   ├── agents/lich-verdict-synthesizer.md
│   │   ├── hooks/hooks.json
│   │   ├── state/verdict.jsonl
│   │   └── README.md
│   └── full/
│       ├── .claude-plugin/plugin.json
│       └── README.md
├── docs/
│   ├── architecture/
│   │   ├── generate.py
│   │   ├── lich-architecture.md    ← this document
│   │   ├── highlevel.mmd
│   │   ├── dataflow.mmd
│   │   ├── lifecycle.mmd
│   │   ├── hooks.mmd
│   │   └── index.html
│   └── brand-guide.md
└── tests/
    └── run-all.sh

Lich Named Engines

ID Name Sub-plugin Algorithm Source
M1 Cousot Interval Propagation lich-core Abstract interpretation over interval + nullability + container-shape lattices with threshold widening Cousot & Cousot POPL'77
M2 Falleri Structural Diff lich-core GumTree two-phase AST matching (top-down hash + bottom-up Dice) Falleri et al. ASE'14
M3 Yamaguchi Property-Graph Traversal (Phase 2) Code Property Graph over unified AST+CFG+PDG with Gremlin-like queries Yamaguchi et al. S&P'14 (Joern)
M4 Type-Reflected Invariant Synthesis (Phase 2) Hypothesis-ghostwriter-style synthesis from inspect.signature + typing.get_type_hints Hypothesis project
M5 Bounded Subprocess Dry-Run lich-sandbox Stdlib resource.setrlimit + signal.alarm + subprocess sandbox Python stdlib; novel composition for code review
M6 Bayesian Preference Accumulation lich-preference Beta-Binomial Thompson sampling per (developer, rule) with 5% minimum sampling floor Thompson 1933; Russo & Van Roy 2018
M7 Zheng Pairwise Rubric Judgment lich-rubric 5-axis rubric + position-swap debiasing + Cohen's Kappa inter-judge reliability Zheng et al. MT-Bench 2023

Event-Bus Contract

Publishes:

Event Trigger Payload shape Consumer examples
lich.review.completed End of Lich review pass {pr_id, verdict, engines, rubric_scores, kappa} Sylph pre-commit gate; Pech attribution
lich.rule.disabled Developer /lich-disable {developer_id, rule_id, expiry_ts} Preference archival; re-prompt scheduler
lich.sandbox.failed M5 infra failure (not a finding) {file, function, error_class} Observability

Subscribes:

Event Source Effect on Lich
crow.change.classified Crow Trigger review on affected hunks
hydra.vuln.detected Hydra Boost M6 attention weight; annotate M7 input
pech.budget.threshold.crossed Pech Downshift M7 judge (Sonnet → Haiku) at 80%
emu.runway.threshold.crossed Emu Pause M5 sandbox runs until runway recovers

Runtime-Failure Coverage Matrix

Columns: M1 (Cousot Interval Propagation), M5 (Bounded Subprocess Dry-Run), M6 (Bayesian prioritization — ranks which findings reviewer sees first, not a detector), M7 (Zheng Pairwise Rubric — style/clarity signals).

Failure class M1 M5 M6 M7 Notes
Divide-by-zero ranks Canonical case; flagged statically, confirmed dynamically
Null / None deref ranks Via nullability domain
Array out-of-bounds partial ranks M1 weak on unknown-length inputs; M5 catches
Integer overflow / underflow partial ranks M1 with sys.maxsize threshold; M5 confirms on Python (rarer than C)
Unhandled exception propagation partial ranks M5 catches via stderr inspection
Race conditions / deadlocks ranks partial Out of scope for MVP; M7 may note concurrency concerns
Infinite loops / unbounded recursion partial ranks M5's signal.alarm catches
Memory leaks / use-after-free (native) ranks Requires Phase 2 M3 + Phase 3 CrossHair
Resource leaks (file handles, DB conns) partial ranks M1 flags unclosed open(); M5 confirms via FD count; M7 rubric notes
Time-of-check-to-time-of-use (TOCTOU) ranks partial Out of MVP; Hydra's R3 catches security-flavored TOCTOU; Lich covers dataflow variant in Phase 2

(Note: CWE-tagged security sinks — SQL injection, XSS, SSRF, etc. — are explicitly Hydra's R3 lane, not Lich's. The row is omitted here because it's not a Lich responsibility.)

Language Adapter Contract

Adapter Top linter mapped Rule count at launch Universal rules (from common core) Language-specific rules
lich-python ruff (astral-sh) ~120 mapped / ~900 available 40 (dead code, complexity, naming, duplicates) 80 Python-only (pyupgrade idioms, list/dict-comp, async-await patterns, typing modernization)
lich-typescript biome (biomejs) ~80 mapped / ~423 available 35 (same universal core) 45 TS-only (hooks exhaustive deps, JSX a11y, any-avoidance, narrow-type-guards)
(Phase 2) lich-rust clippy ~80 mapped / ~550 available 35 (same core) 45 Rust-only (borrow idioms, lifetime placement, clone avoidance)
(Phase 2) lich-go golangci-lint ~40 mapped / ~100 aggregator 30 (same core) 10 Go-only (err-check-pattern, context-propagation)
(Phase 2) lich-java Checkstyle + PMD ~80 mapped / ~600 total 40 40 Java-only (Effective Java items)

Preference Posterior Schema

{
  "schema_version": "1.0",
  "developer_id": "git-email-sha256-truncated-to-12",
  "repo_id": "path-sha256-truncated-to-12",
  "last_update": "2026-04-19T12:34:56Z",
  "rules": {
    "lich-python:unused-import": {
      "alpha": 3,
      "beta": 7,
      "surface_count": 10,
      "accept_count": 2,
      "reject_count": 6,
      "override_count": 2,
      "last_update": "2026-04-19T12:34:56Z"
    },
    "lich-core:M1-division-by-zero": {
      "alpha": 8,
      "beta": 2,
      "surface_count": 10,
      "accept_count": 7,
      "reject_count": 1,
      "override_count": 2,
      "last_update": "2026-04-19T12:34:56Z"
    }
  },
  "cohort_prior": {
    "source": "repo-cohort",
    "version": "phase-2",
    "_note": "Phase 2 — populated by Cohort Similarity Borrowing"
  }
}

Rubric Schema

{
  "rubric_version": "1.0",
  "axes": [
    {"id": "clarity", "scale": 5, "definition": "Can a reviewer summarize this function's purpose in one sentence after 30 seconds?"},
    {"id": "correctness_at_glance", "scale": 5, "definition": "Are guards (null checks, boundary checks, error handling) visible without tracing?"},
    {"id": "idiom_fit", "scale": 5, "definition": "Does the code match this language's and this repo's conventions?"},
    {"id": "testability", "scale": 5, "definition": "Are there seams for unit tests without mocking the universe?"},
    {"id": "simplicity", "scale": 5, "definition": "Is the solution the simplest that solves the stated problem?"}
  ],
  "scoring_scale": {"min": 1, "max": 5, "half_points": false},
  "position_swap": {"enabled": true, "runs": 2, "escalate_if_delta_gte": 1.5},
  "kappa": {"report_per_axis": true, "unstable_threshold": 0.4, "action_on_unstable": "flag_in_pdf"},
  "judge_model_tiers": {
    "default": "claude-sonnet-4-6",
    "under_budget_pressure": "claude-haiku-4-5-20251001",
    "adjudication": "claude-opus-4-7"
  },
  "prompt_template_ref": "plugins/lich-rubric/skills/lich-explain/rubric-prompt.md"
}

Verdict Contract

Verdict M1 condition M5 condition M6 condition M7 condition Action
DEPLOY All flagged sites severity < HIGH No confirmed runtime failure ≥ 80% surfaced findings posterior mean > 0.5 All 5 axes ≥ 3.5/5 AND Kappa ≥ 0.4 Silent pass; write to state/verdict.jsonl
HOLD 1-2 HIGH flags, no CRITICAL Any timeout-without-confirmation ≥ 50% surfaced posterior > 0.3 Any axis < 3.5 OR Kappa < 0.4 Surface to reviewer; Sylph warns
FAIL Any CRITICAL OR ≥ 3 HIGH flags Any confirmed runtime failure (N/A — posterior doesn't failure-downgrade) Any axis ≤ 2 OR > 2 axes < 3 Block; Sylph refuses auto-commit

MVP vs. Full Build

Phase 1 — 2-week MVP.

  • Engines: M1 Cousot Interval Propagation + M2 Falleri Structural Diff + M5 Bounded Subprocess Dry-Run + M6 Bayesian Preference Accumulation + M7 Zheng Pairwise Rubric Judgment.
  • Languages: lich-python + lich-typescript only.
  • Platform: Unix-only for M5; Windows skips M5 with honest note.
  • Integration: File-based reads from Hydra/Crow audit.jsonl; no MCP yet.
  • Surfaces: /lich-review, /lich-explain, /lich-disable, PostToolUse hook, status-line badge, PDF report.
  • Exit criteria: all 7 sub-plugins installable via full meta, pass smoke test tests/run-all.sh, architecture doc shipped.

Phase 2 — 2-3 month full build.

  • Engines added: M3 Yamaguchi Property-Graph Traversal (Joern-style CPG substrate), M4 Type-Reflected Invariant Synthesis (Hypothesis-ghostwriter upgrade to M5 input synthesis), Schleimer Winnowing Clone Detection (code duplicate detection), O'Hearn Separation-Logic Bi-Abduction (Java/C++/ObjC resource-ownership), Cohort Similarity Borrowing (M6 cold-start via cohort priors).
  • Languages added: lich-rust + lich-go + lich-java + lich-kotlin.
  • Platform: M5 Windows support via Job Objects backend.
  • Integration: Migrate file-reads to MCP event bus (crow.change.classified, hydra.vuln.detected, pech.budget.threshold.crossed).
  • Additional surfaces: VSCode extension hover-tips; Slack bot on PR events.

Draft CLAUDE.md

Fills schematic's 8-section canonical shape. See CLAUDE.md for the rendered file; this section summarizes the fills.

  • Shared behavioral modules — unchanged (@shared/vis/conduct/*.md references, vendored from vis).
  • Lifecycle — hybrid trigger. PostToolUse hook (Write|Edit|MultiEdit) drives lich-core, lich-sandbox, lich-preference passes. SessionStart hook (rare — only when config/rubric-v1.json needs refresh). Skill commands (/lich-review, /lich-explain, /lich-disable).
  • Algorithms — M1–M7 named engines; M1+M2 in lich-core, M5 in lich-sandbox, M6 in lich-preference, M7 in lich-rubric. Defining engine: M5 Bounded Subprocess Dry-Run (the novel pipeline moat).
  • Behavioral contracts:
    1. [H] IMPORTANT — Lich never re-scans CWE-tagged security findings. If Hydra's audit.jsonl has a finding on the file, Lich boosts attention weight but does not re-classify. This is the non-duplication contract with Hydra R3.
    2. [H] YOU MUST NOT relax M5 sandbox caps. RLIMIT_CPU=5, RLIMIT_AS=512MB, RLIMIT_NOFILE=16, signal.alarm=10s are load-bearing — relaxing any cap is an ACE risk. Requires documented security review.
    3. [A] YOU MUST report Cohen's Kappa alongside M7 scores. Never average two judges silently when they disagree beyond the per-axis threshold. Honest-numbers contract.
  • Verdict bar — DEPLOY / HOLD / FAIL thresholds per Layer 8's table (plugin-specific section).
  • State paths — runtime (gitignored): plugins/lich-core/state/, plugins/lich-sandbox/state/run-log.jsonl, plugins/lich-preference/state/{learnings.json,overrides.json}, plugins/lich-rubric/state/kappa-log.jsonl, plugins/lich-verdict/state/verdict.jsonl. Ship-time config (committed): plugins/lich-rubric/config/rubric-v1.json, plugins/lich-python/config/ruff-rule-map.json, plugins/lich-typescript/config/biome-rule-map.json.
  • Agent tiers — Opus = M7 disagreement adjudication + cross-engine verdict synthesis; Sonnet = M7 default judge + lich-core analyzer loops; Haiku = M7 budget fallback + rubric-schema freshness audit + M2 structural summarization.
  • Anti-patterns — duplicating Hydra R3; silent M5 skip on Windows; rule-death from single rejection (M6 floor violation); bare M7 score without Kappa; unbounded sandbox.

Handoff to Schematic

Architecture decision Schematic placeholder Fill value
Plugin slug {{PLUGIN_SLUG}} lich
Display name {{PluginName}} Lich
Tagline {{PLUGIN_TAGLINE}} Code review for AI-assisted development — catches runtime failures, learns your preferences, judges quality honestly.
One-line purpose {{PLUGIN_ONE_LINE_PURPOSE}} Answers "Is this code good?" via static suspicion + sandboxed confirmation + Bayesian preference learning + LLM rubric judgment.
Game origin {{PLUGIN_GAME_ORIGIN}} Hollow Knight — Lich Lords (gate-reviewers)
5-questions slot {{PLUGIN_QUESTION}} Is this code good?
Roadmap phase {{PHASE_NUMBER}} 3
Plugin index {{PLUGIN_INDEX}} 6
Engine prefix {{ENGINE_PREFIX}} M
Engine count (MVP) {{ENGINE_COUNT}} 5 (MVP); 7 (full)
Defining engine {{DEFINING_ENGINE_ID}} M5 (Bounded Subprocess Dry-Run — the novel pipeline)
First sub-plugin {{SUB_PLUGIN_1_NAME}} lich-core
Sub-plugin count {{SUB_PLUGIN_COUNT}} 7 + full meta
Trigger model {{TRIGGER_MODEL}} hybrid (PostToolUse hook + skill-invoked)
Events published {{EVENT_PUBLISH_LIST}} lich.review.completed, lich.rule.disabled, lich.sandbox.failed
Events subscribed {{EVENT_SUBSCRIBE_LIST}} crow.change.classified, hydra.vuln.detected, pech.budget.threshold.crossed, emu.runway.threshold.crossed
Repo URL {{REPO_URL}} https://github.com/enchanter-ai/lich
Plugin home dir {{PLUGIN_HOME_DIR}} ~/.claude/plugins/lich

Placeholder gaps (new tokens to propose adding to schematic): none. The architecture's 7-sub-plugin breakdown and 5-axis rubric fit within schematic's existing token set — though the {{SUB_PLUGIN_2_*}} through {{SUB_PLUGIN_7_*}} tokens aren't defined in schematic yet (schematic currently only enumerates sub-plugin 1 placeholders, per pech-architecture's same observation). Recommendation: extend schematic's vocabulary to {{SUB_PLUGIN_N_*}} with a max N of 8.


Generated 2026-04-19. Source prompt: wixie/prompts/lich-architecture/prompt.xml v1. Review workflow: /test-prompt/converge → dispatch. Next step after this document: execute /create pass to fill remaining schematic placeholders into working plugin code.