Motivation
I tested Stokowski's fully automated pipeline (investigate → implement → code-review(fresh) → done) in a real project and compared it against Overstory's multi-worker tournament model. The conclusion: Stokowski's adversarial code-review mechanism (using a `session: fresh` independent agent to review the previous stage's code) produces production-quality code, at roughly half the token cost of Overstory.
Currently Stokowski supports `claude` and `codex` as runners, and the author suggests in `workflow.example.yaml`: "set runner: codex here to get a second-opinion from a different provider." This design philosophy is excellent: different LLMs write and review the code, cancelling out each other's biases.
However, the open-source community has other capable coding agent CLIs:
| Agent | Non-interactive mode | JSON output | Session resume | Worktree |
|---|---|---|---|---|
| Pi | `pi -p "..."` | ✅ `--mode json` (JSONL) | ❌ | ❌ |
| Hermes | `hermes chat -q "..."` | ❌ (quiet text) | ✅ `--resume` | ✅ `--worktree` |
| OpenCode | via plugin | ❌ | ❌ | ❌ |
All of these have the core capability a Stokowski runner needs: accept a prompt → execute in a workspace → return results.
Proposal
- Abstract the runner interface: unify `build_claude_args` / `build_codex_args` into a runner protocol, so new runners only need to implement `build_args()` + `parse_output()`
- Add Pi runner: Pi's `--mode json` output format is similar to Claude Code's `stream-json` (JSONL event stream), so it has the lowest adaptation cost
- Add Hermes runner: Hermes has `--resume` and `--worktree`, closest in capability to Claude Code
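To make the proposal concrete, here is a minimal sketch of what such a runner protocol might look like. It is not Stokowski's actual code: `Runner`, `PiRunner`, `HermesRunner`, and `parse_jsonl` are hypothetical names, the CLI flags are only the ones listed in the table, combining `-p` with `--mode json` in one call is an assumption, and the `type`/`text` event keys are placeholders because the real event schemas are not described here.

```python
import json
from dataclasses import dataclass
from typing import Protocol


class Runner(Protocol):
    """Hypothetical runner protocol; not Stokowski's actual interface."""

    name: str

    def build_args(self, prompt: str) -> list[str]:
        """Return the argv used to launch the agent CLI non-interactively."""
        ...

    def parse_output(self, stdout: str) -> list[dict]:
        """Turn raw CLI output into a list of event dicts."""
        ...


def parse_jsonl(stdout: str) -> list[dict]:
    """Parse a JSONL stream, skipping blank lines and lines that are not JSON objects."""
    events = []
    for line in stdout.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray log lines mixed into the stream
        if isinstance(obj, dict):
            events.append(obj)
    return events


@dataclass
class PiRunner:
    name: str = "pi"

    def build_args(self, prompt: str) -> list[str]:
        # Assumption: -p and --mode json can be combined in a single invocation.
        return ["pi", "-p", prompt, "--mode", "json"]

    def parse_output(self, stdout: str) -> list[dict]:
        return parse_jsonl(stdout)


@dataclass
class HermesRunner:
    name: str = "hermes"

    def build_args(self, prompt: str) -> list[str]:
        return ["hermes", "chat", "-q", prompt]

    def parse_output(self, stdout: str) -> list[dict]:
        # Hermes prints quiet text, so wrap it in one synthetic event
        # (the "type"/"text" keys are placeholders, not a real schema).
        return [{"type": "text", "text": stdout.strip()}]
```

With this shape, the orchestrator would only ever call `build_args()` and `parse_output()`, so adding a new agent would mean adding one small class rather than touching the dispatch logic.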
Test Data
I ran a real-world task (Python URL shortener with FastAPI + SQLite + pytest) comparing two approaches:
Stokowski fully automated pipeline (claude implement + code-review fresh):
- Total: 11,024 tokens, ~4 minutes
- Output: 6 files, 11 tests, including URL validation, collision retry, parameterized db_path
- The code-review agent automatically fixed security and robustness issues
Overstory multi-worker tournament (lead + 2 builders):
- Total: ~24,000+ tokens, ~6 minutes
- Output: 5 files per builder, 5 tests each
- Lead's merged version was decent, but at roughly 2x the token cost of the Stokowski run
Conclusion: in this test, Stokowski's investigate → implement → adversarial review pipeline was the more cost-effective fully automated approach. The current claude + codex dual-runner design already demonstrates the value of adversarial review.
If every stage could use a different vendor's agent (for example, Pi (Gemini) for investigate, Hermes (DeepSeek) for implement, Codex (GPT) for code-review), the adversarial effect could be even stronger, because it would draw on each model's strengths: some excel at analysis, some at generation, some at finding bugs. All of this works via pure CLI invocation without Claude Code hooks, which makes deployment simpler.
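As a rough illustration of the per-stage idea, the sketch below maps each stage to a runner name and checks that the review stage uses a different runner than the implementing stage. The dictionary layout and the `is_cross_vendor_review` helper are hypothetical; they do not reflect Stokowski's actual workflow configuration format.

```python
# Hypothetical stage-to-runner assignment mirroring the example in the text.
STAGE_RUNNERS = {
    "investigate": "pi",
    "implement": "hermes",
    "code-review": "codex",
}


def is_cross_vendor_review(stage_runners: dict[str, str],
                           author_stage: str = "implement",
                           review_stage: str = "code-review") -> bool:
    """Return True when the reviewing stage uses a different runner than the authoring stage."""
    return stage_runners[author_stage] != stage_runners[review_stage]
```

A check like this could run when the workflow is loaded, so that a configuration where the same runner both writes and reviews the code is at least flagged.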
Additional: Local tracker experiment
I also adapted a local_tracker.py (replacing Linear as the task source). In simple testing, I found that Stokowski's state machine engine can run independently of Linear, with behavior consistent with the Linear mode. I plan to eventually modify it to use my DataScript database via HTTP for state management, but I would love to see the project ship a "planning with file" mode — allowing users who don't use Linear to drive Stokowski's fully automated pipeline through local files.
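To show what a file-driven mode could look like, here is a minimal sketch of a tracker backed by a single JSON file. It is not the `local_tracker.py` mentioned above: the file layout (`id`, `title`, `state` keys), the function names, and the linear stage order are all assumptions made for illustration.

```python
import json
from pathlib import Path

# Stage order taken from the pipeline described in the Motivation section.
PIPELINE = ["investigate", "implement", "code-review", "done"]


def load_tasks(path: Path) -> list[dict]:
    """Read tasks from a JSON file shaped like [{"id": ..., "title": ..., "state": ...}]."""
    return json.loads(path.read_text(encoding="utf-8"))


def save_tasks(path: Path, tasks: list[dict]) -> None:
    """Write the full task list back to the JSON file."""
    path.write_text(json.dumps(tasks, indent=2), encoding="utf-8")


def advance(path: Path, task_id: str) -> str:
    """Move one task to the next pipeline stage, persist it, and return the new state."""
    tasks = load_tasks(path)
    for task in tasks:
        if task["id"] == task_id:
            idx = PIPELINE.index(task["state"])
            # Tasks already in the final stage stay there.
            task["state"] = PIPELINE[min(idx + 1, len(PIPELINE) - 1)]
            save_tasks(path, tasks)
            return task["state"]
    raise KeyError(f"unknown task id: {task_id}")
```

A real implementation would also need to handle rejected reviews (moving a task back to an earlier stage) and concurrent writers; this sketch only covers the forward path.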
I'm happy to contribute a PR for the runner interface abstraction and/or the Pi runner if there's interest.