
feat: pluggable runner interface — support Pi, Hermes, and other coding agents #18

@WeiYiAcc

Motivation

I tested Stokowski's fully automated pipeline (investigate → implement → code-review(fresh) → done) in a real project and compared it against Overstory's multi-worker tournament model. The conclusion: Stokowski's adversarial code-review mechanism (using a session: fresh independent agent to review the previous stage's code) produces production-quality code, at roughly half the token cost of Overstory.

Currently Stokowski supports claude and codex as runners, and the author suggests in workflow.example.yaml: "set runner: codex here to get a second-opinion from a different provider." This design philosophy is excellent — different LLMs writing and reviewing code, cancelling out each other's biases.

However, the open-source community has other capable coding agent CLIs:

| Agent | Non-interactive mode | JSON output | Session resume | Worktree |
|---|---|---|---|---|
| Pi | pi -p "..." | --mode json (JSONL) | | |
| Hermes | hermes chat -q "..." | ❌ (quiet text) | --resume | --worktree |
| OpenCode | via plugin | | | |

All of these have the core capability a Stokowski runner needs: accept a prompt → execute in a workspace → return results.

Proposal

  1. Abstract the runner interface: unify build_claude_args / build_codex_args into a runner protocol, so new runners only need to implement build_args() + parse_output()
  2. Add Pi runner: Pi's --mode json output format is similar to Claude Code's stream-json (JSONL event stream) — lowest adaptation cost
  3. Add Hermes runner: Hermes has --resume and --worktree, closest in capability to Claude Code
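To make the proposal concrete, here is a minimal sketch of what such a runner protocol could look like. Everything in it is an assumption for illustration: `Runner`, `RunResult`, `PiRunner`, `HermesRunner`, and `RUNNERS` are hypothetical names, not existing Stokowski code. The command lines use only the flags quoted in this issue, and the `"text"` key read from Pi's JSONL events is a placeholder, since the real event schema has not been checked here.

```python
import json
from dataclasses import dataclass
from typing import Protocol


@dataclass
class RunResult:
    """Normalized result every runner returns (hypothetical shape)."""
    text: str
    events: list[dict]


class Runner(Protocol):
    """What a new runner would have to implement under this proposal."""
    name: str

    def build_args(self, prompt: str) -> list[str]: ...
    def parse_output(self, stdout: str) -> RunResult: ...


class PiRunner:
    name = "pi"

    def build_args(self, prompt: str) -> list[str]:
        # Flags as quoted in this issue: pi -p "..." --mode json
        return ["pi", "-p", prompt, "--mode", "json"]

    def parse_output(self, stdout: str) -> RunResult:
        events = []
        for line in stdout.splitlines():
            line = line.strip()
            if not line:
                continue
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip non-JSON noise instead of failing the stage
            if isinstance(event, dict):
                events.append(event)
        # ASSUMPTION: message text lives under a "text" key; adjust to Pi's schema.
        text = "".join(e["text"] for e in events if isinstance(e.get("text"), str))
        return RunResult(text=text, events=events)


class HermesRunner:
    name = "hermes"

    def build_args(self, prompt: str) -> list[str]:
        # Flags as quoted in this issue: hermes chat -q "..."
        # --resume / --worktree are left out because their argument forms
        # are not described in this issue.
        return ["hermes", "chat", "-q", prompt]

    def parse_output(self, stdout: str) -> RunResult:
        # Quiet mode prints plain text, so there are no structured events.
        return RunResult(text=stdout.strip(), events=[])


# A registry keyed by runner name, so a stage's runner setting becomes a lookup.
RUNNERS: dict[str, Runner] = {r.name: r for r in (PiRunner(), HermesRunner())}
```

With this shape, the existing build_claude_args / build_codex_args logic could move into two more classes with the same two methods, and the orchestrator would only ever call `RUNNERS[name].build_args(...)` and `.parse_output(...)`.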

Test Data

I ran a real-world task (Python URL shortener with FastAPI + SQLite + pytest) comparing three approaches:

Stokowski fully automated pipeline (claude implement + code-review fresh):

  • Total: 11,024 tokens, ~4 minutes
  • Output: 6 files, 11 tests, including URL validation, collision retry, parameterized db_path
  • The code-review agent automatically fixed security and robustness issues

Overstory multi-worker tournament (lead + 2 builders):

  • Total: ~24,000+ tokens, ~6 minutes
  • Output: 5 files per builder, 5 tests each
  • Lead's merged version was decent but at 2x the token cost
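As a quick check on the "half the token cost" and "2x" figures, the ratio can be computed directly from the two totals reported above. The Overstory total is given as "~24,000+", so it is treated here as a lower bound.

```python
stokowski_tokens = 11_024
overstory_tokens = 24_000  # lower bound: reported as "~24,000+"

ratio = overstory_tokens / stokowski_tokens
share = stokowski_tokens / overstory_tokens

print(f"Overstory / Stokowski: at least {ratio:.2f}x")  # at least 2.18x
print(f"Stokowski / Overstory: at most {share:.0%}")    # at most 46%
```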

Conclusion: Stokowski's investigate → implement → adversarial review pipeline was the most cost-effective fully automated approach in this test. The current claude + codex dual-runner design already shows the value of adversarial review.

If every stage could use a different vendor's agent, for example Pi (Gemini) for investigate, Hermes (DeepSeek) for implement, and Codex (GPT) for code-review, the adversarial effect should be even stronger, because each model brings different strengths: some excel at analysis, some at generation, some at finding bugs. All of this works via pure CLI invocation without Claude Code hooks, which makes deployment simpler.

Additional: Local tracker experiment

I also adapted a local_tracker.py (replacing Linear as the task source). In simple testing, I found that Stokowski's state machine engine can run independently of Linear, with behavior consistent with the Linear mode. I plan to eventually modify it to use my DataScript database via HTTP for state management, but I would love to see the project ship a "planning with file" mode — allowing users who don't use Linear to drive Stokowski's fully automated pipeline through local files.
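To illustrate what a file-driven mode could look like, here is a rough sketch of a file-backed task source. It is not my actual local_tracker.py: the class name `LocalFileTracker`, the JSON layout (a list of objects with `id`, `title`, and `state`), and the methods `pending()` and `advance()` are all assumptions made for this example. Only the stage order comes from the pipeline described above.

```python
import json
from pathlib import Path

# Stage order taken from the pipeline described in this issue.
STAGES = ["investigate", "implement", "code-review", "done"]


class LocalFileTracker:
    """Hypothetical file-backed task source standing in for Linear."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def _load(self) -> list[dict]:
        return json.loads(self.path.read_text(encoding="utf-8"))

    def _save(self, tasks: list[dict]) -> None:
        # Write to a temp file, then swap it in, so a crash mid-write
        # does not leave a half-written task file behind.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(tasks, indent=2), encoding="utf-8")
        tmp.replace(self.path)

    def pending(self) -> list[dict]:
        """Tasks the state machine still has to work on."""
        return [t for t in self._load() if t["state"] != "done"]

    def advance(self, task_id: str) -> str:
        """Move a task to the next stage and return its new state."""
        tasks = self._load()
        for task in tasks:
            if task["id"] == task_id:
                idx = STAGES.index(task["state"])
                task["state"] = STAGES[min(idx + 1, len(STAGES) - 1)]
                self._save(tasks)
                return task["state"]
        raise KeyError(task_id)
```

A user would edit the JSON file by hand to add tasks, and the orchestrator would poll `pending()` and call `advance()` when a stage finishes, in place of the Linear API calls.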


I'm happy to contribute a PR for the runner interface abstraction and/or the Pi runner if there's interest.
