TKO should reach a state where the repository is essentially maintained by AI agents. Not as a novelty, but because the infrastructure (tests, CI, docs, verified behaviors) is robust enough that agents can work autonomously and humans can trust the output.
The term "dark factory" comes from manufacturing: a facility that runs without lights because the robots don't need to see. In software, it means agents handle the implementation (writing code, fixing bugs, updating dependencies, writing docs) while humans focus on direction, design, and validation.
Simon Willison describes five levels of AI-assisted programming, from "spicy autocomplete" (Level 0) to the fully autonomous "dark software factory" (Level 5). StrongDM's AI team operates at Level 5: "Code must not be written by humans. Code must not be reviewed by humans." Engineers design specs, curate test scenarios, and watch scores. Agents do everything else.
Willison's key observation: engineers shift from building code to building the systems that build the code. The critical unsolved question is how agents prove their code works without human review. The answer, for StrongDM and for TKO, is tests: not as a checkbox, but as the primary artifact that defines correctness.
TKO is currently between Level 3 and Level 4. Agents (Claude Code, Copilot) do most of the implementation. Humans review PRs, set priorities, and make architectural decisions. The goal is to push toward Level 5 where practical, while being honest about where human judgment is still required.
- The Five Levels: from Spicy Autocomplete to the Dark Factory (Simon Willison)
- StrongDM's Software Factory (Willison's writeup)
- An AI State of the Union (Willison on the inflection point)
- Built by Agents, Tested by Agents, Trusted by Whom? (Stanford CodeX)
The tooling modernization (Phases 1-6) was the foundation:
- Verified behaviors: `verified-behaviors.json` files are test-backed contracts that AI agents can check their work against
- SOUL.md: the philosophical foundation of Knockout, so agents understand why the framework works the way it does
- AGENTS.md: instructions that any AI coding tool can follow
- llms.txt: concise project context for LLM consumption
- `bun run verify`: single command to confirm nothing is broken (biome + tsc + build + verify:esm + vitest)
- `bun run knip`: detect dead code and unused deps
- Changesets: structured release management
- CI safety net: lint, typecheck, test, and ESM verification on every PR
- Branch protection: all changes go through PRs
- plans/: documented intent so agents understand context
| Gap | What's needed |
|---|---|
| Dependency updates | Renovate/Dependabot with a 48h `minimumReleaseAge` |
| Bundle size tracking | CI check comparing `browser.min.js` against `main` |
| Benchmarks | `vitest bench` for observable/computed hot paths |
| Coverage confidence | Know which code paths are covered, and which aren't |
| Autonomous PR review | AI reviewer that checks against verified behaviors |
| Release automation | Tag + release triggered by changeset merge, not manual |
| Copilot/Cursor support | `.github/copilot-instructions.md` extending AGENTS.md |
| Issue triage | AI can read an issue, reproduce it, propose a fix |
| Scenario testing | StrongDM-style holdout scenarios for end-to-end validation |
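For the dependency-update row, a minimal `renovate.json` sketch could look like the following; `minimumReleaseAge` and `prConcurrentLimit` are real Renovate options, but the specific values are assumptions matching the 48h window in the table:

```json
{
  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
  "extends": ["config:recommended"],
  "minimumReleaseAge": "2 days",
  "prConcurrentLimit": 3
}
```

The 48-hour delay is a supply-chain guard: a freshly published malicious version is usually yanked within that window, so waiting costs little and avoids the worst case.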
- Tests are the source of truth. If it's not tested, it doesn't exist. AI agents should never make changes they can't verify. Tests are not a checkbox; they are the primary artifact that defines correctness.
- Safety by default. The CI pipeline should catch any regression an AI introduces. `bun run verify` must pass before any commit.
- Intent over implementation. Plans and SOUL.md describe why things work the way they do. Code describes what. AI can change the what if it understands the why.
- Small, reviewable changes. One concern per PR. A human should be able to review any AI-generated PR in under 5 minutes.
- No magic. Every tool, script, and CI step should be understandable by reading the code. No hidden state, no implicit dependencies.
- Earn trust incrementally. Start with low-risk automation (deps, docs, formatting). Expand to bug fixes and features as confidence grows. The level of autonomy an agent gets should match the level of safety net around it.
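One way to wire up the single `bun run verify` gate is a composite script in `package.json`; the script names below are assumptions (the repo's actual names and the `scripts/verify-esm.mjs` helper may differ), but the shape matches the checks listed earlier (biome + tsc + build + verify:esm + vitest):

```json
{
  "scripts": {
    "verify": "bun run lint && bun run typecheck && bun run build && bun run verify:esm && bun run test",
    "lint": "biome check .",
    "typecheck": "tsc --noEmit",
    "verify:esm": "node scripts/verify-esm.mjs",
    "test": "vitest run"
  }
}
```

Chaining with `&&` means the pipeline fails fast at the first broken step, which is exactly the behavior an agent needs before committing.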