feat(cli): add micasa eval fts subcommand #963

Merged
cpcloud merged 1 commit into micasa-dev:main from cpcloud:fts-eval-harness
Apr 23, 2026

@cpcloud commented Apr 20, 2026

Summary

Adds a chat-quality evaluation harness for the FTS-enrichment feature. No behavior change to the TUI chat pipeline — this PR only adds the eval surface and exports the prompt-building helpers it needs.

  • internal/ftseval/ package: typed Config, Question, ArmResult, RunResult, GradeResult. Run() drives each question through both FTS arms against a pre-built store, grades with a deterministic regex rubric plus an optional LLM judge, and returns per-question results.
  • SeedFixture populates projects, vendors, appliances, maintenance items, incidents, one service log, and one quote (with the "permit delays" long-tail vendor note).
  • Default question set: disambiguation, cross-entity joins, service-log lookup, FTS-neutral aggregate, basement incidents, nonexistent entity, long-tail note, brand filter.
  • Judge-score sentinel -1 when the judge didn't run; 0-5 when it did. Judge parser tolerates real-world model output: markdown decoration, :/= separators, mixed case, leading <think>/<thinking>/<reasoning> blocks, "Rationale" as an alias for "Reason". judge_reason surfaces in Notes when the score is the sentinel.
  • Table report (default on TTYs, via lipgloss), markdown (default when piping or writing to a file), and JSON. JSON redacts APIKey. Judge-score aggregates exclude sentinel rows.
  • --strict exits 1 if any question's rubric result regresses with FTS on, counting only questions completed on both arms.
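To make the judge-parsing and sentinel bullets above concrete, here is a minimal sketch of a tolerant parser and a sentinel-aware mean. It is not the code in `internal/ftseval/`: the names `parseJudge`, `meanJudge`, and `judgeNotRun` and the exact regular expressions are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// judgeNotRun is the sentinel from the description: no usable judge score.
const judgeNotRun = -1

var (
	// A leading reasoning block that some models emit before the verdict.
	thinkBlock = regexp.MustCompile(`(?is)^\s*<(think|thinking|reasoning)>.*?</(think|thinking|reasoning)>`)
	// Matches "Score: 4", "**score** = 4", "SCORE=4/5" and similar.
	scoreLine = regexp.MustCompile(`(?im)^[\s*_#>-]*score[\s*_]*[:=][\s*_]*([0-5])\b`)
	// Matches "Reason: ..." or "Rationale: ..." with the same decoration.
	reasonLine = regexp.MustCompile(`(?im)^[\s*_#>-]*(?:reason|rationale)[\s*_]*[:=][\s*_]*(.+)$`)
)

// parseJudge extracts a 0-5 score and a reason from raw judge output.
// It returns judgeNotRun when no in-range score line is found.
func parseJudge(raw string) (int, string) {
	body := thinkBlock.ReplaceAllString(raw, "")
	reason := ""
	if m := reasonLine.FindStringSubmatch(body); m != nil {
		reason = strings.Trim(m[1], " \t*_")
	}
	m := scoreLine.FindStringSubmatch(body)
	if m == nil {
		return judgeNotRun, reason
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return judgeNotRun, reason
	}
	return n, reason
}

// meanJudge averages scores while skipping sentinel rows, and reports how
// many rows actually contributed.
func meanJudge(scores []int) (float64, int) {
	sum, n := 0, 0
	for _, s := range scores {
		if s == judgeNotRun {
			continue
		}
		sum += s
		n++
	}
	if n == 0 {
		return 0, 0
	}
	return float64(sum) / float64(n), n
}

func main() {
	raw := "<think>draft score: 1</think>\n**Score** = 4\nRationale: cites the right vendor"
	score, reason := parseJudge(raw)
	fmt.Println(score, reason)

	mean, n := meanJudge([]int{4, judgeNotRun, 2})
	fmt.Println(mean, n)
}
```

Stripping the reasoning block before matching keeps a draft score inside `<think>` from being mistaken for the verdict, and returning the contributing row count alongside the mean lets a report show how many questions the judge actually scored.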

Prompt-builder refactor in internal/llm/prompt.go:

  • BuildTableInfo(store) exports the former app.buildTableInfoFrom so the eval reproduces the schema section of chat prompts exactly.
  • BuildFTSContext(entries) and BuildFTSContextFromStore(store, q) are the FTS-context formatters. They're unused on the chat path (chat passes "" for ftsContext everywhere); the follow-up chat wiring PR routes real FTS results through them.
  • BuildSQLPrompt / BuildSummaryPrompt / BuildSystemPrompt take a new ftsContext string positional arg. Chat passes "" — identical prompt text to pre-FTS behavior.
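The "identical prompt text" claim can be illustrated with a small sketch. The function below is a hypothetical stand-in, not the real `BuildSQLPrompt`: its section headings and layout are invented, and only the trailing `ftsContext string` argument mirrors what this PR describes.

```go
package main

import (
	"fmt"
	"strings"
)

// buildSQLPrompt is a hypothetical stand-in for the builders in
// internal/llm/prompt.go. The FTS section is emitted only when the caller
// supplies context, so passing "" reproduces the pre-FTS text byte for byte.
func buildSQLPrompt(tableInfo, question, ftsContext string) string {
	var b strings.Builder
	b.WriteString("## Schema\n")
	b.WriteString(tableInfo)
	b.WriteString("\n")
	if ftsContext != "" {
		b.WriteString("\n## Relevant records\n")
		b.WriteString(ftsContext)
		b.WriteString("\n")
	}
	b.WriteString("\n## Question\n")
	b.WriteString(question)
	return b.String()
}

func main() {
	schema := "vendors(id, name, notes)"
	q := "Which vendor mentioned permit delays?"

	// Chat path: empty context, no FTS section.
	fmt.Println(buildSQLPrompt(schema, q, ""))
	fmt.Println("-----")
	// Eval path, FTS-on arm: context is populated by the caller.
	fmt.Println(buildSQLPrompt(schema, q, "vendor note: permit delays"))
}
```

Guarding the whole section on the empty string, rather than emitting an empty heading, is what would let one builder serve both the chat path and the two eval arms without changing existing chat prompts.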

CLI: micasa eval fts with --db, --provider, --model, --judge-model, --questions, --skip-judge, --no-ab, --format, --output, --strict. Default fixture built in a tempdir that cleans up on exit; --db points at an existing store. Privacy warning on stderr when running against a non-fixture DB on a non-local provider.
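As a rough sketch of the store selection and privacy warning described above: the helpers `resolveStore` and `privacyWarning`, the fixture file name, and the warning text are all assumptions for illustration, and whether a provider counts as local is taken as a plain boolean rather than guessed from provider names.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// resolveStore returns the database path to evaluate plus a cleanup func.
// An empty dbFlag means "build the default fixture in a throwaway tempdir".
func resolveStore(dbFlag string) (string, func(), error) {
	if dbFlag != "" {
		// An existing store is never deleted by the eval.
		return dbFlag, func() {}, nil
	}
	dir, err := os.MkdirTemp("", "fts-eval-*")
	if err != nil {
		return "", nil, err
	}
	cleanup := func() { _ = os.RemoveAll(dir) }
	return filepath.Join(dir, "fixture.db"), cleanup, nil
}

// privacyWarning returns a message only when a non-fixture database would be
// sent to a non-local provider; otherwise it returns "".
func privacyWarning(dbFlag string, providerIsLocal bool) string {
	if dbFlag == "" || providerIsLocal {
		return ""
	}
	return "warning: contents of " + dbFlag + " will be sent to a non-local provider"
}

func main() {
	dbFlag := "" // empty: use the seeded fixture
	path, cleanup, err := resolveStore(dbFlag)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer cleanup()

	if msg := privacyWarning(dbFlag, false); msg != "" {
		fmt.Fprintln(os.Stderr, msg)
	}
	fmt.Println("evaluating against", path)
}
```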

Nix: nix run '.#fts-eval' wraps the subcommand.

Refs #707

@cpcloud added the enhancement and llm labels Apr 20, 2026
@cpcloud force-pushed the fts-eval-harness branch 4 times, most recently from 25ed0b9 to 8d665d2, April 20, 2026 13:22
Wires a chat-quality evaluation harness for the FTS-enrichment feature.
No behavior change to the TUI chat pipeline -- this PR only adds the
eval surface and exports the prompt-building helpers it needs.

- `internal/ftseval/` package: typed Config, Question, ArmResult,
  RunResult, GradeResult. Run() drives each question through both
  FTS arms against a pre-built store, grades with a deterministic
  regex rubric plus an optional LLM judge, and returns per-question
  results.
- `SeedFixture` populates projects, vendors, appliances, maintenance
  items, incidents, one service log, and one quote with the "permit
  delays" long-tail vendor note.
- Default question set covering disambiguation, cross-entity joins,
  service-log lookup, FTS-neutral aggregate, basement incidents,
  nonexistent entity, long-tail note, and brand filter.
- Judge-score sentinel -1 when the judge didn't run; 0-5 when it did.
  Judge parser tolerates real-world model output: markdown decoration,
  `:`/`=` separators, mixed case, leading <think>/<thinking>/<reasoning>
  blocks, and "Rationale" as an alias for "Reason". judge_reason
  surfaces in Notes when the score is the sentinel.
- Table report (default on TTYs, via lipgloss), markdown (default when
  piping or writing to a file), and JSON. JSON redacts APIKey.
  Judge-score aggregates exclude sentinel rows. detectDarkBG guards
  lipgloss.HasDarkBackground behind a stdin-is-a-TTY check plus a
  recover() fallback so the reporter stays safe in CI (including
  Windows, where lipgloss's terminal query can panic on non-TTY
  stdin).
- `--strict` exits 1 on per-question FTS-on rubric regression over
  questions completed on both arms (sql_error counts as completed;
  provider errors don't). runEvalFTS splits into an inner doEvalFTS
  that returns (int, error) so deferred cleanup fires before os.Exit
  when strict mode triggers a non-zero exit.
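
The split described in the last bullet follows a common Go idiom: `os.Exit` does not run deferred functions, so the inner function returns an exit code and the outer one exits after the defers have fired. The sketch below is not the PR's `doEvalFTS`; the `questionResult` type and the reading of "regression" as "rubric passed with FTS off, failed with FTS on" are assumptions.

```go
package main

import (
	"fmt"
	"os"
)

// questionResult is a hypothetical per-question record, not the PR's RunResult.
type questionResult struct {
	Name            string
	OffDone, OnDone bool // arm completed (sql_error counts, provider errors do not)
	OffPass, OnPass bool // deterministic rubric outcome per arm
}

// regressions lists questions that passed with FTS off but failed with FTS on,
// considering only questions completed on both arms.
func regressions(rs []questionResult) []string {
	var out []string
	for _, r := range rs {
		if !r.OffDone || !r.OnDone {
			continue
		}
		if r.OffPass && !r.OnPass {
			out = append(out, r.Name)
		}
	}
	return out
}

// doEval returns an exit code instead of calling os.Exit, so the deferred
// cleanup always runs before the process terminates.
func doEval(strict bool, rs []questionResult, cleanup func()) (int, error) {
	defer cleanup()
	if bad := regressions(rs); strict && len(bad) > 0 {
		fmt.Fprintf(os.Stderr, "FTS-on regressions: %v\n", bad)
		return 1, nil
	}
	return 0, nil
}

func main() {
	rs := []questionResult{
		{Name: "brand filter", OffDone: true, OnDone: true, OffPass: true, OnPass: true},
		// Excluded from the strict check: the FTS-on arm hit a provider error.
		{Name: "long-tail note", OffDone: true, OnDone: false, OffPass: true},
	}
	code, err := doEval(true, rs, func() { fmt.Fprintln(os.Stderr, "cleanup ran") })
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if code != 0 {
		os.Exit(code) // safe here: doEval's defers have already fired
	}
}
```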

Prompt-builder refactor (in `internal/llm/prompt.go`):

- `BuildTableInfo(store)` exports the former `app.buildTableInfoFrom`
  so the eval reproduces the schema section of chat prompts exactly.
- `BuildFTSContext(entries)` and `BuildFTSContextFromStore(store, q)`
  are the FTS-context formatters. They're unused on the chat path
  (chat passes `""` for ftsContext everywhere); the follow-up chat
  wiring PR routes real FTS results through them.
- `BuildSQLPrompt` / `BuildSummaryPrompt` / `BuildSystemPrompt` take
  a new `ftsContext string` positional arg. Chat passes `""` --
  identical prompt text to pre-FTS behavior. The arg is load-bearing
  only when a caller populates it; the eval does, chat does not.

CLI: `micasa eval fts` with --db, --provider, --model, --judge-model,
--questions, --skip-judge, --no-ab, --format, --output, --strict.
Default fixture is built in a tempdir that cleans up on exit; --db
points at an existing store. Privacy warning on stderr when running
against a non-fixture DB on a non-local provider.

Nix: `nix run '.#fts-eval'` wraps the subcommand.

Refs micasa-dev#707.
@cpcloud merged commit f3ff269 into micasa-dev:main Apr 23, 2026
28 checks passed
@cpcloud deleted the fts-eval-harness branch April 23, 2026 11:12