feat(cli): add micasa eval fts subcommand by cpcloud · Pull Request #963 · micasa-dev/micasa

cpcloud · 2026-04-20T12:19:57Z

Summary

Adds a chat-quality evaluation harness for the FTS-enrichment feature. No behavior change to the TUI chat pipeline — this PR only adds the eval surface and exports the prompt-building helpers it needs.

- `internal/ftseval/` package: typed Config, Question, ArmResult, RunResult, GradeResult. Run() drives each question through both FTS arms against a pre-built store, grades with a deterministic regex rubric plus an optional LLM judge, and returns per-question results.
- `SeedFixture` populates projects, vendors, appliances, maintenance items, incidents, one service log, and one quote (with the "permit delays" long-tail vendor note).
- Default question set: disambiguation, cross-entity joins, service-log lookup, FTS-neutral aggregate, basement incidents, nonexistent entity, long-tail note, brand filter.
- Judge-score sentinel -1 when the judge didn't run; 0-5 when it did. Judge parser tolerates real-world model output: markdown decoration, `:`/`=` separators, mixed case, leading <think>/<thinking>/<reasoning> blocks, "Rationale" as an alias for "Reason". judge_reason surfaces in Notes when the score is the sentinel.
- Table report (default on TTYs, via lipgloss), markdown (default when piping or writing to a file), and JSON. JSON redacts APIKey. Judge-score aggregates exclude sentinel rows.
- `--strict` exits 1 on per-question FTS-on rubric regression over questions completed on both arms.
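
The judge-score handling in the list above can be illustrated with a minimal sketch. It is not the code from `internal/ftseval/`: the names `parseJudge`, `meanJudge`, and `judgeSentinel`, and the exact regular expressions, are assumptions made for illustration. It shows one way to strip a leading reasoning block, accept `:` or `=` separators with markdown decoration and mixed case, treat "Rationale" as an alias for "Reason", and keep sentinel rows out of an aggregate.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// judgeSentinel marks "no usable judge score" (hypothetical name; the PR
// only states that the sentinel value is -1).
const judgeSentinel = -1

var (
	// A leading <think>/<thinking>/<reasoning> block that precedes the verdict.
	reasoningBlock = regexp.MustCompile(`(?is)^\s*<(?:think|thinking|reasoning)>.*?</(?:think|thinking|reasoning)>`)
	// "Score: 4", "**score** = 4", "SCORE:4" and similar; only 0-5 is accepted.
	scoreLine = regexp.MustCompile(`(?im)^[\s*_#>-]*score[\s*_]*[:=][\s*_]*([0-5])\b`)
	// "Reason: ..." with "Rationale" accepted as an alias.
	reasonLine = regexp.MustCompile(`(?im)^[\s*_#>-]*(?:reason|rationale)[\s*_]*[:=][ \t*_]*(.+)$`)
)

// parseJudge extracts a 0-5 score and a free-text reason from raw model
// output. The score is judgeSentinel when no valid score line is found.
func parseJudge(raw string) (score int, reason string) {
	body := reasoningBlock.ReplaceAllString(raw, "")
	score = judgeSentinel
	if m := scoreLine.FindStringSubmatch(body); m != nil {
		if n, err := strconv.Atoi(m[1]); err == nil {
			score = n
		}
	}
	if m := reasonLine.FindStringSubmatch(body); m != nil {
		reason = strings.Trim(m[1], "*_ \t")
	}
	return score, reason
}

// meanJudge averages scores while skipping sentinel rows. ok is false when
// every row is a sentinel, so callers can print "n/a" instead of 0.
func meanJudge(scores []int) (mean float64, ok bool) {
	sum, n := 0, 0
	for _, s := range scores {
		if s == judgeSentinel {
			continue
		}
		sum += s
		n++
	}
	if n == 0 {
		return 0, false
	}
	return float64(sum) / float64(n), true
}

func main() {
	raw := "<think>maybe a 2?</think>\n**Score:** 4\nRationale: cites the permit-delays note"
	score, reason := parseJudge(raw)
	fmt.Println(score, reason)

	mean, ok := meanJudge([]int{score, judgeSentinel, 2})
	fmt.Println(mean, ok)
}
```

Returning an explicit `ok` from the aggregate, rather than 0, keeps "no judge ran" distinguishable from "the judge scored everything 0", which is the same distinction the sentinel draws per row.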

Prompt-builder refactor in internal/llm/prompt.go:

- `BuildTableInfo(store)` exports the former `app.buildTableInfoFrom` so the eval reproduces the schema section of chat prompts exactly.
- `BuildFTSContext(entries)` and `BuildFTSContextFromStore(store, q)` are the FTS-context formatters. They're unused on the chat path (chat passes `""` for ftsContext everywhere); the follow-up chat wiring PR routes real FTS results through them.
- `BuildSQLPrompt` / `BuildSummaryPrompt` / `BuildSystemPrompt` take a new `ftsContext string` positional arg. Chat passes `""` — identical prompt text to pre-FTS behavior.
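
A minimal sketch of why passing `""` keeps prompts byte-identical. The real signatures and section headings in `internal/llm/prompt.go` are not shown in this PR description, so `buildPrompt`, its parameters, and the headings below are hypothetical stand-ins; only the idea (an optional section that is emitted solely when `ftsContext` is non-empty) comes from the description.

```go
package main

import (
	"fmt"
	"strings"
)

// buildPrompt is a hypothetical stand-in for the real prompt builders.
// The ftsContext section is written only when the caller supplies text,
// so an empty string reproduces the pre-FTS prompt exactly.
func buildPrompt(schema, ftsContext, question string) string {
	var b strings.Builder
	b.WriteString("## Schema\n")
	b.WriteString(schema)
	b.WriteString("\n")
	if ftsContext != "" {
		b.WriteString("\n## Relevant records\n")
		b.WriteString(ftsContext)
		b.WriteString("\n")
	}
	b.WriteString("\n## Question\n")
	b.WriteString(question)
	return b.String()
}

func main() {
	schema := "vendors(id, name, notes)"
	question := "Which vendor mentioned permit delays?"

	// Chat path: no FTS context, prompt text unchanged.
	fmt.Println(buildPrompt(schema, "", question))
	fmt.Println("---")
	// Eval path (FTS-on arm): the caller populates the context.
	fmt.Println(buildPrompt(schema, "vendor note: permit delays", question))
}
```

Guarding the whole section, heading included, is what makes the empty case identical rather than merely similar; an unconditional heading with an empty body would already change the prompt.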

CLI: micasa eval fts with --db, --provider, --model, --judge-model, --questions, --skip-judge, --no-ab, --format, --output, --strict. Default fixture built in a tempdir that cleans up on exit; --db points at an existing store. Privacy warning on stderr when running against a non-fixture DB on a non-local provider.
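
The privacy warning can be pictured with the sketch below. How micasa decides that a provider is "local" is not stated here, so the loopback-host check, the function names, and the warning text are all assumptions; only the condition (a non-fixture DB combined with a non-local provider, reported on stderr) comes from the description.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"os"
)

// isLocalEndpoint reports whether a provider base URL points at this
// machine. Treating loopback hosts as "local" is an assumption.
func isLocalEndpoint(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	host := u.Hostname()
	if host == "localhost" {
		return true
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

// privacyWarning returns a message when real data (dbPath set, i.e. not
// the default fixture) would be sent to a non-local provider, and ""
// otherwise. The wording is illustrative.
func privacyWarning(dbPath, endpoint string) string {
	if dbPath == "" || isLocalEndpoint(endpoint) {
		return ""
	}
	return fmt.Sprintf("warning: contents of %s will be sent to %s", dbPath, endpoint)
}

func main() {
	if msg := privacyWarning("/home/me/micasa.db", "https://api.example.com/v1"); msg != "" {
		// stderr keeps the warning out of piped markdown or JSON reports.
		fmt.Fprintln(os.Stderr, msg)
	}
}
```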

Nix: nix run '.#fts-eval' wraps the subcommand.

Refs #707

Wires a chat-quality evaluation harness for the FTS-enrichment feature. No behavior change to the TUI chat pipeline -- this PR only adds the eval surface and exports the prompt-building helpers it needs.

- `internal/ftseval/` package: typed Config, Question, ArmResult, RunResult, GradeResult. Run() drives each question through both FTS arms against a pre-built store, grades with a deterministic regex rubric plus an optional LLM judge, and returns per-question results.
- `SeedFixture` populates projects, vendors, appliances, maintenance items, incidents, one service log, and one quote with the "permit delays" long-tail vendor note.
- Default question set covering disambiguation, cross-entity joins, service-log lookup, FTS-neutral aggregate, basement incidents, nonexistent entity, long-tail note, and brand filter.
- Judge-score sentinel -1 when the judge didn't run; 0-5 when it did. Judge parser tolerates real-world model output: markdown decoration, `:`/`=` separators, mixed case, leading <think>/<thinking>/<reasoning> blocks, and "Rationale" as an alias for "Reason". judge_reason surfaces in Notes when the score is the sentinel.
- Table report (default on TTYs, via lipgloss), markdown (default when piping or writing to a file), and JSON. JSON redacts APIKey. Judge-score aggregates exclude sentinel rows. detectDarkBG guards lipgloss.HasDarkBackground behind a stdin-is-a-TTY check plus a recover() fallback so the reporter stays safe in CI (including Windows, where lipgloss's terminal query can panic on non-TTY stdin).
- `--strict` exits 1 on per-question FTS-on rubric regression over questions completed on both arms (sql_error counts as completed; provider errors don't). runEvalFTS splits into an inner doEvalFTS that returns (int, error) so deferred cleanup fires before os.Exit when strict mode triggers a non-zero exit.

Prompt-builder refactor (in `internal/llm/prompt.go`):

- `BuildTableInfo(store)` exports the former `app.buildTableInfoFrom` so the eval reproduces the schema section of chat prompts exactly.
- `BuildFTSContext(entries)` and `BuildFTSContextFromStore(store, q)` are the FTS-context formatters. They're unused on the chat path (chat passes `""` for ftsContext everywhere); the follow-up chat wiring PR routes real FTS results through them.
- `BuildSQLPrompt` / `BuildSummaryPrompt` / `BuildSystemPrompt` take a new `ftsContext string` positional arg. Chat passes `""` -- identical prompt text to pre-FTS behavior. The arg is load-bearing only when a caller populates it; the eval does, chat does not.

CLI: `micasa eval fts` with --db, --provider, --model, --judge-model, --questions, --skip-judge, --no-ab, --format, --output, --strict. Default fixture is built in a tempdir that cleans up on exit; --db points at an existing store. Privacy warning on stderr when running against a non-fixture DB on a non-local provider.

Nix: `nix run '.#fts-eval'` wraps the subcommand.

Refs micasa-dev#707.
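
The strict-mode exit path in the commit message relies on a general Go rule: `os.Exit` terminates the process without running deferred functions, so cleanup must live in an inner function that has already returned. The sketch below illustrates that split under stated assumptions. The `result` type, the status string `provider_error`, the integer rubric scores, and the name `doEval` are hypothetical; the PR only says that `sql_error` counts as completed, that provider errors do not, and that the inner function returns `(int, error)`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// result is a hypothetical per-question record covering both FTS arms.
type result struct {
	ID                  string
	StatusOff, StatusOn string // "ok", "sql_error", or "provider_error" (assumed names)
	RubricOff, RubricOn int    // rubric score per arm (integer scoring is an assumption)
}

// completed treats sql_error as a finished run; provider errors are not.
func completed(status string) bool { return status != "provider_error" }

// regressed reports whether any question that completed on both arms
// scored lower with FTS on than with FTS off.
func regressed(rs []result) bool {
	for _, r := range rs {
		if completed(r.StatusOff) && completed(r.StatusOn) && r.RubricOn < r.RubricOff {
			return true
		}
	}
	return false
}

// doEval returns an exit code instead of calling os.Exit, so its deferred
// tempdir cleanup always runs before the process terminates.
func doEval(strict bool, rs []result) (int, error) {
	dir, err := os.MkdirTemp("", "fts-eval-*")
	if err != nil {
		return 1, err
	}
	defer os.RemoveAll(dir) // fires when doEval returns

	// Stand-in for building the default fixture store in the tempdir.
	if err := os.WriteFile(filepath.Join(dir, "fixture.db"), nil, 0o600); err != nil {
		return 1, err
	}
	if strict && regressed(rs) {
		return 1, nil
	}
	return 0, nil
}

func main() {
	rs := []result{{ID: "q1", StatusOff: "ok", StatusOn: "ok", RubricOff: 1, RubricOn: 2}}
	code, err := doEval(true, rs)
	if err != nil {
		fmt.Fprintln(os.Stderr, "eval:", err)
	}
	if code != 0 {
		os.Exit(code) // safe: doEval's deferred cleanup has already run
	}
}
```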

cpcloud added enhancement New feature or request llm LLM and chat features labels Apr 20, 2026

cpcloud mentioned this pull request Apr 20, 2026

feat(data): FTS-powered context enrichment for LLM chat #933

Closed

cpcloud force-pushed the fts-eval-harness branch 4 times, most recently from 25ed0b9 to 8d665d2 Compare April 20, 2026 13:22

cpcloud force-pushed the fts-eval-harness branch from 8d665d2 to 3965ce6 Compare April 23, 2026 10:22

cpcloud merged commit f3ff269 into micasa-dev:main Apr 23, 2026
28 checks passed

cpcloud deleted the fts-eval-harness branch April 23, 2026 11:12

cpcloud merged 1 commit into micasa-dev:main from cpcloud:fts-eval-harness
