feat(cli): add micasa eval fts subcommand #963
Merged
cpcloud merged 1 commit into micasa-dev:main on Apr 23, 2026
25ed0b9 to 8d665d2
Wires a chat-quality evaluation harness for the FTS-enrichment feature. No behavior change to the TUI chat pipeline -- this PR only adds the eval surface and exports the prompt-building helpers it needs.

- `internal/ftseval/` package: typed Config, Question, ArmResult, RunResult, GradeResult. Run() drives each question through both FTS arms against a pre-built store, grades with a deterministic regex rubric plus an optional LLM judge, and returns per-question results.
- `SeedFixture` populates projects, vendors, appliances, maintenance items, incidents, one service log, and one quote with the "permit delays" long-tail vendor note.
- Default question set covering disambiguation, cross-entity joins, service-log lookup, FTS-neutral aggregate, basement incidents, nonexistent entity, long-tail note, and brand filter.
- Judge-score sentinel -1 when the judge didn't run; 0-5 when it did. Judge parser tolerates real-world model output: markdown decoration, `:`/`=` separators, mixed case, leading `<think>`/`<thinking>`/`<reasoning>` blocks, and "Rationale" as an alias for "Reason". judge_reason surfaces in Notes when the score is the sentinel.
- Table report (default on TTYs, via lipgloss), markdown (default when piping or writing to a file), and JSON. JSON redacts APIKey. Judge-score aggregates exclude sentinel rows. detectDarkBG guards lipgloss.HasDarkBackground behind a stdin-is-a-TTY check plus a recover() fallback so the reporter stays safe in CI (including Windows, where lipgloss's terminal query can panic on non-TTY stdin).
- `--strict` exits 1 on per-question FTS-on rubric regression over questions completed on both arms (sql_error counts as completed; provider errors don't). runEvalFTS splits into an inner doEvalFTS that returns (int, error) so deferred cleanup fires before os.Exit when strict mode triggers a non-zero exit.
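
The tolerant judge parsing described in the list above can be sketched roughly as follows. This is a minimal illustration, not the parser from `internal/ftseval/`: the `Verdict` type, the `parseJudge` function, and the three regular expressions are assumptions made for the example. It strips a leading reasoning block, accepts `:` or `=` separators with markdown decoration and mixed case, treats "Rationale" as "Reason", and falls back to the `-1` sentinel when no score in 0-5 is found.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// judgeSentinel marks a verdict where no usable score was produced.
const judgeSentinel = -1

// Verdict is a hypothetical result type, not the real ftseval.GradeResult.
type Verdict struct {
	Score  int
	Reason string
}

var (
	// A leading <think>, <thinking>, or <reasoning> block some models emit.
	thinkRe = regexp.MustCompile(`(?is)^\s*<(?:think|thinking|reasoning)>.*?</(?:think|thinking|reasoning)>`)
	// "Score: 4", "**score** = 4", "SCORE:4"; only a lone digit 0-5 counts.
	scoreRe = regexp.MustCompile(`(?im)^[\s*_#>-]*score[\s*_]*[:=][\s*_]*([0-5])\b`)
	// "Reason: ..." or "Rationale: ...", with the same decoration tolerance.
	reasonRe = regexp.MustCompile(`(?im)^[\s*_#>-]*(?:reason|rationale)[\s*_]*[:=][\s*_]*(.+)$`)
)

// parseJudge extracts a score and reason from raw judge output.
// The score stays at judgeSentinel when nothing parseable is found.
func parseJudge(raw string) Verdict {
	body := thinkRe.ReplaceAllString(raw, "")
	v := Verdict{Score: judgeSentinel}
	if m := reasonRe.FindStringSubmatch(body); m != nil {
		v.Reason = strings.Trim(strings.TrimSpace(m[1]), "*_ ")
	}
	if m := scoreRe.FindStringSubmatch(body); m != nil {
		if n, err := strconv.Atoi(m[1]); err == nil {
			v.Score = n
		}
	}
	return v
}

func main() {
	raw := "<think>score: 1 maybe</think>\n**Score:** 4\nRationale: cites the right vendor"
	fmt.Printf("%+v\n", parseJudge(raw))
	fmt.Printf("%+v\n", parseJudge("no verdict here"))
}
```

Stripping the reasoning block first matters in this sketch: otherwise a "score: 1" that the model muses about inside `<think>` would win over the final verdict.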

Prompt-builder refactor (in `internal/llm/prompt.go`):

- `BuildTableInfo(store)` exports the former `app.buildTableInfoFrom` so the eval reproduces the schema section of chat prompts exactly.
- `BuildFTSContext(entries)` and `BuildFTSContextFromStore(store, q)` are the FTS-context formatters. They're unused on the chat path (chat passes `""` for ftsContext everywhere); the follow-up chat wiring PR routes real FTS results through them.
- `BuildSQLPrompt` / `BuildSummaryPrompt` / `BuildSystemPrompt` take a new `ftsContext string` positional arg. Chat passes `""` -- identical prompt text to pre-FTS behavior. The arg is load-bearing only when a caller populates it; the eval does, chat does not.

CLI: `micasa eval fts` with --db, --provider, --model, --judge-model, --questions, --skip-judge, --no-ab, --format, --output, --strict. Default fixture is built in a tempdir that cleans up on exit; --db points at an existing store. Privacy warning on stderr when running against a non-fixture DB on a non-local provider.

Nix: `nix run '.#fts-eval'` wraps the subcommand.

Refs micasa-dev#707.
8d665d2 to 3965ce6
Summary

Adds a chat-quality evaluation harness for the FTS-enrichment feature. No behavior change to the TUI chat pipeline: this PR only adds the eval surface and exports the prompt-building helpers it needs.

- `internal/ftseval/` package: typed `Config`, `Question`, `ArmResult`, `RunResult`, `GradeResult`. `Run()` drives each question through both FTS arms against a pre-built store, grades with a deterministic regex rubric plus an optional LLM judge, and returns per-question results.
- `SeedFixture` populates projects, vendors, appliances, maintenance items, incidents, one service log, and one quote (with the "permit delays" long-tail vendor note).
- Default question set covering disambiguation, cross-entity joins, service-log lookup, FTS-neutral aggregate, basement incidents, nonexistent entity, long-tail note, and brand filter.
- Judge-score sentinel `-1` when the judge didn't run; 0-5 when it did. Judge parser tolerates real-world model output: markdown decoration, `:`/`=` separators, mixed case, leading `<think>`/`<thinking>`/`<reasoning>` blocks, "Rationale" as an alias for "Reason". `judge_reason` surfaces in Notes when the score is the sentinel.
- Table report (default on TTYs, via lipgloss), markdown (default when piping or writing to a file), and JSON. JSON redacts `APIKey`. Judge-score aggregates exclude sentinel rows.
- `--strict` exits 1 on per-question FTS-on rubric regression over questions completed on both arms.

Prompt-builder refactor in `internal/llm/prompt.go`:

- `BuildTableInfo(store)` exports the former `app.buildTableInfoFrom` so the eval reproduces the schema section of chat prompts exactly.
- `BuildFTSContext(entries)` and `BuildFTSContextFromStore(store, q)` are the FTS-context formatters. They're unused on the chat path (chat passes `""` for `ftsContext` everywhere); the follow-up chat wiring PR routes real FTS results through them.
- `BuildSQLPrompt` / `BuildSummaryPrompt` / `BuildSystemPrompt` take a new `ftsContext string` positional arg. Chat passes `""` -- identical prompt text to pre-FTS behavior.

CLI: `micasa eval fts` with `--db`, `--provider`, `--model`, `--judge-model`, `--questions`, `--skip-judge`, `--no-ab`, `--format`, `--output`, `--strict`. Default fixture built in a tempdir that cleans up on exit; `--db` points at an existing store. Privacy warning on stderr when running against a non-fixture DB on a non-local provider.

Nix: `nix run '.#fts-eval'` wraps the subcommand.

Refs #707