Commit 25ed0b9

feat(cli): add micasa eval fts subcommand
Wires a chat-quality evaluation harness for the FTS-enrichment feature. No behavior change to the TUI chat pipeline -- this PR only adds the eval surface and exports the prompt-building helpers it needs.

- `internal/ftseval/` package: typed Config, Question, ArmResult, RunResult, GradeResult. Run() drives each question through both FTS arms against a pre-built store, grades with a deterministic regex rubric plus an optional LLM judge, and returns per-question results.
- `SeedFixture` populates projects, vendors, appliances, maintenance items, incidents, one service log, and one quote with the "permit delays" long-tail vendor note.
- Default question set covering disambiguation, cross-entity joins, service-log lookup, FTS-neutral aggregate, basement incidents, nonexistent entity, long-tail note, and brand filter.
- Judge-score sentinel -1 when the judge didn't run; 0-5 when it did. The judge parser tolerates real-world model output: markdown decoration, `:`/`=` separators, mixed case, leading `<think>`/`<thinking>`/`<reasoning>` blocks, and "Rationale" as an alias for "Reason". judge_reason surfaces in Notes when the score is the sentinel.
- Table report (default on TTYs, via lipgloss), markdown (default when piping or writing to a file), and JSON. JSON redacts APIKey. Judge-score aggregates exclude sentinel rows. detectDarkBG guards lipgloss.HasDarkBackground behind a stdin-is-a-TTY check plus a recover() fallback so the reporter stays safe in CI (including Windows, where lipgloss's terminal query can panic on non-TTY stdin).
- `--strict` exits 1 on a per-question FTS-on rubric regression over questions completed on both arms (sql_error counts as completed; provider errors don't). runEvalFTS splits into an inner doEvalFTS that returns (int, error) so deferred cleanup fires before os.Exit when strict mode triggers a non-zero exit.

Prompt-builder refactor (in `internal/llm/prompt.go`):

- `BuildTableInfo(store)` exports the former `app.buildTableInfoFrom` so the eval reproduces the schema section of chat prompts exactly.
- `BuildFTSContext(entries)` and `BuildFTSContextFromStore(store, q)` are the FTS-context formatters. They're unused on the chat path (chat passes `""` for ftsContext everywhere); the follow-up chat wiring PR routes real FTS results through them.
- `BuildSQLPrompt` / `BuildSummaryPrompt` / `BuildSystemPrompt` take a new `ftsContext string` positional arg. Chat passes `""` -- identical prompt text to pre-FTS behavior. The arg is load-bearing only when a caller populates it; the eval does, chat does not.

CLI: `micasa eval fts` with --db, --provider, --model, --judge-model, --questions, --skip-judge, --no-ab, --format, --output, --strict. The default fixture is built in a tempdir that cleans up on exit; --db points at an existing store. A privacy warning prints to stderr when running against a non-fixture DB on a non-local provider.

Nix: `nix run '.#fts-eval'` wraps the subcommand.

Refs #707.
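The tolerant judge-output parsing described above can be sketched like this. This is a minimal illustration, not the actual `ftseval` parser: the regexes and the `parseJudgeScore` name are assumptions, but the behavior matches the commit message (markdown decoration, `:`/`=` separators, mixed case, leading think blocks, sentinel -1 on failure):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// thinkBlock strips a leading <think>/<thinking>/<reasoning> block,
// which some models emit before their real answer.
var thinkBlock = regexp.MustCompile(`(?is)^\s*<(think|thinking|reasoning)>.*?</(think|thinking|reasoning)>`)

// scoreLine matches "Score: 4", "**SCORE = 3**", etc.: any case, a ':'
// or '=' separator, and optional markdown asterisks around the digit.
var scoreLine = regexp.MustCompile(`(?i)score\s*[:=]\s*\**\s*([0-5])`)

// parseJudgeScore returns the 0-5 score, or -1 (the sentinel) when no
// score can be recovered from the judge's output.
func parseJudgeScore(raw string) int {
	s := thinkBlock.ReplaceAllString(raw, "")
	m := scoreLine.FindStringSubmatch(s)
	if m == nil {
		return -1
	}
	n, err := strconv.Atoi(m[1])
	if err != nil || n < 0 || n > 5 {
		return -1
	}
	return n
}

func main() {
	for _, out := range []string{
		"Score: 4\nReason: good join",
		"**SCORE = 3** Rationale: missed one row",
		"<think>checking the rows...</think>\nscore: 5",
		"no score here",
	} {
		fmt.Println(parseJudgeScore(out)) // 4, 3, 5, -1
	}
}
```

Returning a sentinel instead of an error keeps the per-question result shape uniform; the aggregates then just skip -1 rows, as the report does.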
1 parent 1fd3cdc commit 25ed0b9

14 files changed

Lines changed: 2292 additions & 56 deletions

File tree

cmd/micasa/eval.go

Lines changed: 268 additions & 0 deletions
```go
// Copyright 2026 Phillip Cloud
// Licensed under the Apache License, Version 2.0

package main

import (
	"context"
	"fmt"
	"io"
	"os"
	"os/signal"
	"strings"
	"time"

	"github.com/micasa-dev/micasa/internal/config"
	"github.com/micasa-dev/micasa/internal/data"
	"github.com/micasa-dev/micasa/internal/ftseval"
	"github.com/micasa-dev/micasa/internal/llm"
	"github.com/spf13/cobra"
)

// evalOpts mirrors ftseval.Config plus CLI-only knobs. Populated by
// Cobra flag parsing; validated inside RunE.
type evalOpts struct {
	dbPath     string
	provider   string
	model      string
	judgeModel string
	questions  []string
	skipJudge  bool
	noAB       bool
	format     string
	output     string
	strict     bool
}

// newEvalCmd returns the `micasa eval` parent command. Sub-evals slot in
// as children (`eval fts`, future `eval extraction`, etc.).
func newEvalCmd() *cobra.Command {
	cmd := &cobra.Command{
		Use:           "eval",
		Short:         "Run chat-quality benchmarks against a fixture or user DB",
		Long:          `Parent command for chat-quality evaluations. See subcommands.`,
		SilenceErrors: true,
		SilenceUsage:  true,
	}
	cmd.AddCommand(newEvalFTSCmd())
	return cmd
}

func newEvalFTSCmd() *cobra.Command {
	opts := &evalOpts{}
	cmd := &cobra.Command{
		Use:   "fts",
		Short: "Run the FTS context-enrichment chat benchmark",
		Long: `Run the FTS chat benchmark against the default fixture DB or a
user-supplied SQLite file. Each question runs twice (FTS on and FTS off) and
is graded by a deterministic regex rubric, with an optional LLM judge pass.

The eval uses the chat config from the user's config file; --provider and
--model override specific fields. Pointing --db at a real micasa DB sends
prompts derived from household data to the configured provider -- if that
provider is a cloud service, the data leaves the machine.`,
		SilenceErrors: true,
		SilenceUsage:  true,
		RunE: func(cmd *cobra.Command, _ []string) error {
			return runEvalFTS(cmd.OutOrStdout(), opts)
		},
	}

	cmd.Flags().StringVar(&opts.dbPath, "db", "",
		"path to a micasa SQLite DB (default: fixture)")
	cmd.Flags().StringVar(&opts.provider, "provider", "",
		"override chat provider from config")
	cmd.Flags().StringVar(&opts.model, "model", "",
		"override chat model from config")
	cmd.Flags().StringVar(&opts.judgeModel, "judge-model", "",
		"model for the LLM judge (default: same as --model)")
	cmd.Flags().StringSliceVar(&opts.questions, "questions", nil,
		"comma-separated names of questions to run (default: all)")
	cmd.Flags().BoolVar(&opts.skipJudge, "skip-judge", false,
		"deterministic rubric only; skip the LLM judge")
	cmd.Flags().BoolVar(&opts.noAB, "no-ab", false,
		"run each question once (FTS on) instead of twice")
	cmd.Flags().StringVar(&opts.format, "format", "",
		"report format: table (default when TTY), markdown, or json")
	cmd.Flags().StringVar(&opts.output, "output", "",
		"write report to this file instead of stdout")
	cmd.Flags().BoolVar(&opts.strict, "strict", false,
		"exit non-zero on per-question rubric regression (completed on both arms)")

	return cmd
}

func runEvalFTS(defaultOut io.Writer, opts *evalOpts) error {
	code, err := doEvalFTS(defaultOut, opts)
	if err != nil {
		return err
	}
	if code != 0 {
		os.Exit(code)
	}
	return nil
}

func doEvalFTS(defaultOut io.Writer, opts *evalOpts) (int, error) {
	cfg, err := config.Load()
	if err != nil {
		return 0, fmt.Errorf("load config: %w", err)
	}

	chatLLM := cfg.Chat.LLM
	provider := opts.provider
	if provider == "" {
		provider = chatLLM.Provider
	}
	model := opts.model
	if model == "" {
		model = chatLLM.Model
	}
	judgeModel := opts.judgeModel
	if judgeModel == "" {
		judgeModel = model
	}
	timeout := chatLLM.TimeoutDuration()
	if timeout <= 0 {
		timeout = 60 * time.Second
	}

	// Privacy warning when running against a real DB on a non-local
	// provider.
	if opts.dbPath != "" && !isLocalLLMProvider(provider) {
		fmt.Fprintf(os.Stderr,
			"warning: eval will send prompts derived from %s to %s.\n"+
				"Press Ctrl-C within 5s to abort.\n",
			opts.dbPath, provider,
		)
		time.Sleep(5 * time.Second)
	}

	// Open (or build) the store.
	store, fixture, cleanup, err := openEvalStore(opts.dbPath)
	if err != nil {
		return 0, err
	}
	defer cleanup()

	// Build LLM clients.
	client, err := llm.NewClient(provider, chatLLM.BaseURL, model, chatLLM.APIKey, timeout)
	if err != nil {
		return 0, fmt.Errorf("build chat client: %w", err)
	}
	judge := client
	if judgeModel != model {
		judge, err = llm.NewClient(provider, chatLLM.BaseURL, judgeModel, chatLLM.APIKey, timeout)
		if err != nil {
			return 0, fmt.Errorf("build judge client: %w", err)
		}
	}

	harnessCfg := ftseval.Config{
		DBPath:     opts.dbPath,
		Provider:   provider,
		Model:      model,
		JudgeModel: judgeModel,
		APIKey:     chatLLM.APIKey,
		Timeout:    timeout,
		Questions:  opts.questions,
		SkipJudge:  opts.skipJudge,
		NoAB:       opts.noAB,
		Format:     opts.format,
		Strict:     opts.strict,
	}

	ctx, cancel := signal.NotifyContext(context.Background(), os.Interrupt)
	defer cancel()

	results, err := ftseval.Run(ctx, harnessCfg, store, fixture, client, judge)
	if err != nil {
		return 0, fmt.Errorf("run eval: %w", err)
	}

	// Write report. Default format: "table" when writing to a TTY,
	// "markdown" otherwise (pipes, files, CI). --format overrides.
	out := defaultOut
	if opts.output != "" {
		f, err := os.Create(opts.output)
		if err != nil {
			return 0, fmt.Errorf("open report file: %w", err)
		}
		defer func() { _ = f.Close() }()
		out = f
	}
	if harnessCfg.Format == "" {
		if writerIsTerminal(out) {
			harnessCfg.Format = "table"
		} else {
			harnessCfg.Format = "markdown"
		}
	}
	if err := ftseval.WriteReport(out, harnessCfg, results); err != nil {
		return 0, fmt.Errorf("write report: %w", err)
	}

	return ftseval.ExitCode(harnessCfg, results), nil
}

// openEvalStore returns either the user-supplied SQLite store or a
// freshly-seeded fixture. The returned cleanup closes the store and, for
// the fixture path, removes the tempdir the fixture lives in.
func openEvalStore(
	dbPath string,
) (*data.Store, ftseval.SeededFixture, func(), error) {
	if dbPath != "" {
		s, err := data.Open(dbPath)
		if err != nil {
			return nil, ftseval.SeededFixture{}, nil, fmt.Errorf("open %s: %w", dbPath, err)
		}
		cleanup := func() { _ = s.Close() }
		return s, ftseval.SeededFixture{}, cleanup, nil
	}

	tmp, err := os.MkdirTemp("", "micasa-eval-*")
	if err != nil {
		return nil, ftseval.SeededFixture{}, nil, fmt.Errorf("create fixture tempdir: %w", err)
	}
	removeTmp := func() { _ = os.RemoveAll(tmp) }

	path := tmp + "/fixture.db"
	s, err := data.Open(path)
	if err != nil {
		removeTmp()
		return nil, ftseval.SeededFixture{}, nil, fmt.Errorf("open fixture: %w", err)
	}
	closeStore := func() { _ = s.Close() }
	if err := s.AutoMigrate(); err != nil {
		closeStore()
		removeTmp()
		return nil, ftseval.SeededFixture{}, nil, fmt.Errorf("migrate fixture: %w", err)
	}
	if err := s.SeedDefaults(); err != nil {
		closeStore()
		removeTmp()
		return nil, ftseval.SeededFixture{}, nil, fmt.Errorf("seed fixture defaults: %w", err)
	}
	fx, err := ftseval.SeedFixture(s)
	if err != nil {
		closeStore()
		removeTmp()
		return nil, ftseval.SeededFixture{}, nil, fmt.Errorf("seed fixture entities: %w", err)
	}

	cleanup := func() {
		closeStore()
		removeTmp()
	}
	return s, fx, cleanup, nil
}

// isLocalLLMProvider reports whether the named provider runs on the same
// machine (so no household data leaves the machine).
func isLocalLLMProvider(provider string) bool {
	switch strings.ToLower(provider) {
	case "ollama", "llamacpp", "llamafile":
		return true
	}
	return false
}
```

cmd/micasa/main.go

Lines changed: 1 addition & 0 deletions
```diff
@@ -79,6 +79,7 @@ func newRootCmd() *cobra.Command {
 		newGenCLIRefCmd(),
 		newDBCmd(),
 		newStatusCmd(),
+		newEvalCmd(),
 	)
 
 	return root
```

docs/content/docs/reference/cli.md

Lines changed: 56 additions & 0 deletions
````diff
@@ -35,6 +35,7 @@ micasa [database-path] [flags]
 - [`micasa config`](#micasa-config) -- Manage application configuration
 - [`micasa db`](#micasa-db) -- Read and write entity data
 - [`micasa demo`](#micasa-demo) -- Launch with sample data in an in-memory database
+- [`micasa eval`](#micasa-eval) -- Run chat-quality benchmarks against a fixture or user DB
 - [`micasa mcp`](#micasa-mcp) -- Run MCP server for LLM client access
 - [`micasa pro`](#micasa-pro) -- Manage micasa Pro sync
 - [`micasa query`](#micasa-query) -- Run a read-only SQL query
@@ -1643,6 +1644,61 @@ micasa demo [database-path] [flags]
 
 - [`micasa`](#micasa) -- A terminal UI for tracking everything about your home
 
+## micasa eval
+
+Parent command for chat-quality evaluations. See subcommands.
+
+### Flags
+
+| Flag | Default | Description |
+|------|---------|-------------|
+| `-h`, `--help` | - | help for eval |
+
+### Subcommands
+
+- [`micasa eval fts`](#micasa-eval-fts) -- Run the FTS context-enrichment chat benchmark
+
+### See also
+
+- [`micasa`](#micasa) -- A terminal UI for tracking everything about your home
+
+## micasa eval fts
+
+Run the FTS chat benchmark against the default fixture DB or a
+user-supplied SQLite file. Each question runs twice (FTS on and FTS off) and
+is graded by a deterministic regex rubric, with an optional LLM judge pass.
+
+The eval uses the chat config from the user's config file; --provider and
+--model override specific fields. Pointing --db at a real micasa DB sends
+prompts derived from household data to the configured provider -- if that
+provider is a cloud service, the data leaves the machine.
+
+### Usage
+
+```
+micasa eval fts [flags]
+```
+
+### Flags
+
+| Flag | Default | Description |
+|------|---------|-------------|
+| `--db` | - | path to a micasa SQLite DB (default: fixture) |
+| `--format` | - | report format: table (default when TTY), markdown, or json |
+| `-h`, `--help` | - | help for fts |
+| `--judge-model` | - | model for the LLM judge (default: same as --model) |
+| `--model` | - | override chat model from config |
+| `--no-ab` | - | run each question once (FTS on) instead of twice |
+| `--output` | - | write report to this file instead of stdout |
+| `--provider` | - | override chat provider from config |
+| `--questions` | `[]` | comma-separated names of questions to run (default: all) |
+| `--skip-judge` | - | deterministic rubric only; skip the LLM judge |
+| `--strict` | - | exit non-zero on per-question rubric regression (completed on both arms) |
+
+### See also
+
+- [`micasa eval`](#micasa-eval) -- Run chat-quality benchmarks against a fixture or user DB
+
 ## micasa mcp
 
 Start a Model Context Protocol server over stdio, exposing micasa data to LLM clients like Claude Desktop and Claude Code.
````
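The `--strict` rule documented above (exit non-zero only on a regression over questions completed on both arms) can be illustrated with a small sketch. The types and the `strictExitCode` name are hypothetical; the real `ftseval.ExitCode` may be shaped differently:

```go
package main

import "fmt"

// armOutcome is one arm's result for a question: its rubric score and
// whether the arm completed. Per the commit message, sql_error still
// counts as completed; provider errors do not.
type armOutcome struct {
	RubricScore int
	Completed   bool
}

type questionResult struct {
	Name   string
	FTSOn  armOutcome
	FTSOff armOutcome
}

// strictExitCode returns 1 when any question that completed on both
// arms scored worse with FTS on than with FTS off, else 0.
func strictExitCode(results []questionResult) int {
	for _, r := range results {
		if !r.FTSOn.Completed || !r.FTSOff.Completed {
			continue // a provider error on either arm makes the pair incomparable
		}
		if r.FTSOn.RubricScore < r.FTSOff.RubricScore {
			return 1
		}
	}
	return 0
}

func main() {
	results := []questionResult{
		{Name: "brand filter", FTSOn: armOutcome{3, true}, FTSOff: armOutcome{2, true}},
		{Name: "long-tail note", FTSOn: armOutcome{1, true}, FTSOff: armOutcome{2, true}},
	}
	fmt.Println(strictExitCode(results)) // 1: "long-tail note" regressed with FTS on
}
```

Skipping incomparable pairs rather than failing on them is what keeps flaky provider errors from turning `--strict` runs red in CI.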

flake.nix

Lines changed: 7 additions & 0 deletions
```diff
@@ -484,6 +484,13 @@
           | column -t
         '';
       };
+      fts-eval = pkgs.writeShellApplication {
+        name = "fts-eval";
+        runtimeInputs = [ self.packages.micasa ];
+        text = ''
+          exec micasa eval fts "$@"
+        '';
+      };
       run-pre-commit = pkgs.writeShellApplication {
         name = "run-pre-commit";
         runtimeInputs = [
```
