
Commit a9003fa

feat(cli): add micasa eval fts subcommand
Wires a chat-quality evaluation harness for the FTS-enrichment feature. No behavior change to the TUI chat pipeline -- this PR only adds the eval surface and exports the prompt-building helpers it needs.

- `internal/ftseval/` package: typed Config, Question, ArmResult, RunResult, GradeResult. Run() drives each question through both FTS arms against a pre-built store, grades with a deterministic regex rubric plus an optional LLM judge, and returns per-question results.
- `SeedFixture` populates projects, vendors, appliances, maintenance items, incidents, one service log, and one quote with the "permit delays" long-tail vendor note.
- Default question set covering disambiguation, cross-entity joins, service-log lookup, FTS-neutral aggregate, basement incidents, nonexistent entity, long-tail note, and brand filter.
- Judge-score sentinel -1 when the judge didn't run; 0-5 when it did. The judge parser tolerates real-world model output: markdown decoration, `:`/`=` separators, mixed case, leading <think>/<thinking>/<reasoning> blocks, and "Rationale" as an alias for "Reason". judge_reason surfaces in Notes when the score is the sentinel.
- Table report (default on TTYs, via lipgloss), markdown (default when piping or writing to a file), and JSON. JSON redacts APIKey. Judge-score aggregates exclude sentinel rows.
- `--strict` exits 1 on per-question FTS-on rubric regression over questions completed on both arms (sql_error counts as completed; provider errors don't).

Prompt-builder refactor (in `internal/llm/prompt.go`):

- `BuildTableInfo(store)` exports the former `app.buildTableInfoFrom` so the eval reproduces the schema section of chat prompts exactly.
- `BuildFTSContext(entries)` and `BuildFTSContextFromStore(store, q)` are the FTS-context formatters. They're unused on the chat path (chat passes `""` for ftsContext everywhere); the follow-up chat wiring PR routes real FTS results through them.
- `BuildSQLPrompt` / `BuildSummaryPrompt` / `BuildSystemPrompt` take a new `ftsContext string` positional arg. Chat passes `""` -- identical prompt text to pre-FTS behavior. The arg is load-bearing only when a caller populates it; the eval does, chat does not.

CLI: `micasa eval fts` with --db, --provider, --model, --judge-model, --questions, --skip-judge, --no-ab, --format, --output, --strict. The default fixture is built in a tempdir that cleans up on exit; --db points at an existing store. A privacy warning prints on stderr when running against a non-fixture DB on a non-local provider.

Nix: `nix run '.#fts-eval'` wraps the subcommand.

Refs #707.
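The judge-output tolerance described above (markdown decoration, `:`/`=` separators, mixed case, leading reasoning blocks, -1 sentinel) can be sketched roughly like this. This is an illustrative standalone sketch only: `stripReasoning`, `parseJudgeScore`, and the regex are hypothetical names, not the actual `internal/ftseval` implementation.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// stripReasoning drops a leading <think>/<thinking>/<reasoning> block, if any,
// so the score line underneath becomes visible to the regex.
func stripReasoning(s string) string {
	lower := strings.ToLower(s)
	for _, tag := range []string{"think", "thinking", "reasoning"} {
		open, clos := "<"+tag+">", "</"+tag+">"
		if strings.HasPrefix(lower, open) {
			if j := strings.Index(lower, clos); j >= 0 {
				return s[j+len(clos):]
			}
		}
	}
	return s
}

// Tolerates "Score: 4", "score = 4", "**Score**: 4", "# SCORE: 4", etc.
var scoreRe = regexp.MustCompile(`(?im)^[*#\s]*score[*\s]*[:=][*\s]*([0-5])\b`)

// parseJudgeScore returns 0-5 on success, or the -1 sentinel when no
// score line is found (mirroring the "judge didn't run" convention).
func parseJudgeScore(raw string) int {
	m := scoreRe.FindStringSubmatch(stripReasoning(raw))
	if m == nil {
		return -1
	}
	n, _ := strconv.Atoi(m[1]) // m[1] is a single digit by construction
	return n
}

func main() {
	fmt.Println(parseJudgeScore("**Score**: 4\nRationale: grounded answer")) // 4
	fmt.Println(parseJudgeScore("no score line at all"))                     // -1
}
```

A sentinel return (rather than an error) keeps the aggregation path simple: report aggregates can skip rows with -1 without branching on error values.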
1 parent 22b7fdb commit a9003fa

14 files changed

Lines changed: 2244 additions & 56 deletions


cmd/micasa/eval.go

Lines changed: 263 additions & 0 deletions
```go
// Copyright 2026 Phillip Cloud
// Licensed under the Apache License, Version 2.0

package main

import (
	"context"
	"fmt"
	"io"
	"os"
	"os/signal"
	"strings"
	"time"

	"github.com/micasa-dev/micasa/internal/config"
	"github.com/micasa-dev/micasa/internal/data"
	"github.com/micasa-dev/micasa/internal/ftseval"
	"github.com/micasa-dev/micasa/internal/llm"
	"github.com/spf13/cobra"
)

// evalOpts mirrors ftseval.Config plus CLI-only knobs. Populated by
// Cobra flag parsing; validated inside RunE.
type evalOpts struct {
	dbPath     string
	provider   string
	model      string
	judgeModel string
	questions  []string
	skipJudge  bool
	noAB       bool
	format     string
	output     string
	strict     bool
}

// newEvalCmd returns the `micasa eval` parent command. Sub-evals slot in
// as children (`eval fts`, future `eval extraction`, etc.).
func newEvalCmd() *cobra.Command {
	cmd := &cobra.Command{
		Use:           "eval",
		Short:         "Run chat-quality benchmarks against a fixture or user DB",
		Long:          `Parent command for chat-quality evaluations. See subcommands.`,
		SilenceErrors: true,
		SilenceUsage:  true,
	}
	cmd.AddCommand(newEvalFTSCmd())
	return cmd
}

func newEvalFTSCmd() *cobra.Command {
	opts := &evalOpts{}
	cmd := &cobra.Command{
		Use:   "fts",
		Short: "Run the FTS context-enrichment chat benchmark",
		Long: `Run the FTS chat benchmark against the default fixture DB or a
user-supplied SQLite file. Each question runs twice (FTS on and FTS off) and
is graded by a deterministic regex rubric, with an optional LLM judge pass.

The eval uses the chat config from the user's config file; --provider and
--model override specific fields. Pointing --db at a real micasa DB sends
prompts derived from household data to the configured provider -- if that
provider is a cloud service, the data leaves the machine.`,
		SilenceErrors: true,
		SilenceUsage:  true,
		RunE: func(cmd *cobra.Command, args []string) error {
			return runEvalFTS(cmd.OutOrStdout(), opts)
		},
	}

	cmd.Flags().StringVar(&opts.dbPath, "db", "",
		"path to a micasa SQLite DB (default: fixture)")
	cmd.Flags().StringVar(&opts.provider, "provider", "",
		"override chat provider from config")
	cmd.Flags().StringVar(&opts.model, "model", "",
		"override chat model from config")
	cmd.Flags().StringVar(&opts.judgeModel, "judge-model", "",
		"model for the LLM judge (default: same as --model)")
	cmd.Flags().StringSliceVar(&opts.questions, "questions", nil,
		"comma-separated names of questions to run (default: all)")
	cmd.Flags().BoolVar(&opts.skipJudge, "skip-judge", false,
		"deterministic rubric only; skip the LLM judge")
	cmd.Flags().BoolVar(&opts.noAB, "no-ab", false,
		"run each question once (FTS on) instead of twice")
	cmd.Flags().StringVar(&opts.format, "format", "",
		"report format: table (default when TTY), markdown, or json")
	cmd.Flags().StringVar(&opts.output, "output", "",
		"write report to this file instead of stdout")
	cmd.Flags().BoolVar(&opts.strict, "strict", false,
		"exit non-zero on per-question rubric regression (completed on both arms)")

	return cmd
}

func runEvalFTS(defaultOut interface {
	Write([]byte) (int, error)
}, opts *evalOpts,
) error {
	cfg, err := config.Load()
	if err != nil {
		return fmt.Errorf("load config: %w", err)
	}

	chatLLM := cfg.Chat.LLM
	provider := opts.provider
	if provider == "" {
		provider = chatLLM.Provider
	}
	model := opts.model
	if model == "" {
		model = chatLLM.Model
	}
	judgeModel := opts.judgeModel
	if judgeModel == "" {
		judgeModel = model
	}
	timeout := chatLLM.TimeoutDuration()
	if timeout <= 0 {
		timeout = 60 * time.Second
	}

	// Privacy warning when running against a real DB on a non-local
	// provider.
	if opts.dbPath != "" && !isLocalLLMProvider(provider) {
		fmt.Fprintf(os.Stderr,
			"warning: eval will send prompts derived from %s to %s.\n"+
				"Press Ctrl-C within 5s to abort.\n",
			opts.dbPath, provider,
		)
		time.Sleep(5 * time.Second)
	}

	// Open (or build) the store.
	store, fixture, cleanup, err := openEvalStore(opts.dbPath)
	if err != nil {
		return err
	}
	defer cleanup()

	// Build LLM clients.
	client, err := llm.NewClient(provider, chatLLM.BaseURL, model, chatLLM.APIKey, timeout)
	if err != nil {
		return fmt.Errorf("build chat client: %w", err)
	}
	judge := client
	if judgeModel != model {
		judge, err = llm.NewClient(provider, chatLLM.BaseURL, judgeModel, chatLLM.APIKey, timeout)
		if err != nil {
			return fmt.Errorf("build judge client: %w", err)
		}
	}

	harnessCfg := ftseval.Config{
		DBPath:     opts.dbPath,
		Provider:   provider,
		Model:      model,
		JudgeModel: judgeModel,
		APIKey:     chatLLM.APIKey,
		Timeout:    timeout,
		Questions:  opts.questions,
		SkipJudge:  opts.skipJudge,
		NoAB:       opts.noAB,
		Format:     opts.format,
		Strict:     opts.strict,
	}

	ctx, cancel := signal.NotifyContext(context.Background(), os.Interrupt)
	defer cancel()

	results, err := ftseval.Run(ctx, harnessCfg, store, fixture, client, judge)
	if err != nil {
		return fmt.Errorf("run eval: %w", err)
	}

	// Write report. Default format: "table" when writing to a TTY,
	// "markdown" otherwise (pipes, files, CI). --format overrides.
	out := io.Writer(defaultOut)
	if opts.output != "" {
		f, err := os.Create(opts.output)
		if err != nil {
			return fmt.Errorf("open report file: %w", err)
		}
		defer func() { _ = f.Close() }()
		out = f
	}
	if harnessCfg.Format == "" {
		if writerIsTerminal(out) {
			harnessCfg.Format = "table"
		} else {
			harnessCfg.Format = "markdown"
		}
	}
	if err := ftseval.WriteReport(out, harnessCfg, results); err != nil {
		return fmt.Errorf("write report: %w", err)
	}

	if code := ftseval.ExitCode(harnessCfg, results); code != 0 {
		os.Exit(code)
	}
	return nil
}

// openEvalStore returns either the user-supplied SQLite store or a
// freshly-seeded fixture. The returned cleanup closes the store and, for
// the fixture path, removes the tempdir the fixture lives in.
func openEvalStore(
	dbPath string,
) (*data.Store, ftseval.SeededFixture, func(), error) {
	if dbPath != "" {
		s, err := data.Open(dbPath)
		if err != nil {
			return nil, ftseval.SeededFixture{}, nil, fmt.Errorf("open %s: %w", dbPath, err)
		}
		cleanup := func() { _ = s.Close() }
		return s, ftseval.SeededFixture{}, cleanup, nil
	}

	tmp, err := os.MkdirTemp("", "micasa-eval-*")
	if err != nil {
		return nil, ftseval.SeededFixture{}, nil, fmt.Errorf("create fixture tempdir: %w", err)
	}
	removeTmp := func() { _ = os.RemoveAll(tmp) }

	path := tmp + "/fixture.db"
	s, err := data.Open(path)
	if err != nil {
		removeTmp()
		return nil, ftseval.SeededFixture{}, nil, fmt.Errorf("open fixture: %w", err)
	}
	closeStore := func() { _ = s.Close() }
	if err := s.AutoMigrate(); err != nil {
		closeStore()
		removeTmp()
		return nil, ftseval.SeededFixture{}, nil, fmt.Errorf("migrate fixture: %w", err)
	}
	if err := s.SeedDefaults(); err != nil {
		closeStore()
		removeTmp()
		return nil, ftseval.SeededFixture{}, nil, fmt.Errorf("seed fixture defaults: %w", err)
	}
	fx, err := ftseval.SeedFixture(s)
	if err != nil {
		closeStore()
		removeTmp()
		return nil, ftseval.SeededFixture{}, nil, fmt.Errorf("seed fixture entities: %w", err)
	}

	cleanup := func() {
		closeStore()
		removeTmp()
	}
	return s, fx, cleanup, nil
}

// isLocalLLMProvider reports whether the named provider runs on the same
// machine (so no household data leaves the machine).
func isLocalLLMProvider(provider string) bool {
	switch strings.ToLower(provider) {
	case "ollama", "llamacpp", "llamafile":
		return true
	}
	return false
}
```
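The `--strict` rule implemented behind `ftseval.ExitCode` can be sketched in isolation. The type and field names below (`armPair`, `OnDone`, `OffScore`, `strictExitCode`) are hypothetical stand-ins, not the real ftseval types; the point is the gating logic: only questions completed on both arms are compared, and any FTS-on rubric score below its FTS-off counterpart fails the run.

```go
package main

import "fmt"

// armPair is a hypothetical per-question result pair. "Done" means the arm
// completed: per the commit message, sql_error counts as completed while
// provider errors do not.
type armPair struct {
	Name             string
	OnDone, OffDone  bool // did each arm complete?
	OnScore, OffScore int  // deterministic rubric scores
}

// strictExitCode returns 1 when any question that completed on both arms
// scores worse with FTS on than with FTS off; otherwise 0.
func strictExitCode(results []armPair) int {
	for _, r := range results {
		if r.OnDone && r.OffDone && r.OnScore < r.OffScore {
			return 1
		}
	}
	return 0
}

func main() {
	rs := []armPair{
		{Name: "brand_filter", OnDone: true, OffDone: true, OnScore: 3, OffScore: 2},
		{Name: "long_tail_note", OnDone: true, OffDone: true, OnScore: 1, OffScore: 2},
	}
	fmt.Println(strictExitCode(rs)) // 1: long_tail_note regressed with FTS on
}
```

Gating on "completed on both arms" keeps flaky provider errors from masquerading as quality regressions in CI, while still letting genuinely worse SQL (sql_error) count against the FTS-on arm.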

cmd/micasa/main.go

Lines changed: 1 addition & 0 deletions
```diff
@@ -79,6 +79,7 @@ func newRootCmd() *cobra.Command {
 		newGenCLIRefCmd(),
 		newDBCmd(),
 		newStatusCmd(),
+		newEvalCmd(),
 	)

 	return root
```

docs/content/docs/reference/cli.md

Lines changed: 56 additions & 0 deletions
````diff
@@ -35,6 +35,7 @@ micasa [database-path] [flags]
 - [`micasa config`](#micasa-config) -- Manage application configuration
 - [`micasa db`](#micasa-db) -- Read and write entity data
 - [`micasa demo`](#micasa-demo) -- Launch with sample data in an in-memory database
+- [`micasa eval`](#micasa-eval) -- Run chat-quality benchmarks against a fixture or user DB
 - [`micasa mcp`](#micasa-mcp) -- Run MCP server for LLM client access
 - [`micasa pro`](#micasa-pro) -- Manage micasa Pro sync
 - [`micasa query`](#micasa-query) -- Run a read-only SQL query
@@ -1643,6 +1644,61 @@ micasa demo [database-path] [flags]

 - [`micasa`](#micasa) -- A terminal UI for tracking everything about your home

+## micasa eval
+
+Parent command for chat-quality evaluations. See subcommands.
+
+### Flags
+
+| Flag | Default | Description |
+|------|---------|-------------|
+| `-h`, `--help` | - | help for eval |
+
+### Subcommands
+
+- [`micasa eval fts`](#micasa-eval-fts) -- Run the FTS context-enrichment chat benchmark
+
+### See also
+
+- [`micasa`](#micasa) -- A terminal UI for tracking everything about your home
+
+## micasa eval fts
+
+Run the FTS chat benchmark against the default fixture DB or a
+user-supplied SQLite file. Each question runs twice (FTS on and FTS off) and
+is graded by a deterministic regex rubric, with an optional LLM judge pass.
+
+The eval uses the chat config from the user's config file; --provider and
+--model override specific fields. Pointing --db at a real micasa DB sends
+prompts derived from household data to the configured provider -- if that
+provider is a cloud service, the data leaves the machine.
+
+### Usage
+
+```
+micasa eval fts [flags]
+```
+
+### Flags
+
+| Flag | Default | Description |
+|------|---------|-------------|
+| `--db` | - | path to a micasa SQLite DB (default: fixture) |
+| `--format` | - | report format: table (default when TTY), markdown, or json |
+| `-h`, `--help` | - | help for fts |
+| `--judge-model` | - | model for the LLM judge (default: same as --model) |
+| `--model` | - | override chat model from config |
+| `--no-ab` | - | run each question once (FTS on) instead of twice |
+| `--output` | - | write report to this file instead of stdout |
+| `--provider` | - | override chat provider from config |
+| `--questions` | `[]` | comma-separated names of questions to run (default: all) |
+| `--skip-judge` | - | deterministic rubric only; skip the LLM judge |
+| `--strict` | - | exit non-zero on per-question rubric regression (completed on both arms) |
+
+### See also
+
+- [`micasa eval`](#micasa-eval) -- Run chat-quality benchmarks against a fixture or user DB
+
 ## micasa mcp

 Start a Model Context Protocol server over stdio, exposing micasa data to LLM clients like Claude Desktop and Claude Code.
````

flake.nix

Lines changed: 7 additions & 0 deletions
```diff
@@ -484,6 +484,13 @@
         | column -t
       '';
     };
+    fts-eval = pkgs.writeShellApplication {
+      name = "fts-eval";
+      runtimeInputs = [ self.packages.micasa ];
+      text = ''
+        exec micasa eval fts "$@"
+      '';
+    };
     run-pre-commit = pkgs.writeShellApplication {
       name = "run-pre-commit";
       runtimeInputs = [
```
