|
| 1 | +# PR #7 Review: `feat(operator): operator UI, GitHub PAT auth, AI rule suggester` |
| 2 | + |
| 3 | +**Repo:** grove-platform/github-copier |
| 4 | +**PR:** https://github.com/grove-platform/github-copier/pull/7 |
| 5 | +**Branch:** `feat/operator-ui-audit` → `main` |
| 6 | +**Scope:** +5,373 / −73 across 28 files |
| 7 | +**Reviewer perspective:** experienced fullstack, webserver-app focus |
| 8 | + |
| 9 | +## Overview |
| 10 | + |
| 11 | +This PR ships a major v0.4.0 milestone: a 5-tab operator UI backed by GitHub PAT auth with role derivation, webhook replay with per-repo permission checks, and an AI rule suggester with swappable Ollama/Anthropic providers through the Grove Foundry APIM gateway. One ~4,000 line embedded HTML bundle carries most of the frontend. |
| 12 | + |
| 13 | +The architecture is sound: `LLMClient` interface with per-provider implementations, self-verifying LLM output against the in-process `PatternMatcher`, fail-closed config validation (`OPERATOR_UI_ENABLED` + `OPERATOR_AUTH_REPO`), and Secret Manager loading for the Anthropic key. The `#nosec` annotations are justified — base URLs are operator-controlled, not user-controlled. |
| 14 | + |
| 15 | +--- |
| 16 | + |
| 17 | +## Blocking / High-priority issues |
| 18 | + |
| 19 | +### 1. Permission-check soft-fail grants writer role by default — auth bypass risk |
| 20 | +**File:** `services/operator_auth.go:186-190` |
| 21 | + |
| 22 | +```go |
| 23 | +if err != nil { |
| 24 | + LogWarning("GitHub permission check failed, defaulting to writer role", ...) |
| 25 | + return user, nil |
| 26 | +} |
| 27 | +``` |
| 28 | + |
| 29 | +The PR description's spec says *"None / 404 → denied → 401 Unauthorized"*, but the code returns **writer** on any error from `ghAPIGetRepoPermission` — including GitHub's 404 for "not a collaborator". Combined with `--allow-unauthenticated` on the Cloud Run service, **any** valid GitHub PAT lets a caller view audit logs, webhook traces, workflows, logs drawer, and invoke the AI suggester (which costs real Anthropic tokens). |
| 30 | + |
| 31 | +**Fix:** on 404, return `RoleDenied` + error; only soft-fail on transient 5xx. Distinguish the two cases in `ghAPIGetRepoPermission` by inspecting the status code. |
| 32 | + |
| 33 | +### 2. LLM cost exposure — `/suggest-rule` is unbounded per user |
| 34 | +**File:** `services/operator_suggest_rule.go:55` |
| 35 | + |
| 36 | +Any authenticated writer can call `/operator/api/suggest-rule` with no per-user rate limit. Combined with issue #1, this is effectively "anyone with a GitHub PAT can spend Anthropic credits." Add a per-token token bucket (even generous, e.g. 30/hour) before release. |
| 37 | + |
| 38 | +### 3. `/llm/status` calls real Anthropic `/v1/messages` on every refresh |
| 39 | +**File:** `services/llm_anthropic.go:105-131` |
| 40 | + |
| 41 | +The ping uses `max_tokens=1` but still hits the Messages API (one input + one output token) on every status tab refresh (writers poll this for the status badge). Cache the last successful ping for ~30s, or degrade to a cheap HEAD/OPTIONS against the gateway. |
| 42 | + |
| 43 | +### 4. Raw PATs stored as map keys |
| 44 | +**File:** `services/operator_auth.go:49`, `services/operator_auth.go:104` |
| 45 | + |
| 46 | +`entries[token]` and `repoPerm[token+"\x00"+repo]` keep full PATs live in the heap for the 5-min TTL. A memory dump (crash, Cloud Run profile, goroutine dump) would leak every active operator's token. Hash with SHA-256 and key the maps by digest — same security semantics, no plaintext secrets in process memory. |
| 47 | + |
| 48 | +--- |
| 49 | + |
| 50 | +## Medium — worth addressing before merge |
| 51 | + |
| 52 | +### 5. Global LLM mutation surprises multi-operator use |
| 53 | +**File:** `services/operator_llm_admin.go:84-88` |
| 54 | + |
| 55 | +`SetActiveModel`/`SetBaseURL` mutate the shared `o.llm` client. Two operators on the UI at once will clobber each other's model choice, and changes silently revert on restart. Either document this as "last write wins, ephemeral" in the UI or make the active model per-request. |
| 56 | + |
| 57 | +### 6. Test coverage gap on the new security-critical surface |
| 58 | + |
| 59 | +The diff adds ~3 test files (audit, delivery tracker, github_write_to_target) but **zero tests** for: |
| 60 | +- `services/operator_auth.go` |
| 61 | +- `services/operator_ui.go` |
| 62 | +- `services/operator_suggest_rule.go` |
| 63 | +- `services/llm_anthropic.go` |
| 64 | +- `services/llm_client.go` |
| 65 | +- `services/operator_llm_admin.go` |
| 66 | + |
| 67 | +At minimum, add table-driven tests for `validateGitHubPAT` role mapping (including the 404 case from issue #1) and `verifySuggestedRule` for each transform type (move/copy/glob/regex). |
| 68 | + |
| 69 | +### 7. `handleRepoPermission` swallows the real error |
| 70 | +**File:** `services/operator_ui.go:225` |
| 71 | + |
| 72 | +```go |
| 73 | +canRead, _ := o.ghCache.CanUserReadRepo(ctx, userPAT, user.Login, repo) |
| 74 | +``` |
| 75 | + |
| 76 | +A GitHub rate-limit or 5xx is indistinguishable from "no access" to the frontend, so users see disabled replay buttons with no indication why. Return a per-repo `{allowed: bool, error?: string}` shape instead of `map[string]bool`. |
| 77 | + |
| 78 | +### 8. `githubCreateVersionTag` path components aren't escaped |
| 79 | +**File:** `services/operator_ui.go:805`, `services/operator_ui.go:842` |
| 80 | + |
| 81 | +Unlike `ghAPIGetRepoPermission`, this path uses raw `fmt.Sprintf` with `owner`, `repo`, `baseBranch`. The inputs are trusted (env vars), but apply the same `ghUsernameRe`/`ghRepoNameRe` + `url.PathEscape` treatment for defense-in-depth and consistency — you already established the pattern one file over. |
| 82 | + |
| 83 | +### 9. Anthropic fallback model list is a maintenance hazard |
| 84 | +**File:** `services/llm_anthropic.go:26-30` |
| 85 | + |
| 86 | +Hard-coding `claude-opus-4-7`, `claude-sonnet-4-6`, `claude-haiku-4-5-20251001` in the fallback means when those rotate you'll ship dead options to the UI. Either load from an embedded config file or trim to a single known-stable haiku alias. |
| 87 | + |
| 88 | +### 10. Model name inconsistency across configs |
| 89 | +- `configs/environment.go:281` → `claude-haiku-4-5-20251001` |
| 90 | +- `.github/workflows/ci.yml` → `claude-haiku-4-5` |
| 91 | + |
| 92 | +Both resolve via Anthropic aliasing, but pick one. The dated form is more reproducible; the alias drifts. |
| 93 | + |
| 94 | +--- |
| 95 | + |
| 96 | +## Minor / nits |
| 97 | + |
| 98 | +- `services/operator_ui.go:873-875`: `githubHTTPClient()` allocates a new `*http.Client` per call. Make it a package var. |
| 99 | +- `services/operator_auth.go:93-100` + `services/operator_auth.go:123-130`: cache eviction is O(n) inside the write lock; fine at 100/500, just worth knowing. |
| 100 | +- `services/web/operator/index.html` at ~4k lines is hard to review and diff. Splitting CSS/JS into sibling files and using `//go:embed web/operator/*` would make this reviewable and cacheable. |
| 101 | +- `services/operator_ui.go:461-462`: `releaseMode` is only `"disabled"` vs `"tag_create_enabled"` — consider a typed enum. |
| 102 | +- `services/operator_suggest_rule.go:344-349`: `truncate(s, n)` counts bytes not runes — cuts inside multi-byte glyphs. Use `utf8.RuneCountInString`-bounded slicing. |
| 103 | +- `services/operator_ui.go:738`: `operator_replay` deliveryID uses `time.Now().UnixMilli()` — two replays in the same ms collide. Add a short random suffix. |
| 104 | + |
| 105 | +--- |
| 106 | + |
| 107 | +## What's solid |
| 108 | + |
| 109 | +- **SSRF hardening** in `ghAPIGetRepoPermission` is textbook — whitelist regex + `url.PathEscape` + pinned host. |
| 110 | +- **Fail-closed startup** in `validateOperatorAuth` correctly prevents the "UI enabled but no auth repo" footgun. |
| 111 | +- **Self-verification** of LLM output via `PatternMatcher` before showing to the user is the right pattern — catches hallucinated rules before they ship. |
| 112 | +- **In-flight dedup** on replay + per-repo permission gating on the source repo is a well-considered authorization model. |
| 113 | +- **Streaming NDJSON** for Ollama pulls with both request-context cancel and 20-min safety timeout is handled correctly. |
| 114 | +- Scoping `write` → `writer` (not `operator`) based on real docs-repo access patterns is exactly the kind of call you want a human to make, and the comment explaining it is excellent. |
| 115 | +- **Dual-header auth** (`x-api-key` + `api-key`) lets one client work against native Anthropic or the Azure APIM gateway — clean solution to a real deployment-flexibility problem. |
| 116 | + |
| 117 | +--- |
| 118 | + |
| 119 | +## Recommendation |
| 120 | + |
| 121 | +**Request changes** — items 1 and 2 are release-blocking for an `--allow-unauthenticated` Cloud Run deploy. Items 3, 4, and 6 are strongly recommended before cutting v0.4.0. The rest are polish. |
| 122 | + |
| 123 | +--- |
| 124 | + |
| 125 | +## For the next agent |
| 126 | + |
| 127 | +If you're picking this up to address the findings, suggested order: |
| 128 | + |
| 129 | +1. Fix issue #1 first — distinguish 404 from transient errors in `ghAPIGetRepoPermission`, return `RoleDenied` on 404. Add a test that asserts the 404 path returns denied. |
| 130 | +2. Add the token bucket for #2 (keyed by SHA-256 of PAT). |
| 131 | +3. While in `operator_auth.go`, do #4 (hash keys) in the same pass. |
| 132 | +4. Cache the `/llm/status` ping (#3) — 30s TTL stored on the `operatorUI` struct. |
| 133 | +5. Backfill tests (#6) covering the new paths. |
| 134 | +6. Sweep the mediums/minors. |
| 135 | + |
| 136 | +The PR description's "Authentication & authorization" table is the source of truth for the intended role mapping — the code should match it exactly. |
0 commit comments