docs: document operator UI, PAT auth, and AI rule suggester

cbullinger · cbullinger · commit 89ecf99b3f64 · 2026-04-24T09:48:55.000-04:00
Bring the doc set up-to-date with what this PR ships so new devs and operators aren't figuring out a live feature set from source.

- README.md: new "Operator UI" section under Monitoring covering enable flags, role mapping (admin/maintain → operator, write/triage/read → writer), per-repo replay authorization, and AI-suggester providers. Enhanced Features list gains an "Operator UI" group. Tools list gains test-llm and test-pem entries.
- docs/CONFIG-REFERENCE.md: new "Operator UI" and "AI Rule Suggester (LLM)" env-var tables covering OPERATOR_UI_ENABLED, OPERATOR_AUTH_REPO, OPERATOR_REPO_SLUG, OPERATOR_RELEASE_*, LLM_PROVIDER, LLM_BASE_URL, LLM_MODEL, ANTHROPIC_API_KEY, ANTHROPIC_API_KEY_SECRET_NAME. Calls out the 30/hour/PAT rate limit on /suggest-rule.
- docs/DEPLOYMENT.md: Secret Manager step #4 for anthropic-api-key plus the IAM binding; pre-deploy checklist gains the operator-UI auth repo bullet; post-deploy smoke test for the operator UI + AI settings.
- docs/LOCAL-TESTING.md: "Optional (for Operator UI + AI rule suggester)" env-var block and a step-by-step "Testing the Operator UI Locally" section that points at cmd/test-llm for provider verification.
- docs/FAQ.md: new "Operator UI" section (what it is, who can access, how the AI suggester works, how to debug "not connected").
- AGENT.md: full rewrite. Expanded file map covers all operator_*.go, llm_*.go, web/operator/index.html embed, webhook_trace_buffer, and log_buffer. New sections on authorization model, security posture (auth fail-closed, PAT hashing, SSRF defense-in-depth, LLM cost cap), and edit patterns for operator UI / LLM provider work. Key doc table rebuilt with clickable links.
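
For reviewers checking the role tables in the diffs below, the documented authorization model can be sketched in a few lines of Go. This is only an illustration of the mapping as the docs describe it, not the code in `operator_*.go`: the `Role` type and the `roleForPermission` and `canReplay` helpers are hypothetical names, and the assumption that an unrecognized permission fails closed to "denied" is inferred from the "auth fail-closed" note above.

```go
package main

import "fmt"

// Role is the operator UI role derived from a GitHub repo permission.
// Hypothetical type, introduced only for this sketch.
type Role string

const (
	RoleOperator Role = "operator"
	RoleWriter   Role = "writer"
	RoleDenied   Role = "denied"
)

// roleForPermission maps the caller's GitHub permission on
// OPERATOR_AUTH_REPO to a UI role, following the documented table.
func roleForPermission(permission string) Role {
	switch permission {
	case "admin", "maintain":
		return RoleOperator
	case "write", "triage", "read":
		// write is deliberately a writer, not an operator.
		return RoleWriter
	default:
		// Assumption: anything unrecognized (including no access)
		// fails closed.
		return RoleDenied
	}
}

// canReplay applies the per-repo rule on top of the role: replay needs
// the operator role and read access to the webhook's source repo.
func canReplay(role Role, hasSourceRepoRead bool) bool {
	return role == RoleOperator && hasSourceRepoRead
}

func main() {
	for _, p := range []string{"admin", "maintain", "write", "triage", "read", ""} {
		role := roleForPermission(p)
		fmt.Printf("%q -> %s (replay with source read: %v)\n",
			p, role, canReplay(role, true))
	}
}
```

Keeping the role lookup separate from the repo-scoped replay check mirrors how the docs present them: the role comes from one configured repo, while replay is additionally gated per source repo.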
diff --git a/AGENT.md b/AGENT.md
diff --git a/README.md b/README.md
@@ -29,6 +29,13 @@ A GitHub app that automatically copies code examples and files from source repos
 - **Development Tools** - Dry-run mode, CLI validation, enhanced logging
 - **Thread-Safe** - Concurrent webhook processing with proper state management
 
+### Operator UI
+- **Web dashboard at `/operator/`** - Five-tab UI (Overview, Webhooks, Audit, Workflows, System) with dark mode, keyboard shortcuts, and shareable URLs
+- **GitHub PAT authentication** - Users sign in with their personal access token; role is derived from their permission on a configured auth repo (`admin`/`maintain` → operator, `write`/`triage`/`read` → writer)
+- **Per-repo replay authorization** - Replay requires the caller's PAT to have read access to the source repo of the webhook being replayed
+- **Writer-facing tools** - Workflow browser, PR lookup, recent copies feed, file match tester, audit drawer, per-delivery log viewer
+- **AI rule suggester** - Paste a source/target pair; get a generated copier rule self-verified against the in-process pattern matcher. Two providers: [Anthropic](https://www.anthropic.com/) (hosted, default in prod via the Grove Foundry APIM gateway) or [Ollama](https://ollama.com) (local, for dev)
+
 ## 🚀 Quick Start
 
 ### Prerequisites
@@ -385,6 +392,47 @@ Get performance metrics:
 curl http://localhost:8080/metrics
 ```
 
+## Operator UI
+
+The operator UI is a web dashboard served from `/operator/` for diagnosing webhook processing, replaying failed deliveries, browsing workflows, and generating copier rules with AI assistance.
+
+### Enabling the UI
+
+Set the required env vars:
+
+```yaml
+OPERATOR_UI_ENABLED: "true"
+OPERATOR_AUTH_REPO: "your-org/some-repo"  # user permissions here determine role
+OPERATOR_REPO_SLUG: "your-org/some-repo"  # optional; enables audit-row deep links
+```
+
+**Startup fails** if `OPERATOR_UI_ENABLED=true` without `OPERATOR_AUTH_REPO` — this prevents an accidentally-open operator UI.
+
+### Authentication and roles
+
+Each user authenticates with their own **GitHub Personal Access Token**. Paste the PAT into the sign-in prompt; the server checks the user's permission on `OPERATOR_AUTH_REPO` and assigns a role:
+
+| GitHub permission | Operator UI role | Can do |
+|---|---|---|
+| `admin` / `maintain` | **operator** | View everything; replay deliveries; cut release tags; change AI settings |
+| `write` / `triage` / `read` | **writer** | View workflows, audit, recent copies, file match tester, AI rule suggester |
+| None | **denied** | 401 Unauthorized |
+
+`write` maps to writer (not operator) so typical docs contributors with repo write access can't replay deliveries or cut releases — those need an explicit `admin` / `maintain` grant.
+
+On top of the role, **replay is repo-scoped**: the user's PAT must also have read access to the source repo of the webhook being replayed.
+
+### AI rule suggester
+
+The operator UI includes an LLM-backed helper that takes a source/target file pair and returns a generated copier workflow rule, self-verified against the in-process pattern matcher before display.
+
+Two providers are supported via `LLM_PROVIDER`:
+
+- **`anthropic`** (default in Cloud Run): calls the Anthropic Messages API. For MongoDB deployments this routes through the Grove Foundry APIM gateway — set `LLM_BASE_URL=https://grove-gateway-prod.azure-api.net/grove-foundry-prod/anthropic` and load the gateway key from Secret Manager via `ANTHROPIC_API_KEY_SECRET_NAME`.
+- **`ollama`** (default for local dev): runs against a local Ollama instance at `http://localhost:11434`. Connect, pull models, and switch the active model from the UI's System → AI settings panel without a redeploy.
+
+Smoke-test the LLM provider end-to-end with [`cmd/test-llm`](cmd/test-llm/README.md).
+
 ## Audit Logging
 
 When enabled, all operations are logged to MongoDB:
@@ -598,4 +646,6 @@ See [DEPLOYMENT.md](./docs/DEPLOYMENT.md) for the complete deployment and rollba
 
 - **[Config Validator](cmd/config-validator/README.md)** - CLI tool for validating configs
 - **[Test Webhook](cmd/test-webhook/README.md)** - CLI tool for testing webhooks
+- **[Test PEM](cmd/test-pem/README.md)** - CLI tool for verifying the GitHub App private key
+- **[Test LLM](cmd/test-llm/README.md)** - CLI tool for smoke-testing the AI rule suggester's LLM provider
 - **[Scripts](scripts/README.md)** - Helper scripts for deployment, testing, and releases
diff --git a/docs/CONFIG-REFERENCE.md b/docs/CONFIG-REFERENCE.md
@@ -15,6 +15,8 @@ Complete reference for all github-copier configuration options: environment vari
   - [Audit Logging](#audit-logging)
   - [GitHub API Tuning](#github-api-tuning)
   - [Webhook Processing](#webhook-processing)
+  - [Operator UI](#operator-ui)
+  - [AI Rule Suggester (LLM)](#ai-rule-suggester-llm)
   - [Google Cloud](#google-cloud)
 - [Workflow YAML Schema](#workflow-yaml-schema)
   - [Main Config](#main-config)
@@ -127,6 +129,32 @@ Set via `.env` files, `env-cloudrun.yaml`, or process environment.
 | `WEBHOOK_MAX_RETRIES` | int | `2` | Max retry attempts for failed webhook processing (total attempts = retries + 1). |
 | `WEBHOOK_RETRY_INITIAL_DELAY` | int | `5` | Initial delay between retries in **seconds** (doubles each attempt). |
 
+### Operator UI
+
+Mount the web dashboard at `/operator/` (see the [Operator UI section of the README](../README.md#operator-ui) for access model, roles, and feature overview). Off unless `OPERATOR_UI_ENABLED=true`.
+
+| Variable | Type | Default | Description |
+|----------|------|---------|-------------|
+| `OPERATOR_UI_ENABLED` | bool | `false` | Enable the operator UI routes (`/operator/*`). |
+| `OPERATOR_AUTH_REPO` | string | — | `owner/repo` — the user's permission on this repo determines their role (`admin`/`maintain` → operator, `write`/`triage`/`read` → writer). **Required** when the UI is enabled — startup fails otherwise. |
+| `OPERATOR_REPO_SLUG` | string | — | `owner/repo` used to build clickable GitHub links in audit/trace rows. Optional. |
+| `OPERATOR_RELEASE_GITHUB_TOKEN` | string | — | PAT with `contents:write` used by the UI to create version tag refs. Optional; without it the release button is hidden. |
+| `OPERATOR_RELEASE_TARGET_BRANCH` | string | `main` | Branch whose HEAD SHA is tagged when cutting a release from the UI. |
+
+### AI Rule Suggester (LLM)
+
+Powers `/operator/api/suggest-rule`. The feature surface is always available when the operator UI is enabled; connectivity to the configured provider is checked at request time, and operators can switch model / base URL from the UI at runtime (process-global, reverts on restart).
+
+| Variable | Type | Default | Description |
+|----------|------|---------|-------------|
+| `LLM_PROVIDER` | string | `ollama` | Provider selector: `ollama` (local) or `anthropic` (hosted, default in Cloud Run). |
+| `LLM_BASE_URL` | string | per-provider | Provider endpoint. Default `http://localhost:11434` for Ollama or `https://api.anthropic.com` for Anthropic. For MongoDB's Grove Foundry APIM gateway, use `https://grove-gateway-prod.azure-api.net/grove-foundry-prod/anthropic`. |
+| `LLM_MODEL` | string | per-provider | Initial active model. Default `qwen2.5-coder:7b` for Ollama or `claude-haiku-4-5` for Anthropic. |
+| `ANTHROPIC_API_KEY` | string | — | Anthropic API / gateway key. Loaded directly from the env for local dev. Ignored when `LLM_PROVIDER=ollama`. |
+| `ANTHROPIC_API_KEY_SECRET_NAME` | string | — | GCP Secret Manager name for the Anthropic key; used in Cloud Run so no key material is ever in env vars or YAML. Short name (e.g. `anthropic-api-key`) is resolved to a full path via `SecretPath()`. |
+
+The suggester is rate-limited to 30 requests/hour per authenticated user (keyed by hashed PAT) to cap provider cost. Denied requests return 429 with a `Retry-After` header.
+
 ### Google Cloud
 
 | Variable | Type | Default | Description |
diff --git a/docs/DEPLOYMENT.md b/docs/DEPLOYMENT.md
@@ -164,6 +164,22 @@ echo -n "mongodb+srv://user:pass@cluster.mongodb.net/dbname" | \
   --replication-policy="automatic"
 ```
 
+#### 4. Anthropic API Key (Optional - for the AI rule suggester)
+
+Required only when the operator UI is enabled and `LLM_PROVIDER=anthropic` (the default in the committed CI deploy). Skip if you're using Ollama or don't plan to use the AI rule suggester.
+
+```bash
+# For the Grove Foundry APIM gateway, the value is the gateway key you were
+# issued — not a raw Anthropic sk-... key. The app sends it as both the
+# x-api-key (Anthropic) and api-key (APIM) header, so one key works either way.
+echo -n "$GATEWAY_KEY" | \
+  gcloud secrets create anthropic-api-key \
+  --data-file=- \
+  --replication-policy="automatic"
+```
+
+The env var that points at this secret is `ANTHROPIC_API_KEY_SECRET_NAME=anthropic-api-key` (already set in `.github/workflows/ci.yml` and `env-cloudrun.yaml`). A missing key is non-fatal: the operator UI shows "not configured" and every other feature still works.
+
 ### Grant Cloud Run Access
 
 ```bash
@@ -185,6 +201,11 @@ gcloud secrets add-iam-policy-binding webhook-secret \
 gcloud secrets add-iam-policy-binding mongo-uri \
   --member="serviceAccount:${SERVICE_ACCOUNT}" \
   --role="roles/secretmanager.secretAccessor"
+
+# Only if using the AI rule suggester with LLM_PROVIDER=anthropic
+gcloud secrets add-iam-policy-binding anthropic-api-key \
+  --member="serviceAccount:${SERVICE_ACCOUNT}" \
+  --role="roles/secretmanager.secretAccessor"
 ```
 
 **Note:** Cloud Run uses the default compute service account by default. You can also create a dedicated service account for better security isolation.
@@ -322,11 +343,12 @@ services.LoadMongoURI(config)       // Loads from Secret Manager
 
 ### Pre-Deployment Checklist
 
-- [ ] Secrets created in Secret Manager
-- [ ] IAM permissions granted to Cloud Run service account
+- [ ] Secrets created in Secret Manager (`CODE_COPIER_PEM`, `webhook-secret`, `mongo-uri`, and `anthropic-api-key` if using the AI rule suggester)
+- [ ] IAM permissions granted to Cloud Run service account on each secret
 - [ ] `env-cloudrun.yaml` created and configured
 - [ ] `env-cloudrun.yaml` in `.gitignore`
 - [ ] `Dockerfile` exists in project root
+- [ ] (Operator UI) `OPERATOR_AUTH_REPO` points at a repo you own and can manage collaborators on — its permission list decides who gets operator vs writer access
 
 ### Deploy to Cloud Run
 
@@ -480,6 +502,16 @@ gcloud run services logs read github-copier --limit=50
 # ❌ "webhook signature verification failed"
 ```
 
+### Smoke-Test the Operator UI (if enabled)
+
+Only applicable when `OPERATOR_UI_ENABLED=true`:
+
+1. Open `https://<service-url>/operator/` in a browser.
+2. Generate a GitHub PAT with `repo` scope and paste it into the sign-in prompt.
+3. Confirm the user chip in the header shows your GitHub avatar and the correct role (`operator` if you're `admin`/`maintain` on `OPERATOR_AUTH_REPO`, `writer` if you're `write`/`triage`/`read`).
+4. Click the **System** tab → **AI settings** → **Refresh status**. You should see the provider connected (e.g. "Anthropic connected at https://grove-gateway-prod.azure-api.net/…").
+5. If AI settings shows "unreachable", either the Cloud Run service account wasn't granted access to the `anthropic-api-key` secret, or the deploy is pointing at a URL the gateway doesn't accept. Check the Cloud Run revision logs for `Anthropic API key not loaded` or a 401/403 from the gateway.
+
 ## Monitoring
 
 ### View Logs
diff --git a/docs/FAQ.md b/docs/FAQ.md
@@ -30,6 +30,48 @@ The GitHub copier is a GitHub app that automatically copies code examples and fi
 - Health and metrics endpoints
 - Slack notifications
 - Dry-run mode for testing
+- **[Operator UI](../README.md#operator-ui)** — Web dashboard at `/operator/` for replay, audit browsing, workflow inspection, and AI-assisted rule generation
+
+## Operator UI
+
+### What is the operator UI?
+
+A web dashboard served from `/operator/` when `OPERATOR_UI_ENABLED=true`. Five tabs:
+
+- **Overview** — live metrics, recent activity, health of dependent services
+- **Webhooks** — recent webhook traces with filter/search and one-click replay
+- **Audit** — searchable audit event history with a per-event drawer (trace + logs + replay)
+- **Workflows** — browse the loaded copier config; test path matches with the built-in file match tester
+- **System** — deployment metadata, AI settings, release tagging
+
+### Who can access the operator UI?
+
+Anyone with a GitHub PAT that has access to `OPERATOR_AUTH_REPO`. The user's permission on that repo determines their UI role:
+
+- `admin` / `maintain` → **operator**: full access including replay, release, AI settings
+- `write` / `triage` / `read` → **writer**: view audit, workflows, recent copies, run the AI rule suggester and file match tester, but no replay / release
+- No access → 401 Unauthorized
+
+`write` is deliberately mapped to **writer** (not operator) so typical docs contributors can't replay deliveries or cut releases just by having repo write access. Operator capability requires an explicit `admin` / `maintain` grant.
+
+### How does the AI rule suggester work?
+
+Paste a source file path and the target file path you want; optionally name the target repo. The server sends the pair plus a structured prompt to the configured LLM, parses the returned JSON, and runs the generated rule through the in-process pattern matcher to verify it actually produces your target from your source. If it doesn't match, the UI shows a "not verified" warning next to the YAML so you can review before copying it into your config.
+
+Two providers are supported:
+
+- **Anthropic** (default in Cloud Run) — calls the hosted Messages API. In this repo's deploy it routes through the Grove Foundry APIM gateway so no infrastructure needs to be stood up.
+- **Ollama** (local dev) — runs against a local model server. The UI can pull models, switch the active one, and delete models without a redeploy.
+
+To cap cost, the suggester is rate-limited to 30 requests/hour per authenticated user.
+
+### The AI settings panel says "not connected" — how do I fix it?
+
+Check the banner at startup — it prints the active `AI Provider`, `AI Model`, and `AI URL`. Then:
+
+- **Anthropic**: make sure `ANTHROPIC_API_KEY` (local) or `ANTHROPIC_API_KEY_SECRET_NAME` (Cloud Run) is set. In Cloud Run, the runtime service account also needs `roles/secretmanager.secretAccessor` on the secret.
+- **Ollama**: confirm `ollama serve` is running on the host at `LLM_BASE_URL` (default `http://localhost:11434`) and that you've pulled a model.
+- Use [`cmd/test-llm`](../cmd/test-llm/README.md) to exercise the full path outside the UI — it reports Ping, ListModels, and a real GenerateJSON call.
 
 ## Configuration
 
diff --git a/docs/LOCAL-TESTING.md b/docs/LOCAL-TESTING.md
@@ -343,6 +343,43 @@ AUDIT_DATABASE=code_copier_dev
 AUDIT_COLLECTION=audit_events
 ```
 
+### Optional (for Operator UI + AI rule suggester)
+
+```bash
+# Mount the operator dashboard at http://localhost:8080/operator/
+OPERATOR_UI_ENABLED=true
+OPERATOR_AUTH_REPO=your-org/some-repo     # your GitHub permission here decides your UI role
+OPERATOR_REPO_SLUG=your-org/some-repo     # optional; enables clickable audit-row deep links
+
+# AI rule suggester — pick ONE provider:
+#
+# Option A: Ollama (local, no cloud calls, no API key needed)
+#   1. Install Ollama: https://ollama.com/download
+#   2. Leave LLM_PROVIDER unset — it defaults to ollama with http://localhost:11434
+#   3. From the UI's System → AI settings panel, pull a model (e.g. qwen2.5-coder:7b)
+#
+# Option B: Anthropic via Grove Foundry APIM gateway
+LLM_PROVIDER=anthropic
+LLM_BASE_URL=https://grove-gateway-prod.azure-api.net/grove-foundry-prod/anthropic
+LLM_MODEL=claude-haiku-4-5
+ANTHROPIC_API_KEY=<your-gateway-key>      # never commit this; use a local-only env file
+```
+
+### Testing the Operator UI Locally
+
+1. Start the app with the env vars above. The startup banner will confirm `Operator UI: true` and show the configured auth repo, AI provider, model, and base URL.
+2. Open `http://localhost:8080/operator/` in a browser.
+3. Generate a [GitHub Personal Access Token](https://github.com/settings/tokens) with `repo` scope. Paste it into the sign-in prompt. The UI caches it in `localStorage` so you only paste once.
+4. If you own `OPERATOR_AUTH_REPO`, grant yourself `admin` for the operator role, or `read`/`write` for the writer role — the header chip will show which one you got.
+5. Smoke-test the LLM connection end-to-end with `cmd/test-llm` before hitting the UI:
+
+   ```bash
+   go build -o test-llm ./cmd/test-llm
+   ./test-llm -env .env.test
+   ```
+
+   A successful run pings the provider, lists models, and issues a real rule-suggester prompt. See [cmd/test-llm/README.md](../cmd/test-llm/README.md) for details.
+
 ## Troubleshooting
 
 ### Error: "A JSON web token could not be decoded" / "Failed to configure GitHub permissions"