No isolated test environment for infrastructure changes #217

## What's wrong

There is currently no way to validate infrastructure changes (Dockerfiles, compose configs, volume mounts, cross-service networking) before deploying. PR #216 needed 4 iterative fixes on the live server, all for runtime issues invisible to unit tests and config validation:

1. `.ssh` dir owned by root (container runs as uid 1000)
2. SSH ignores the `HOME` env var and resolves the home directory from `/etc/passwd`
3. No git user identity configured in ephemeral container
4. `git reset --hard` wiping files written before the reset
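
Each of the failures above is observable only inside a running container. As a rough sketch, probes along these lines could surface the first three during a smoke test. The function names are illustrative (they are not existing scripts in the repo), and `stat -c` assumes GNU coreutils in the image.

```shell
# Hypothetical runtime probes, meant to run inside the container under test.

# Print the numeric owner of a path (GNU stat syntax; an assumption about the image).
dir_owner_uid() {
  stat -c '%u' "$1"
}

# Succeed only if the path is owned by the current user.
owned_by_me() {
  [ "$(dir_owner_uid "$1")" = "$(id -u)" ]
}

# Print the home directory recorded in the passwd database for the current uid.
# This is the value SSH resolves, regardless of $HOME.
passwd_home() {
  getent passwd "$(id -u)" | cut -d: -f6
}

# Succeed only if git has a committer identity configured.
git_identity_set() {
  git config --get user.name >/dev/null && git config --get user.email >/dev/null
}
```

A combined check might then read `owned_by_me "$(passwd_home)/.ssh" && git_identity_set`.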

The current workaround (SSH into the server, check out a branch, build, and test) is risky and not reproducible.

## What done looks like

A `mise run test:smoke` command validates that changed stacks **actually work** before merging. Specifically:

- `scripts/smoke-test.sh <stack>` builds, starts, health-checks, and tears down a stack in an isolated Docker Compose project (`-p test-<stack>`)
- Test stacks join a shared `test-homelab` Docker network for cross-stack communication via DNS (replacing Tailscale IP references)
- A `compose.test.yaml` override per stack: no host port binds, dummy env vars, shared test network
- A GitHub Actions (GHA) workflow runs smoke tests on PRs touching `stacks/{agents,observability,knowledge}/`
- ADR-018 documents the strategy and limitations

Focused on 3 stacks under active development: **agents, observability, knowledge**. Pre-built stacks (HA, MQTT, CrowdSec, flight-tracker) are out of scope.
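
A minimal sketch of what `scripts/smoke-test.sh` could look like is shown here. It makes several assumptions that the issue does not confirm: the base file is `stacks/<stack>/compose.yaml`, every service defines a healthcheck (so `up --wait` is meaningful), and `compose.test.yaml` declares `test-homelab` as an external network. The `DOCKER` variable is a hypothetical hook for dry runs.

```shell
#!/bin/sh
# Sketch only; file layout and healthchecks are assumptions, see the lead-in.
set -eu

DOCKER="${DOCKER:-docker}"   # set DOCKER=echo to print commands instead of running them
TEST_NETWORK="test-homelab"

# Run `docker compose` for one stack inside its isolated test project.
compose() {
  _stack="$1"; shift
  "$DOCKER" compose -p "test-${_stack}" \
    -f "stacks/${_stack}/compose.yaml" \
    -f "stacks/${_stack}/compose.test.yaml" "$@"
}

# Build, start, wait for health, and always tear down.
# The body runs in a subshell so the EXIT trap fires on every exit path.
smoke_test() (
  stack="$1"
  trap 'compose "$stack" down --volumes --remove-orphans' EXIT

  # Create the shared network on first use; it is left in place for other stacks.
  "$DOCKER" network inspect "$TEST_NETWORK" >/dev/null 2>&1 ||
    "$DOCKER" network create "$TEST_NETWORK"

  compose "$stack" build
  # --wait blocks until services are running/healthy or the timeout expires.
  compose "$stack" up --wait --wait-timeout 120
)

if [ "$#" -gt 0 ]; then
  smoke_test "$1"
fi
```

Running `DOCKER=echo scripts/smoke-test.sh agents` would print the build, up, and down commands without touching the Docker daemon, which is a cheap way to review the orchestration before trusting it on the server.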

## What the agent can't discover

**Cross-stack dependencies (from analysis):**
- Alloy config has exactly 2 Tailscale IP references that need overriding for test mode:
  - `100.100.146.119:6060` → CrowdSec metrics scrape target
  - `100.100.146.119:8585` → Agent service metrics scrape target
- These need a `config.test.alloy` that uses Docker DNS names on the shared `test-homelab` network
- All other service-to-service communication is internal to each compose project (Grafana→Prometheus, knowledge ingest→postgres)
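
One possible way to produce and verify `config.test.alloy` is to derive it from the production config by rewriting the two scrape targets. The sketch below assumes that approach; the DNS names `crowdsec` and `agents-api` are placeholders, since the real names depend on the service names in each stack (and CrowdSec itself is out of scope for test stacks).

```shell
# Hypothetical DNS targets on the test-homelab network.
CROWDSEC_TARGET="${CROWDSEC_TARGET:-crowdsec:6060}"
AGENTS_TARGET="${AGENTS_TARGET:-agents-api:8585}"

TAILSCALE_IP='100.100.146.119'        # literal form, used with grep -F
TAILSCALE_IP_RE='100\.100\.146\.119'  # regex-escaped form, used with sed

# Read an Alloy config on stdin and write it to stdout with both
# Tailscale scrape targets replaced by Docker DNS names.
rewrite_alloy_targets() {
  sed -e "s/${TAILSCALE_IP_RE}:6060/${CROWDSEC_TARGET}/g" \
      -e "s/${TAILSCALE_IP_RE}:8585/${AGENTS_TARGET}/g"
}

# Fail (and print the offending lines) if the Tailscale IP survives in a file.
assert_no_tailscale_ip() {
  if grep -n -F "$TAILSCALE_IP" "$1"; then
    echo "Tailscale IP still referenced in $1" >&2
    return 1
  fi
}
```

Assuming the production file lives at `stacks/observability/config.alloy`, usage would be `rewrite_alloy_targets < stacks/observability/config.alloy > stacks/observability/config.test.alloy`, followed by `assert_no_tailscale_ip stacks/observability/config.test.alloy`. The assertion is useful even if the test config is written by hand.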

**Docker Compose project isolation:**
- `-p test-<stack>` automatically prefixes volume names (e.g., `test-agents_repo-cache`), preventing data conflicts, provided no volume sets an explicit `name:` or is marked `external`
- Port conflicts cannot occur because test stacks do not bind host ports at all
- Container names are prefixed too, so there are no naming collisions unless a service hard-codes `container_name`
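
The "no host ports" guarantee depends on the override actually removing the base file's port mappings. Compose concatenates `ports` entries when merging files, so an override typically needs something like `ports: !reset []` (supported in recent Compose releases) rather than simply omitting the key. A small guard, sketched here under the assumption that `docker compose config` renders host bindings under a `published:` key, can verify the merged result:

```shell
# Read merged `docker compose config` output on stdin.
# Succeed only if no service publishes a host port.
no_host_ports() {
  if grep -n 'published:'; then
    echo "host port binding found in merged config" >&2
    return 1
  fi
}
```

With the assumed base file name `compose.yaml`, it could be invoked as `docker compose -p test-agents -f stacks/agents/compose.yaml -f stacks/agents/compose.test.yaml config | no_host_ports`.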

**Constraints:**
- Agent worker spawning shares the Docker daemon with production, so smoke tests should verify that the API starts and responds to health checks, not exercise the full worker lifecycle
- The GHA workflow must be human-authored (agents cannot modify `.github/workflows/`)
- Server has ~12GB free RAM; production containers use ~2GB total. A full parallel test stack is well within budget.
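
For the "API starts and responds to health checks" constraint, a bounded polling helper is usually enough. This is a generic sketch, not part of the existing repo; the variables are plain globals because POSIX `sh` has no `local`.

```shell
# Retry a command until it succeeds or the attempt budget is spent.
# usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep between attempts, but not after the final failure.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

Because test stacks bind no host ports, the probe has to run inside the test project, for example `wait_until 30 2 docker compose -p test-agents exec -T api curl -fsS http://localhost:8585/health`. The service name `api`, the `/health` path, the use of port 8585 for health, and the presence of `curl` in the image are all assumptions to adjust.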

## What must not break

- Production stacks must not be affected (no shared ports, no shared volumes, no shared container names)
- `mise run ci` stays fast (~60s) — smoke tests are a separate workflow/task
- Deploy workflow is unchanged
- Teardown must be bulletproof (a `trap` in the script, an `if: always()` step in GHA), because leftover test containers waste resources
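
Teardown can also be verified rather than assumed. Compose labels every container it creates with `com.docker.compose.project`, so a post-run check along these lines could flag leftovers. The function names and the `DOCKER` override are hypothetical.

```shell
DOCKER="${DOCKER:-docker}"

# List IDs of containers (running or stopped) that belong to a test project.
leftover_containers() {
  "$DOCKER" ps --all --quiet \
    --filter "label=com.docker.compose.project=test-$1"
}

# Fail if any container from the test project still exists.
assert_torn_down() {
  ids="$(leftover_containers "$1")"
  if [ -n "$ids" ]; then
    echo "leftover containers for test-$1: $ids" >&2
    return 1
  fi
}
```

Calling `assert_torn_down agents` at the end of the script, and again from the always-run GHA step, turns a silent resource leak into a visible failure.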

## Deliverables

1. `docs/decisions/018-isolated-test-environments.md` — ADR
2. `stacks/agents/compose.test.yaml` — test overrides
3. `stacks/observability/compose.test.yaml` + `config.test.alloy` — test overrides with DNS-based scrape targets
4. `stacks/knowledge/compose.test.yaml` — test overrides
5. `scripts/smoke-test.sh` — build/start/health-check/teardown orchestrator
6. `mise.toml` — `test:smoke` and `test:smoke:<stack>` tasks
7. `.github/workflows/smoke-test.yaml` — PR-triggered smoke tests (human-authored, see below)

## Out of scope (human follow-up)

- `.github/workflows/smoke-test.yaml` — agents cannot modify workflow files. Write this after the smoke-test script is validated.
- Agent self-validation (future: agents run `scripts/smoke-test.sh` from their worktree before creating PRs)
- Home Assistant, MQTT, CrowdSec, flight-tracker smoke tests (pre-built images, rarely break)
- Cloudflare tunnel testing (requires real token)
- Full end-to-end tests (SSH push to GitHub, Copilot CLI calls)
