feat: lazy daemon for warm model inference (~1-2s instead of 20-60s)

## Problem

Neural TTS backends (voxtream, qwen-native, qwen) spend most of their wall time (roughly 60-95% in the measurements below) loading models, not generating audio:

| Backend | Total time | Model loading | Actual inference |
|---------|-----------|--------------|-----------------|
| voxtream (CUDA) | 22s | ~18s | ~2s |
| voxtream (M2 Pro) | 8s warm | ~5s | ~2s |
| qwen-native (CPU) | 11m33s | ~11m | ~3s |

The VoXtream2 paper reports 74ms first-packet latency with the model already loaded in GPU RAM. Our overhead is Python startup + model loading on every call.

## Solution

Implement a lazy daemon (like `ollama serve`) that keeps models warm in memory:

```
First call:    vox -b voxtream "text"
               → daemon not running? start it, load model (~15s one-time)
               → generate audio (~1-2s)

Next calls:    vox -b voxtream "text"
               → daemon already warm
               → generate audio (~1-2s)

After idle:    daemon auto-stops after the idle timeout (default 5min, frees VRAM)
```

### Architecture

- Local HTTP or Unix socket server (`vox daemon`)
- PID file in config dir for lifecycle management
- Auto-start on first call to a heavy backend
- Auto-shutdown after configurable idle timeout (default 5min)
- Supports all heavy backends: voxtream, qwen-native, qwen
- `say` and `kokoro` bypass the daemon (already fast enough)
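
A minimal sketch of the lifecycle pieces listed above (PID file plus idle auto-shutdown), to make the proposal concrete. The names `IdleShutdown`, `write_pid_file`, and `daemon_is_running` are hypothetical and not part of vox today; the liveness check assumes a POSIX system.

```python
import os
import threading
from pathlib import Path


class IdleShutdown:
    """Calls `on_idle` once no request has arrived for `timeout` seconds."""

    def __init__(self, timeout, on_idle):
        self.timeout = timeout
        self.on_idle = on_idle
        self._lock = threading.Lock()
        self._timer = None

    def touch(self):
        """Restart the countdown; the daemon would call this on every request."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self.timeout, self.on_idle)
            self._timer.daemon = True  # do not keep the process alive on exit
            self._timer.start()

    def cancel(self):
        """Stop the countdown without firing `on_idle`."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None


def write_pid_file(path: Path) -> None:
    """Record the current process ID so the CLI can find the daemon."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(str(os.getpid()))


def daemon_is_running(path: Path) -> bool:
    """True if the PID file names a live process (POSIX only)."""
    try:
        pid = int(path.read_text())
        os.kill(pid, 0)  # signal 0 only checks that the process exists
    except (FileNotFoundError, ValueError, ProcessLookupError):
        return False  # no file, garbage content, or stale PID
    except PermissionError:
        return True  # process exists but belongs to another user
    return True
```

With the default from the list above, the daemon would create `IdleShutdown(300, shutdown_fn)` and call `touch()` at the start of each request. A stale PID file (daemon crashed) is treated as "not running", so the next call can start a fresh daemon.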

### API

```
POST /speak  { text, voice, lang, backend, ... } → 200 OK (audio played)
GET  /health → 200 OK { uptime, loaded_models, vram_usage }
POST /stop   → shutdown
```
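
To illustrate the three endpoints, here is a rough sketch using only the Python standard library. `load_model` and `synthesize` are placeholders for the real backend calls, `make_server` is a hypothetical helper, and `vram_usage` is left out of `/health` because how to measure it is backend-specific. This is one possible shape, not a settled design.

```python
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START = time.monotonic()
LOADED = {}  # backend name -> loaded model, kept warm between requests


def load_model(backend):
    """Placeholder: a real daemon would load the backend's weights here."""
    return object()


def synthesize(model, text):
    """Placeholder for the backend's generate-and-play call."""


class Handler(BaseHTTPRequestHandler):
    def _reply(self, status, body):
        data = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        if self.path == "/health":
            self._reply(200, {"uptime": time.monotonic() - START,
                              "loaded_models": sorted(LOADED)})
        else:
            self._reply(404, {"error": "not found"})

    def do_POST(self):
        length = int(self.headers.get("Content-Length") or 0)
        raw = self.rfile.read(length)
        if self.path == "/speak":
            try:
                req = json.loads(raw)
                backend, text = req["backend"], req["text"]
            except (ValueError, KeyError, TypeError):
                self._reply(400, {"error": "expected JSON with backend and text"})
                return
            if backend not in LOADED:  # slow path, paid once per backend
                LOADED[backend] = load_model(backend)
            synthesize(LOADED[backend], text)
            self._reply(200, {"ok": True})
        elif self.path == "/stop":
            self._reply(200, {"ok": True})
            # shutdown() blocks until the serve loop exits, so run it elsewhere
            threading.Thread(target=self.server.shutdown, daemon=True).start()
        else:
            self._reply(404, {"error": "not found"})

    def log_message(self, *args):
        pass  # keep the sketch quiet


def make_server(port, host="127.0.0.1"):
    """Single-threaded server: requests are handled one at a time."""
    return HTTPServer((host, port), Handler)
```

Running `make_server(8765).serve_forever()` would start it on a (hypothetical) port 8765. The single-threaded `HTTPServer` is deliberate here: it serializes requests, so two calls never load the same model or play audio at the same time.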

### CLI integration

```bash
vox daemon start           # Manual start (optional)
vox daemon stop            # Manual stop
vox daemon status          # Show loaded models, uptime, VRAM
vox config set daemon true # Enable auto-start (default: false)
```

When daemon mode is enabled, `vox -b voxtream "text"` transparently routes through the daemon instead of spawning a subprocess.
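
The routing could look roughly like the sketch below. Everything here is an assumption for discussion: the port, the function names, and the choice to fall back to the existing subprocess path if the daemon does not become healthy in time. `start_daemon` and `run_subprocess` are passed in as callables so the sketch stays independent of how vox actually spawns processes.

```python
import http.client
import json
import time

HEAVY_BACKENDS = {"voxtream", "qwen-native", "qwen"}  # from the Architecture list
DAEMON_ADDR = ("127.0.0.1", 8765)  # hypothetical address


def should_use_daemon(backend, daemon_enabled):
    """`say` and `kokoro` are fast enough and always bypass the daemon."""
    return daemon_enabled and backend in HEAVY_BACKENDS


def daemon_healthy(addr=DAEMON_ADDR, timeout=0.5):
    """True if GET /health answers 200 at `addr`."""
    try:
        conn = http.client.HTTPConnection(*addr, timeout=timeout)
        try:
            conn.request("GET", "/health")
            return conn.getresponse().status == 200
        finally:
            conn.close()
    except (OSError, http.client.HTTPException):
        return False


def speak(text, backend, daemon_enabled, start_daemon, run_subprocess,
          addr=DAEMON_ADDR, startup_timeout=60.0):
    """Route one request through the daemon, or through the old subprocess path."""
    if not should_use_daemon(backend, daemon_enabled):
        return run_subprocess(text, backend)
    if not daemon_healthy(addr):
        start_daemon()  # lazy start on first call to a heavy backend
        deadline = time.monotonic() + startup_timeout
        while not daemon_healthy(addr):
            if time.monotonic() >= deadline:
                return run_subprocess(text, backend)  # assumed fallback
            time.sleep(0.2)
    conn = http.client.HTTPConnection(*addr, timeout=120)
    try:
        body = json.dumps({"text": text, "backend": backend})
        conn.request("POST", "/speak", body=body,
                     headers={"Content-Type": "application/json"})
        return conn.getresponse().status == 200
    finally:
        conn.close()
```

Falling back to the subprocess path keeps the CLI working when the daemon cannot start (for example, port already taken), at the cost of the slow cold-load time for that call.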

### Expected performance

| Backend | Without daemon | With daemon |
|---------|---------------|-------------|
| voxtream (CUDA) | 22s | **~1-2s** |
| voxtream (M2 Pro) | 8s | **~2-3s** |
| qwen-native (Metal) | 30s | **~2-3s** |

### References

- ollama lazy model loading: https://github.com/ollama/ollama
- VoXtream2 paper: 74ms first-packet on RTX 3090 (model pre-loaded)
- VoXtream2 includes a WebSocket server (`voxtream-server`) that could be reused
