Problem
Neural TTS backends (voxtream, qwen-native, qwen) spend most of their wall time (roughly 60% to 95% in the measurements below) loading models, not generating audio:
| Backend | Total time | Model loading | Actual inference |
|---|---|---|---|
| voxtream (CUDA) | 22s | ~18s | ~2s |
| voxtream (M2 Pro) | 8s warm | ~5s | ~2s |
| qwen-native (CPU) | 11m33s | ~11m | ~3s |
The VoXtream2 paper reports 74ms first-packet latency with the model already loaded in GPU RAM. Our overhead is Python startup + model loading on every call.
Solution
Implement a lazy daemon (like ollama serve) that keeps models warm in memory:
First call: vox -b voxtream "text"
  → daemon not running? start it, load model (~15s one-time)
  → generate audio (~1-2s)
Next calls: vox -b voxtream "text"
  → daemon already warm
  → generate audio (~1-2s)
After idle: daemon auto-stops after 5-10min (frees VRAM)
Architecture
- Local HTTP or Unix socket server (vox daemon)
- PID file in config dir for lifecycle management
- Auto-start on first call to a heavy backend
- Auto-shutdown after configurable idle timeout (default 5min)
- Supports all heavy backends: voxtream, qwen-native, qwen
- say and kokoro bypass the daemon (already fast enough)
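
The PID file and idle-timeout items above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the names `IdleShutdown`, `write_pid_file`, and `daemon_alive` are hypothetical, and the liveness check relies on POSIX signal semantics.

```python
import os
import threading
import time
from pathlib import Path


class IdleShutdown:
    """Calls `on_idle` once no activity has been recorded for `timeout` seconds."""

    def __init__(self, timeout, on_idle, clock=time.monotonic):
        self.timeout = timeout
        self.on_idle = on_idle
        self.clock = clock
        self._last = clock()
        self._lock = threading.Lock()

    def touch(self):
        """Record activity; intended to be called at the start of every request."""
        with self._lock:
            self._last = self.clock()

    def expired(self):
        with self._lock:
            return self.clock() - self._last >= self.timeout

    def watch(self, poll=1.0):
        """Block until the idle timeout elapses, then fire the callback."""
        while not self.expired():
            time.sleep(poll)
        self.on_idle()


def write_pid_file(path):
    """Write the current process ID, creating the config dir if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(str(os.getpid()))


def daemon_alive(path):
    """True if the PID file names a process that still exists (POSIX only)."""
    try:
        pid = int(Path(path).read_text())
        os.kill(pid, 0)  # signal 0 checks existence without sending a signal
    except (FileNotFoundError, ValueError, ProcessLookupError):
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True
```

In this sketch the daemon would run `IdleShutdown(300, stop_server).watch()` in a background thread (300 seconds matching the 5min default) and call `touch()` on each request; `stop_server` is a stand-in for whatever shuts the server down and frees the models.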
API
POST /speak { text, voice, lang, backend, ... } → 200 OK (audio played)
GET /health → 200 OK { uptime, loaded_models, vram_usage }
POST /stop → shutdown
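
As a rough sketch of how these three endpoints might look, the following uses only the Python standard library over loopback TCP (a Unix socket would need a different server class). `load_model` and `speak` are hypothetical placeholders for the backend-specific code, and `vram_usage` is left out of the health payload because measuring it depends on the backend.

```python
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

START = time.monotonic()
LOADED = {}  # backend name -> model object, kept warm between calls
LOAD_LOCK = threading.Lock()


def load_model(backend):
    """Hypothetical placeholder for the expensive load step."""
    return object()


def speak(model, text):
    """Hypothetical placeholder for synthesis and playback."""


class Handler(BaseHTTPRequestHandler):
    def _reply(self, status, body):
        data = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        if self.path == "/health":
            self._reply(200, {"uptime": time.monotonic() - START,
                              "loaded_models": sorted(LOADED)})
        else:
            self._reply(404, {"error": "not found"})

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        req = json.loads(self.rfile.read(length) or b"{}")
        if self.path == "/speak":
            backend = req.get("backend", "voxtream")
            with LOAD_LOCK:  # load each backend at most once
                if backend not in LOADED:
                    LOADED[backend] = load_model(backend)
            speak(LOADED[backend], req.get("text", ""))
            self._reply(200, {"ok": True})
        elif self.path == "/stop":
            self._reply(200, {"ok": True})
            # shutdown() blocks until serve_forever() returns, so run it
            # in its own thread instead of inside the request handler.
            threading.Thread(target=self.server.shutdown).start()
        else:
            self._reply(404, {"error": "not found"})

    def log_message(self, *args):
        pass  # keep the sketch quiet


def make_server(port=0):
    """Bind to loopback only; port 0 lets the OS pick a free port."""
    return ThreadingHTTPServer(("127.0.0.1", port), Handler)
```

Calling `make_server(port).serve_forever()` would run the daemon loop until a `POST /stop` arrives. The lazy load inside `/speak` is what makes only the first request pay the model-loading cost.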
CLI integration
vox daemon start # Manual start (optional)
vox daemon stop # Manual stop
vox daemon status # Show loaded models, uptime, VRAM
vox config set daemon true # Enable auto-start (default: false)
When daemon mode is enabled, vox -b voxtream "text" transparently routes through the daemon instead of spawning a subprocess.
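
The routing decision and auto-start could look something like the sketch below. It is an illustration under assumptions: `uses_daemon` and `ensure_daemon` are hypothetical names, and the health check and process launch are passed in as callables (`ping`, `spawn`) so the logic does not depend on a particular transport or on how the daemon is actually started.

```python
import time

# Heavy backends named in this proposal; say and kokoro are not included.
HEAVY_BACKENDS = {"voxtream", "qwen-native", "qwen"}


def uses_daemon(backend, daemon_enabled):
    """Only heavy backends go through the daemon, and only when enabled."""
    return daemon_enabled and backend in HEAVY_BACKENDS


def ensure_daemon(ping, spawn, timeout=60.0, poll=0.2, sleep=time.sleep):
    """Start the daemon if needed and wait until it answers.

    `ping()` returns True when GET /health succeeds; `spawn()` launches
    the daemon process. Returns True if this call started a daemon.
    """
    if ping():
        return False  # already warm, nothing to do
    spawn()
    waited = 0.0
    while not ping():
        if waited >= timeout:
            raise TimeoutError("daemon did not become healthy in time")
        sleep(poll)
        waited += poll
    return True
```

With this shape, the CLI would check `uses_daemon(backend, daemon_enabled)` first, fall back to the existing subprocess path when it is false, and otherwise call `ensure_daemon(...)` before sending `POST /speak`.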
Expected performance
| Backend | Without daemon | With daemon |
|---|---|---|
| voxtream (CUDA) | 22s | ~1-2s |
| voxtream (M2 Pro) | 8s | ~2-3s |
| qwen-native (Metal) | 30s | ~2-3s |
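
For a sense of scale, the expected speedups implied by the table can be derived directly from its numbers. The snippet below only does that arithmetic; the inputs are the approximate values from the table, so the results are estimates rather than measurements.

```python
# (without_daemon_s, with_daemon_low_s, with_daemon_high_s), from the table above
EXPECTED = {
    "voxtream (CUDA)": (22, 1, 2),
    "voxtream (M2 Pro)": (8, 2, 3),
    "qwen-native (Metal)": (30, 2, 3),
}


def speedup_range(without, low, high):
    """Return (smallest, largest) speedup factor for a with-daemon time range."""
    return without / high, without / low


for name, row in EXPECTED.items():
    smallest, largest = speedup_range(*row)
    print(f"{name}: {smallest:.1f}x to {largest:.1f}x faster")
```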
References
voxtream-server) that could be reused