feat: lazy daemon for warm model inference (~1-2s instead of 20-60s) #30

@pszymkowiak

Problem

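Neural TTS backends (voxtream, qwen-native, qwen) spend most of their wall time on startup and model loading, not generating audio. In the measurements below, inference is roughly 9-25% of the total on voxtream and under 1% on qwen-native (CPU):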

Backend              Total time   Model loading   Actual inference
voxtream (CUDA)      22s          ~18s            ~2s
voxtream (M2 Pro)    8s warm      ~5s             ~2s
qwen-native (CPU)    11m33s       ~11m            ~3s

The VoXtream2 paper reports 74ms first-packet latency with the model already loaded in GPU RAM. Our overhead is Python startup + model loading on every call.
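
As a quick check on the table above, the share of each run that is not inference can be computed directly. This is only a sketch over the approximate figures quoted in this issue; the `MEASUREMENTS` dict and `time_shares` helper are hypothetical names, not part of vox.

```python
# Figures copied from the table in this issue, in seconds:
# (total, model loading, inference). All values are approximate.
MEASUREMENTS = {
    "voxtream (CUDA)": (22, 18, 2),
    "voxtream (M2 Pro)": (8, 5, 2),
    "qwen-native (CPU)": (11 * 60 + 33, 11 * 60, 3),
}


def time_shares(total_s: float, loading_s: float, inference_s: float) -> dict:
    """Return each phase as a fraction of total wall time."""
    return {
        "loading": loading_s / total_s,
        "inference": inference_s / total_s,
        # Everything a warm daemon could in principle avoid repeating.
        "non_inference": (total_s - inference_s) / total_s,
    }


if __name__ == "__main__":
    for name, row in MEASUREMENTS.items():
        shares = time_shares(*row)
        print(
            f"{name}: loading {shares['loading']:.1%}, "
            f"non-inference {shares['non_inference']:.1%}"
        )
```

With these numbers, model loading alone is roughly 82%, 63% and 95% of the total, and everything except inference is roughly 91%, 75% and over 99%.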

Solution

Implement a lazy daemon (like ollama serve) that keeps models warm in memory:

First call:    vox -b voxtream "text"
               → daemon not running? start it, load model (~15s one-time)
               → generate audio (~1-2s)

Next calls:    vox -b voxtream "text"
               → daemon already warm
               → generate audio (~1-2s)

After idle:    daemon auto-stops after the idle timeout (default 5min, configurable) and frees VRAM
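
The flow above could be sketched on the daemon side roughly as follows. This is a minimal illustration, not the proposed implementation: `WarmModelCache`, its `loader` callback and the injectable `clock` are all hypothetical names, and the real model-loading call is left to the caller.

```python
import threading
import time
from typing import Any, Callable, Dict, List


class WarmModelCache:
    """Loads each backend's model once, keeps it in memory, tracks idleness."""

    def __init__(
        self,
        loader: Callable[[str], Any],
        idle_timeout_s: float = 300.0,  # 5 min, the default proposed here
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._models: Dict[str, Any] = {}
        self._last_used = clock()
        self._lock = threading.Lock()

    def get(self, backend: str) -> Any:
        """Return the warm model, loading it on first use (the slow path)."""
        # Holding the lock during the load is a simplification: it blocks
        # other callers, but also guarantees a model is never loaded twice.
        with self._lock:
            if backend not in self._models:
                self._models[backend] = self._loader(backend)
            self._last_used = self._clock()
            return self._models[backend]

    def is_idle(self) -> bool:
        """True once no request has arrived for idle_timeout_s seconds."""
        with self._lock:
            return self._clock() - self._last_used >= self._idle_timeout_s

    def loaded(self) -> List[str]:
        """Names of the backends whose models are currently in memory."""
        with self._lock:
            return sorted(self._models)

    def unload_all(self) -> None:
        """Drop all references so the memory (and VRAM) can be reclaimed."""
        with self._lock:
            self._models.clear()
```

A background thread in the daemon could poll `is_idle()` every few seconds and call `unload_all()` before exiting. Passing the clock in as a parameter keeps the timeout logic testable without sleeping.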

Architecture

  • Local HTTP or Unix socket server (vox daemon)
  • PID file in config dir for lifecycle management
  • Auto-start on first call to a heavy backend
  • Auto-shutdown after configurable idle timeout (default 5min)
  • Supports all heavy backends: voxtream, qwen-native, qwen
  • say and kokoro bypass the daemon (already fast enough)
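
One way the PID-file lifecycle check might look, as a sketch only. The function names and the PID file location are assumptions (the issue only says "config dir"), and the liveness probe via `os.kill(pid, 0)` is POSIX-only; on Windows that call has different semantics and a different check would be needed.

```python
import os
from pathlib import Path
from typing import Optional


def write_pid_file(pid_path: Path) -> None:
    """Record the current process ID so later CLI calls can find the daemon."""
    pid_path.parent.mkdir(parents=True, exist_ok=True)
    pid_path.write_text(str(os.getpid()))


def read_live_pid(pid_path: Path) -> Optional[int]:
    """Return the daemon PID if the file exists and the process is alive."""
    try:
        pid = int(pid_path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None  # no PID file, or its contents are not a number
    if pid <= 0:
        return None  # 0 and negatives address process groups, not a daemon
    try:
        os.kill(pid, 0)  # signal 0 sends nothing, it only checks existence
    except ProcessLookupError:
        # Stale file left behind by a crashed daemon: clean it up.
        pid_path.unlink(missing_ok=True)
        return None
    except PermissionError:
        return pid  # process exists but belongs to another user
    return pid
```

The auto-start path would call `read_live_pid()` first and only spawn a new daemon when it returns `None`. A PID can be reused by an unrelated process, so a real implementation would probably confirm with `GET /health` as well.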

API

POST /speak  { text, voice, lang, backend, ... } → 200 OK (audio played)
GET  /health → 200 OK { uptime, loaded_models, vram_usage }
POST /stop   → shutdown
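
To make the three endpoints concrete, here is a rough sketch using only Python's standard `http.server`. It assumes the HTTP option rather than a Unix socket, JSON request bodies, and a stubbed `synthesize_and_play` in place of the real backend call. Those choices, the response bodies and the helper names are all illustrative, and `vram_usage` is returned as `null` because nothing is measured here.

```python
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

START_TIME = time.monotonic()
LOADED_MODELS = set()  # backend names whose models are in memory


def synthesize_and_play(payload: dict) -> None:
    """Stub for the real backend call; only records the backend as loaded."""
    LOADED_MODELS.add(payload["backend"])


class DaemonHandler(BaseHTTPRequestHandler):
    def _send_json(self, status: int, body: dict) -> None:
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self) -> None:
        if self.path == "/health":
            self._send_json(200, {
                "uptime": time.monotonic() - START_TIME,
                "loaded_models": sorted(LOADED_MODELS),
                "vram_usage": None,  # not measured in this sketch
            })
        else:
            self._send_json(404, {"error": "not found"})

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        raw = self.rfile.read(length) if length else b""
        if self.path == "/speak":
            try:
                payload = json.loads(raw)
                if not isinstance(payload, dict):
                    raise ValueError("body must be a JSON object")
                if "text" not in payload or "backend" not in payload:
                    raise ValueError("missing text or backend")
            except ValueError as exc:  # also covers malformed JSON
                self._send_json(400, {"error": str(exc)})
                return
            synthesize_and_play(payload)
            self._send_json(200, {"status": "ok"})
        elif self.path == "/stop":
            self._send_json(200, {"status": "stopping"})
            # shutdown() waits for serve_forever() to exit, so run it on a
            # separate thread instead of blocking this request handler.
            threading.Thread(target=self.server.shutdown, daemon=True).start()
        else:
            self._send_json(404, {"error": "not found"})

    def log_message(self, format: str, *args) -> None:
        pass  # keep the daemon quiet; a real one would log to a file


def make_server(port: int = 0) -> ThreadingHTTPServer:
    """Bind to loopback only; port 0 lets the OS pick a free port."""
    return ThreadingHTTPServer(("127.0.0.1", port), DaemonHandler)


if __name__ == "__main__":
    server = make_server(0)
    print("listening on", server.server_address)
    server.serve_forever()
```

Binding to `127.0.0.1` keeps the daemon unreachable from other machines, which matters because `/stop` and `/speak` are unauthenticated in this sketch.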

CLI integration

vox daemon start           # Manual start (optional)
vox daemon stop            # Manual stop
vox daemon status          # Show loaded models, uptime, VRAM
vox config set daemon true # Enable auto-start (default: false)

When daemon mode is enabled, vox -b voxtream "text" transparently routes through the daemon instead of spawning a subprocess.
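
The routing rule described above could be expressed as a small pure function. This is a sketch: `HEAVY_BACKENDS` mirrors the backend list in this issue, while `choose_route` and the `"daemon"` / `"direct"` labels are hypothetical.

```python
# Backends whose model load dominates the run, per the Architecture list.
HEAVY_BACKENDS = frozenset({"voxtream", "qwen-native", "qwen"})


def choose_route(backend: str, daemon_enabled: bool) -> str:
    """Decide how a `vox -b <backend> "text"` call should be executed.

    Returns "daemon" when the request should go to the warm daemon, and
    "direct" when it should run the way it does today.
    """
    if daemon_enabled and backend in HEAVY_BACKENDS:
        return "daemon"
    # Light backends (say, kokoro) always bypass the daemon, and heavy ones
    # do too while the `daemon` config option is left at its default (false).
    return "direct"
```

For example, `choose_route("voxtream", True)` gives `"daemon"`, while `choose_route("kokoro", True)` and `choose_route("voxtream", False)` both give `"direct"`. Keeping the decision free of I/O makes it easy to unit-test separately from daemon startup.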

Expected performance

Backend               Without daemon   With daemon
voxtream (CUDA)       22s              ~1-2s
voxtream (M2 Pro)     8s               ~2-3s
qwen-native (Metal)   30s              ~2-3s
