release: v1.0.5

lucasliu · lucasliu · commit 83fb6684108b · 2026-05-05T10:54:55.000-04:00
Prefix cache hardening + E2E test fixes.

Prefix cache (6 commits):
- async write + async eviction in SSDCacheStore (no more 100-500ms tail
  latency stalls after generation)
- safetensors header-only reader replaces full-file scan at startup
  (eliminates multi-GB I/O at model init)
- VLM streaming/non-streaming paths skip prefix cache fetch+store
  (previously wasted SSD I/O: the VLM path never used the result)
- pre-flight RotatingKVCache probe avoids loading SSD blocks for
  sliding-window models (Gemma family) that can't use them
- ServerConfig.prefixCacheEnabled kill switch wired through to worker
- TTFT benchmark gated on NOVAMLX_BENCH=1
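
Note on the header-only reader: a safetensors file begins with an 8-byte
little-endian unsigned length N, followed by N bytes of JSON describing each
tensor (dtype, shape, data_offsets). Reading only those bytes is enough to
enumerate tensors without touching the weight data. The sketch below
illustrates the idea; it is not the code from this commit, and the name
readSafetensorsHeader and the error type are hypothetical.

```swift
import Foundation

// Hypothetical error type for this sketch only.
enum SafetensorsHeaderError: Error {
    case truncated
    case invalidJSON
}

/// Reads only the JSON header of a safetensors file, leaving the
/// tensor payload untouched on disk.
func readSafetensorsHeader(at url: URL) throws -> [String: Any] {
    let handle = try FileHandle(forReadingFrom: url)
    defer { try? handle.close() }

    // First 8 bytes: header length as an unsigned little-endian 64-bit integer.
    guard let prefix = try handle.read(upToCount: 8), prefix.count == 8 else {
        throw SafetensorsHeaderError.truncated
    }
    var length: UInt64 = 0
    for (i, byte) in prefix.enumerated() {
        length |= UInt64(byte) << (8 * UInt64(i))
    }

    // Next `length` bytes: the JSON header itself.
    guard let count = Int(exactly: length),
          let body = try handle.read(upToCount: count),
          body.count == count else {
        throw SafetensorsHeaderError.truncated
    }
    guard let header = try JSONSerialization.jsonObject(with: body) as? [String: Any] else {
        throw SafetensorsHeaderError.invalidJSON
    }
    return header
}
```

Total I/O is 8 + N bytes per file regardless of how large the weights are,
which is where the startup saving would come from. A production reader would
likely also cap N to a sane maximum before allocating.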

E2E tests:
- skip VLMs in text-only core API suite
- accept reasoning-only output from Harmony (gpt-oss) models
- bump smoke test max_tokens 50 -> 150 for thinking-channel budget
diff --git a/Sources/NovaMLXCore/Types.swift b/Sources/NovaMLXCore/Types.swift
@@ -3,7 +3,7 @@ import Logging
 
 public enum NovaMLX {}
 
-public let version = "1.0.0"
+public let version = "1.0.5"
 
 public var buildTimestamp: String {
     guard let execURL = Bundle.main.executableURL,