Commit 83fb668
lucasliu
release: v1.0.5
Prefix cache hardening + E2E test fixes.
Prefix cache (6 commits):
- async write + async eviction in SSDCacheStore (no more 100-500ms tail
latency stalls after generation)
- safetensors header-only reader replaces full-file scan at startup
(eliminates multi-GB I/O at model init)
- VLM streaming/non-streaming paths skip prefix cache fetch+store
(was wasted SSD I/O — VLM never used the result)
- pre-flight RotatingKVCache probe avoids loading SSD blocks for
sliding-window models (Gemma family) that can't use them
- ServerConfig.prefixCacheEnabled kill switch wired through to worker
- TTFT benchmark gated on NOVAMLX_BENCH=1
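
A minimal sketch of the header-only read idea from the safetensors bullet above, assuming the published safetensors layout (8-byte little-endian header length, then a JSON header, then tensor data). The project's actual reader is not shown on this page; `read_safetensors_header` and `write_demo_file` are hypothetical names used only for illustration.

```python
import json
import os
import struct
import tempfile


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading tensor data."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        prefix = f.read(8)
        if len(prefix) != 8:
            raise ValueError("file too short for a safetensors length prefix")
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", prefix)
        if header_len > size - 8:
            raise ValueError("header length exceeds file size")
        # Read only the header bytes; the tensor payload is never touched.
        raw = f.read(header_len)
    return json.loads(raw.decode("utf-8"))


def write_demo_file(path):
    """Write a tiny one-tensor file in the same layout (hypothetical helper)."""
    header = {"w": {"dtype": "F32", "shape": [2], "data_offsets": [0, 8]}}
    blob = json.dumps(header).encode("utf-8")
    payload = struct.pack("<2f", 1.0, 2.0)  # two float32 values, 8 bytes
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(blob)))
        f.write(blob)
        f.write(payload)


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.safetensors")
    write_demo_file(demo_path)
    for name, info in read_safetensors_header(demo_path).items():
        print(name, info["dtype"], info["shape"])
```

Because the read stops after the header, the cost is governed by the header size rather than the multi-GB tensor payload.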
E2E tests:
- skip VLMs in text-only core API suite
- accept reasoning-only output from Harmony (gpt-oss) models
- bump smoke test max_tokens 50 -> 150 for thinking-channel budget
1 parent 0dd90b4 commit 83fb668
1 file changed
Lines changed: 1 addition & 1 deletion
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 3 | 3 | |
| 4 | 4 | |
| 5 | 5 | |
| 6 | | - |
| | 6 | + |
| 7 | 7 | |
| 8 | 8 | |
| 9 | 9 | |