
Fix ArraysCache missing is_trimmable/trim for hybrid model prompt cache#1254

Open
EagerofLight wants to merge 1 commit into ml-explore:main from EagerofLight:fix/arrays-cache-trimmable

Conversation

@EagerofLight

Problem

Hybrid models (Qwen3.5, Qwen3.6, Qwen3-Next) mix ArraysCache (DeltaNet recurrent layers)
with KVCache (attention layers). ArraysCache inherits is_trimmable() → False from
_BaseCache and has no trim() method, so can_trim_prompt_cache() returns False for
the entire hybrid cache.

This silently breaks:

  • Prefix cache reuse in mlx_lm.server: fetch_nearest_cache() skips the trim path
  • LRU prefix dedup: insert_cache() can't evict redundant prefixes
  • Speculative decoding: generate.py raises ValueError on non-trimmable caches

Root cause

```python
# cache.py — _BaseCache default
def is_trimmable(self):
    return False  # ArraysCache inherits this

# cache.py:88 — gate check
def can_trim_prompt_cache(cache):
    return all(c.is_trimmable() for c in cache)  # one False → entire cache rejected
```
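The gate behavior is easy to reproduce in isolation. This is a minimal standalone sketch with stub classes, not the real mlx-lm implementation; only the method names and the all() gate come from the snippet above.

```python
# Stub caches illustrating how one non-trimmable layer rejects the whole cache.
class _BaseCache:
    def is_trimmable(self):
        return False  # default, inherited by ArraysCache

class KVCache(_BaseCache):
    def is_trimmable(self):
        return True  # attention layers can roll back per-token KV entries

class ArraysCache(_BaseCache):
    pass  # inherits is_trimmable() -> False

def can_trim_prompt_cache(cache):
    return all(c.is_trimmable() for c in cache)

# A hybrid model yields a mixed per-layer cache list:
hybrid = [KVCache(), ArraysCache(), KVCache()]
print(can_trim_prompt_cache(hybrid))  # False: the single ArraysCache blocks trimming
```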

Solution

Add is_trimmable() → True and trim(n) to ArraysCache.

Recurrent state is a compressed summary of all past tokens — it can't be partially
rolled back like KV cache. So trim(n) resets the recurrent state to empty; the
recurrent layers recompute on next forward while KVCache layers still benefit from
their own trim. For exact-match reuse (same prompt repeated), trim() is never
called and the full state is preserved.
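A hedged sketch of what the patch looks like, under the assumptions that ArraysCache stores its recurrent state in a list attribute (here called self.cache) and that trim(n) returns n for symmetry with KVCache.trim; neither internal detail is confirmed by the PR text.

```python
class _BaseCache:  # stub base class for illustration
    def is_trimmable(self):
        return False

class ArraysCache(_BaseCache):
    def __init__(self, size):
        # Assumed internal layout: one state slot per stored array.
        self.cache = [None] * size

    def is_trimmable(self):
        return True  # no longer blocks can_trim_prompt_cache()

    def trim(self, n):
        # Recurrent state is a compressed summary of all past tokens and
        # cannot be partially rolled back, so reset it entirely; the
        # recurrent layers recompute on the next forward pass while the
        # KVCache layers keep the benefit of their own trim.
        self.cache = [None] * len(self.cache)
        return n
```

On an exact-match cache hit, trim() is never reached, so the recurrent state survives untouched, which matches the exact-match numbers in the table below.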

Verification (Qwen3.5-4B-4bit, M4 16GB, mlx_lm.server)

| Request         | TTFT    | Cached    |
|-----------------|---------|-----------|
| Cold (571 tok)  | 2030 ms | 0 / 571   |
| Exact match     | 314 ms  | 570 / 571 |
| Prefix reuse    | 397 ms  | 424 / 431 |
| Extended prompt | 461 ms  | 564 / 577 |

Before this fix, prefix reuse and extended prompt got 0 cached tokens.

Fixes #1162
