Fix ArraysCache missing is_trimmable/trim for hybrid model prompt cache by EagerofLight · Pull Request #1254 · ml-explore/mlx-lm

EagerofLight · 2026-05-06T22:22:28Z

Problem

Hybrid models (Qwen3.5, Qwen3.6, Qwen3-Next) mix ArraysCache (DeltaNet recurrent layers)
with KVCache (attention layers). ArraysCache inherits is_trimmable() → False from
_BaseCache and has no trim() method, so can_trim_prompt_cache() returns False for
the entire hybrid cache.

This breaks three features, the first two silently:

- Prefix cache reuse in mlx_lm.server: fetch_nearest_cache() skips the trim path
- LRU prefix dedup: insert_cache() can't evict redundant prefixes
- Speculative decoding: generate.py raises ValueError on non-trimmable caches

Root cause

# cache.py — _BaseCache default
def is_trimmable(self):
    return False  # ArraysCache inherits this

# cache.py:88 — gate check
def can_trim_prompt_cache(cache):
    return all(c.is_trimmable() for c in cache)  # one False → entire cache rejected
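
To make the gate concrete, here is a minimal, self-contained sketch. The `Toy*` classes are hypothetical stand-ins, not the mlx-lm classes; only the `is_trimmable()` default and the `all(...)` check mirror the snippets above.

```python
class ToyBaseCache:
    """Stand-in for _BaseCache: not trimmable unless a subclass says so."""

    def is_trimmable(self):
        return False


class ToyKVCache(ToyBaseCache):
    """Stand-in for an attention-layer cache, which overrides the default."""

    def is_trimmable(self):
        return True


class ToyArraysCache(ToyBaseCache):
    """Stand-in for a recurrent-layer cache that inherits the False default."""


def can_trim_prompt_cache(cache):
    # A single non-trimmable layer makes the whole cache non-trimmable.
    return all(c.is_trimmable() for c in cache)


attention_only = [ToyKVCache() for _ in range(4)]
# Assumed layer mix for illustration: three recurrent layers, one attention layer.
hybrid = [ToyArraysCache(), ToyArraysCache(), ToyArraysCache(), ToyKVCache()]

print(can_trim_prompt_cache(attention_only))  # True
print(can_trim_prompt_cache(hybrid))          # False
```

The hybrid list is rejected even though its attention layer could be trimmed on its own.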

Solution

Add is_trimmable() → True and trim(n) to ArraysCache.

Recurrent state is a compressed summary of all past tokens — it can't be partially
rolled back like KV cache. So trim(n) resets the recurrent state to empty; the
recurrent layers recompute on next forward while KVCache layers still benefit from
their own trim. For exact-match reuse (same prompt repeated), trim() is never
called and the full state is preserved.
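
The behavior described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the diff from this PR: the `Toy*` classes, the `update()` helper, and the choice to return `n` from `trim()` are all hypothetical.

```python
class ToyKVCache:
    """Stand-in for an attention cache: one entry per token, so it can roll back."""

    def __init__(self):
        self.tokens = []

    def update(self, new_tokens):
        self.tokens.extend(new_tokens)

    def is_trimmable(self):
        return True

    def trim(self, n):
        n = min(len(self.tokens), n)
        if n > 0:  # guard: del tokens[-0:] would wipe the whole list
            del self.tokens[-n:]
        return n


class ToyArraysCache:
    """Stand-in for a recurrent cache: a fixed number of state slots."""

    def __init__(self, size):
        self.cache = [None] * size

    def update(self, new_tokens):
        # Fold every token into one running summary; individual tokens are lost.
        previous = self.cache[0] or 0
        self.cache[0] = previous + sum(new_tokens)

    def is_trimmable(self):
        return True

    def trim(self, n):
        # The summary cannot be rolled back by n tokens, so reset it to empty.
        self.cache = [None] * len(self.cache)
        return n  # assumption: report the requested count, like the KV stand-in


cache = [ToyArraysCache(size=2), ToyKVCache()]
for c in cache:
    c.update([1, 2, 3, 4, 5])

trimmed = [c.trim(2) for c in cache]
print(cache[0].cache)   # [None, None]
print(cache[1].tokens)  # [1, 2, 3]
```

After the trim, the attention stand-in keeps its first three tokens, while the recurrent stand-in starts from an empty state and has to be recomputed on the next forward pass.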

Verification (Qwen3.5-4B-4bit, M4 16GB, `mlx_lm.server`)

| Request | TTFT | Cached tokens (cached / prompt) |
| --- | --- | --- |
| Cold (571 tok) | 2030 ms | 0 / 571 |
| Exact match | 314 ms | 570 / 571 |
| Prefix reuse | 397 ms | 424 / 431 |
| Extended prompt | 461 ms | 564 / 577 |

Before this fix, prefix reuse and extended prompt got 0 cached tokens.
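
For reference, the cached counts reported above can be turned into hit rates and remaining prefill work with a few lines of arithmetic. The snippet only restates the numbers from the verification results; the variable names are illustrative.

```python
# (request, cached tokens, prompt tokens) as reported in the verification results
rows = [
    ("Cold", 0, 571),
    ("Exact match", 570, 571),
    ("Prefix reuse", 424, 431),
    ("Extended prompt", 564, 577),
]

hit_rates = {name: cached / total for name, cached, total in rows}
to_prefill = {name: total - cached for name, cached, total in rows}

for name in hit_rates:
    print(f"{name}: {hit_rates[name]:.1%} cached, {to_prefill[name]} tokens left to prefill")
```

This gives 0.0%, 99.8%, 98.4%, and 97.7% cached, leaving 571, 1, 7, and 13 tokens to prefill respectively.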

Fixes #1162

Fix ArraysCache missing is_trimmable/trim for hybrid model prompt cache

39eae9c

LDMB123 mentioned this pull request May 7, 2026

mlx_lm.server enters cascading wedge state under concurrent retry+heartbeat load (sliding-window models) #1255

Open

EagerofLight wants to merge 1 commit into ml-explore:main from EagerofLight:fix/arrays-cache-trimmable
