
Fix server XTC crash from heterogeneous xtc_special_tokens#1258

Draft
odysa wants to merge 1 commit into ml-explore:main from odysa:fix/server-xtc-heterogeneous-list

Conversation

@odysa odysa commented May 7, 2026

Summary

mlx_lm.server._make_sampler builds xtc_special_tokens as a heterogeneous list — [int, list[int]] — which the MLX fancy indexing in apply_xtc cannot handle. As a result, every chat-completion request with xtc_probability > 0 crashes the generation worker with ValueError: Initialization encountered extra dimension, and the client just sees a dropped connection.
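The indexing step in question is `mask[..., xtc_special_tokens] = False`, which only works when the index is a flat sequence of ints. A pure-Python sketch of the same semantics (vocabulary size and token ids chosen for illustration; the real code operates on MLX arrays):

```python
# Sketch of the masking step mask[..., xtc_special_tokens] = False,
# using a plain Python list to stand in for the MLX boolean mask.
vocab_size = 200                    # illustrative; real vocab is much larger
mask = [True] * vocab_size

# A flat list[int] indexes cleanly: each special token is unmasked once.
xtc_special_tokens = [107, 1, 106]
for tok in xtc_special_tokens:
    mask[tok] = False

# Exactly the three special-token positions were cleared.
cleared = [i for i, keep in enumerate(mask) if not keep]
```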

This fix flattens the list to match the construction already used in generate.py:2070 and chat.py:156, and switches to eos_token_ids for multi-EOS tokenizers.

- xtc_special_tokens=[
-     tokenizer.eos_token_id,
-     tokenizer.encode("\n"),
- ],
+ xtc_special_tokens=tokenizer.encode("\n", add_special_tokens=False)
+ + list(tokenizer.eos_token_ids),

For Gemma the produced list goes from [1, [2, 107]] (broken) to [107, 1, 106] (works).
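The before/after constructions can be sketched with a stand-in tokenizer whose ids mirror the Gemma example above (the class and its values are illustrative, not mlx_lm's real tokenizer API, though `eos_token_id`, `eos_token_ids`, and `encode` are the attributes the PR relies on):

```python
# Stand-in tokenizer mimicking Gemma's ids: 2 = BOS, 107 = "\n", 1/106 = EOS.
class FakeTokenizer:
    eos_token_id = 1
    eos_token_ids = [1, 106]  # multi-EOS tokenizers expose several ids

    def encode(self, text, add_special_tokens=True):
        ids = {"\n": [107]}[text]
        return ([2] + ids) if add_special_tokens else ids  # prepend BOS

tok = FakeTokenizer()

# Old construction: heterogeneous [int, list[int]] — the broken shape.
before = [tok.eos_token_id, tok.encode("\n")]

# New construction: flat list[int], no BOS, all EOS ids covered.
after = tok.encode("\n", add_special_tokens=False) + list(tok.eos_token_ids)
```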

Reproduction (before this PR)

import mlx.core as mx
from mlx_lm.sample_utils import apply_xtc
apply_xtc(mx.zeros((1, 100)), 0.5, 0.1, xtc_special_tokens=[1, [2, 107]])
# ValueError: Initialization encountered extra dimension.
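A defensive alternative (a hypothetical helper, not part of this PR, which instead fixes the construction site) would be to flatten whatever mix apply_xtc receives before indexing:

```python
def flatten_token_ids(tokens):
    """Flatten a possibly-nested list of token ids into a flat list[int].

    Hypothetical guard: accepts the broken [int, list[int]] shape as well
    as an already-flat list, and always returns flat ints.
    """
    flat = []
    for t in tokens:
        if isinstance(t, (list, tuple)):
            flat.extend(int(x) for x in t)
        else:
            flat.append(int(t))
    return flat
```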

End-to-end via the server:

mlx_lm.server --model mlx-community/gemma-3-1b-it-4bit-DWQ --port 8080 &
curl -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"default_model","messages":[{"role":"user","content":"hi"}],"xtc_probability":0.5,"xtc_threshold":0.1,"max_tokens":5}'
# server stderr: ValueError: Initialization encountered extra dimension.

Test plan

  • Repro apply_xtc crash on main.
  • Confirm fix produces a flat list[int] for Gemma 3 / Gemma 4 tokenizers.
  • Run sampler with the new list — no crash.
  • End-to-end server smoke test with xtc_probability > 0.
  • Existing tests pass.

`_make_sampler` constructed `xtc_special_tokens` as `[int, list[int]]`
(scalar `eos_token_id` + nested `tokenizer.encode("\n")`). MLX fancy-
indexing in `apply_xtc` (`mask[..., xtc_special_tokens] = False`) cannot
handle the nested list and raises `ValueError: Initialization
encountered extra dimension` on the first sampling step, crashing every
chat-completion request with `xtc_probability > 0`.

Match the flat-list construction already used in generate.py:2070 and
chat.py:156, and pass `add_special_tokens=False` so BOS isn't included.
Also covers multi-EOS tokenizers via `tokenizer.eos_token_ids`.
