
Commit 3b43ed7

[Qwen3.6] Stack per-expert MoE tensors during mlx_lm sanitize
Qwen-org Qwen3.6 MoE checkpoints (e.g. Qwen/Qwen3.6-35B-A3B-FP8) ship expert
MLPs as one tensor per expert per projection:

    model.language_model.layers.{L}.mlp.experts.{E}.{gate,up,down}_proj.weight

mlx_lm.qwen3_5_moe.Model.sanitize expects the combined-format layout
(experts.gate_up_proj / experts.down_proj) produced by mlx_lm.convert and
shipped by mlx-community redistributions. Loading a Qwen-org checkpoint fails
strict load_weights with thousands of unexpected keys (30720 for the 35B-A3B
variant: 256 experts x 40 layers x 3 projections).

Extend the existing FP8 sanitize compat shim with a pre-step that detects
per-expert MoE tensors, validates the index range is contiguous from 0, and
stacks them (mx.stack along axis 0) into the combined experts.gate_up_proj +
experts.down_proj form the upstream sanitize already handles. Pre-stacked
checkpoints are unaffected (the helper short-circuits when no per-expert keys
are present).

This is the downstream complement to ml-explore/mlx-lm#1224, which adds the
same stacking logic inline in qwen3_5_moe.Model.sanitize. When that lands and
vllm-metal's mlx-lm pin bumps past it, this shim can be removed.

Files:
- vllm_metal/compat.py: add _stack_qwen36_moe_per_expert_weights helper,
  chained after FP8 dequant in the patched sanitize.
- docs/supported_models.md: update Qwen3.6 row note.
- tests/test_qwen36_smoke.py: opt-in e2e smoke gated on the QWEN36_MOE_FP8_PATH
  env var (pytest skips by default; runs in 25s against a local Qwen3.6 MoE
  FP8 checkpoint).

Verified end-to-end on Qwen/Qwen3.6-35B-A3B-FP8: greedy decode of "The capital
of France is" returns " Paris, a city renowned for its iconic landmarks such"
with the hybrid SDPA + GDN linear attention path on Apple Silicon Metal. The
existing Qwen3.5 golden-token smoke (test_qwen35_smoke.py) is unchanged: 5/5
pass.
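The shape contract of the stacking step can be sketched with synthetic tensors (not part of the commit: numpy stands in for mx, which exposes the same `stack`/`concatenate` signatures, and the tiny dimensions are illustrative only):

```python
import numpy as np

# Synthetic per-expert checkpoint slice: 2 experts in one layer. The real
# Qwen3.6-35B-A3B checkpoint has 256 experts x 40 layers x 3 projections
# = 30720 per-expert tensors.
E, d_model, d_ff = 2, 4, 8
weights = {}
for e in range(E):
    base = f"model.language_model.layers.0.mlp.experts.{e}"
    weights[f"{base}.gate_proj.weight"] = np.zeros((d_ff, d_model))
    weights[f"{base}.up_proj.weight"] = np.zeros((d_ff, d_model))
    weights[f"{base}.down_proj.weight"] = np.zeros((d_model, d_ff))

# Stack over experts (axis 0), then fuse gate+up along the intermediate axis,
# matching the combined layout mlx_lm's sanitize expects.
prefix = "model.language_model.layers.0.mlp.experts"
gate = np.stack([weights[f"{prefix}.{e}.gate_proj.weight"] for e in range(E)])
up = np.stack([weights[f"{prefix}.{e}.up_proj.weight"] for e in range(E)])
down = np.stack([weights[f"{prefix}.{e}.down_proj.weight"] for e in range(E)])
stacked = {
    f"{prefix}.gate_up_proj": np.concatenate([gate, up], axis=-2),  # (E, 2*d_ff, d_model)
    f"{prefix}.down_proj": down,                                    # (E, d_model, d_ff)
}
print(stacked[f"{prefix}.gate_up_proj"].shape)  # (2, 16, 4)
```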
1 parent 992c797 commit 3b43ed7

3 files changed

Lines changed: 152 additions & 2 deletions

File tree

docs/supported_models.md

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ Metal. Qwen3 is explicitly covered by the paged prefix-cache e2e test.
 | --- | --- | --- | --- | --- | --- |
 | Qwen3 || GQA (paged) || [#232](https://github.com/vllm-project/vllm-metal/pull/232), [#237](https://github.com/vllm-project/vllm-metal/pull/237), [#283](https://github.com/vllm-project/vllm-metal/pull/283) | Validated by the paged prefix-cache e2e test |
 | Qwen3.5 || Hybrid SDPA + GDN linear || [#210](https://github.com/vllm-project/vllm-metal/pull/210), [#226](https://github.com/vllm-project/vllm-metal/pull/226), [#230](https://github.com/vllm-project/vllm-metal/pull/230), [#235](https://github.com/vllm-project/vllm-metal/pull/235), [#239](https://github.com/vllm-project/vllm-metal/pull/239), [#243](https://github.com/vllm-project/vllm-metal/pull/243), [#259](https://github.com/vllm-project/vllm-metal/pull/259), [#265](https://github.com/vllm-project/vllm-metal/pull/265), [#194](https://github.com/vllm-project/vllm-metal/issues/194) | Upstream keeps automatic prefix caching off for hybrid/Mamba models |
-| Qwen3.6 || Hybrid SDPA + GDN linear (MoE) || | Upstream keeps automatic prefix caching off for hybrid/Mamba models |
+| Qwen3.6 || Hybrid SDPA + GDN linear (MoE) || | Verified on `Qwen/Qwen3.6-35B-A3B-FP8`. Per-expert MoE tensors stacked at sanitize. Upstream keeps automatic prefix caching off for hybrid/Mamba models |
 | Qwen3-Next || Hybrid SDPA + GDN linear || [#240](https://github.com/vllm-project/vllm-metal/pull/240) | Upstream keeps automatic prefix caching off for hybrid/Mamba models |
 | Gemma 4 | 🔵 | GQA + per-layer sliding window + YOCO || [#251](https://github.com/vllm-project/vllm-metal/pull/251), [#260](https://github.com/vllm-project/vllm-metal/pull/260), [#269](https://github.com/vllm-project/vllm-metal/pull/269), [#275](https://github.com/vllm-project/vllm-metal/pull/275), [#277](https://github.com/vllm-project/vllm-metal/pull/277), [#278](https://github.com/vllm-project/vllm-metal/pull/278), [#282](https://github.com/vllm-project/vllm-metal/pull/282), [#276](https://github.com/vllm-project/vllm-metal/issues/276), [#279](https://github.com/vllm-project/vllm-metal/pull/279), [#281](https://github.com/vllm-project/vllm-metal/issues/281), [#283](https://github.com/vllm-project/vllm-metal/pull/283) | Default-on for non-hybrid paged models; overall model support remains experimental |
 | Gemma 3 || GQA (paged) || [#283](https://github.com/vllm-project/vllm-metal/pull/283) | tested on gemma-3-1b-it-qat-4bit; gemma-3-4b-it-4bit verified for text-only generation with VLM image inputs bypassed |

tests/test_qwen36_smoke.py

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
+# SPDX-License-Identifier: Apache-2.0
+"""End-to-end smoke for Qwen3.6 MoE FP8 (Qwen-org per-expert layout).
+
+Exercises the per-expert MoE stacking compat path in
+``vllm_metal.compat._stack_qwen36_moe_per_expert_weights`` plus FP8 dequant,
+hybrid SDPA + GDN linear attention, and paged KV cache. Skipped unless a local
+checkpoint is available, since the smallest Qwen3.6 MoE FP8 weight is ~35 GB
+and is not appropriate for CI.
+
+Run with a local checkpoint:
+
+    QWEN36_MOE_FP8_PATH=~/models/Qwen3.6-35B-A3B-FP8 \\
+    VLLM_ENABLE_V1_MULTIPROCESSING=0 \\
+    python -m pytest tests/test_qwen36_smoke.py -v -s
+"""
+
+from __future__ import annotations
+
+import os
+from pathlib import Path
+
+import pytest
+
+MODEL_PATH_ENV = "QWEN36_MOE_FP8_PATH"
+MAX_TOKENS = 10
+PROMPT = "The capital of France is"
+
+
+def _resolved_model_path() -> Path | None:
+    raw = os.environ.get(MODEL_PATH_ENV)
+    if not raw:
+        return None
+    path = Path(os.path.expanduser(raw))
+    return path if path.is_dir() else None
+
+
+@pytest.fixture(autouse=True, scope="module")
+def _set_env():
+    with pytest.MonkeyPatch.context() as mp:
+        mp.setenv("VLLM_ENABLE_V1_MULTIPROCESSING", "0")
+        if "VLLM_METAL_MEMORY_FRACTION" not in os.environ:
+            mp.setenv("VLLM_METAL_MEMORY_FRACTION", "auto")
+        yield
+
+
+@pytest.mark.slow
+def test_qwen36_moe_fp8_generates():
+    model_path = _resolved_model_path()
+    if model_path is None:
+        pytest.skip(
+            f"Set {MODEL_PATH_ENV} to a Qwen3.6 MoE FP8 checkpoint directory to run."
+        )
+
+    from vllm import LLM, SamplingParams
+
+    llm = LLM(model=str(model_path), max_model_len=512, max_num_seqs=1)
+    sp = SamplingParams(temperature=0, max_tokens=MAX_TOKENS)
+    outputs = llm.generate([PROMPT], sp)
+
+    text = outputs[0].outputs[0].text
+    token_ids = list(outputs[0].outputs[0].token_ids)
+
+    print(f"\n prompt: {PROMPT!r}")
+    print(f" output: {text!r}")
+    print(f" ids: {token_ids}")
+
+    # Loose factual assertion: greedy decode of "The capital of France is" must
+    # surface "Paris" within the first MAX_TOKENS tokens for any reasonable
+    # Qwen3.6-A3B variant. Tighter golden IDs would be brittle across mlx
+    # versions and quant formats.
+    assert len(token_ids) == MAX_TOKENS, f"expected {MAX_TOKENS} tokens, got {token_ids}"
+    assert "Paris" in text, f"expected 'Paris' in greedy output, got {text!r}"

vllm_metal/compat.py

Lines changed: 79 additions & 1 deletion
@@ -131,6 +131,79 @@ def _dequantize_qwen35_fp8_weights(
     return new_weights
 
 
+def _stack_qwen36_moe_per_expert_weights(
+    weights: Mapping[str, Any], mx: Any
+) -> Mapping[str, Any]:
+    """Combine per-expert MoE tensors into the stacked layout mlx_lm expects.
+
+    Qwen-org Qwen3.6 MoE checkpoints (e.g. ``Qwen/Qwen3.6-35B-A3B-FP8``) ship
+    expert MLPs as one tensor per expert per projection:
+    ``...mlp.experts.{E}.{gate,up,down}_proj.weight``. ``mlx_lm.qwen3_5_moe``'s
+    ``sanitize`` expects them already concatenated as
+    ``...mlp.experts.gate_up_proj`` (gate then up along the intermediate axis)
+    and ``...mlp.experts.down_proj``, both stacked along axis 0 over experts.
+
+    No-op when no per-expert keys are present (dense Qwen3.5/3.6 or already-
+    stacked MoE checkpoints).
+    """
+    experts_marker = ".mlp.experts."
+    proj_suffixes = (".gate_proj.weight", ".up_proj.weight", ".down_proj.weight")
+    groups: dict[str, dict[str, dict[int, Any]]] = {}
+    consumed: set[str] = set()
+    for key in weights:
+        marker_pos = key.find(experts_marker)
+        if marker_pos == -1:
+            continue
+        suffix = next((s for s in proj_suffixes if key.endswith(s)), None)
+        if suffix is None:
+            continue
+        index_start = marker_pos + len(experts_marker)
+        index_end = len(key) - len(suffix)
+        index_str = key[index_start:index_end]
+        if not index_str.isdigit():
+            continue
+        prefix = key[: marker_pos + len(".mlp.experts")]
+        proj = suffix[1:-len(".weight")]  # ".gate_proj.weight" -> "gate_proj"
+        groups.setdefault(prefix, {}).setdefault(proj, {})[int(index_str)] = weights[key]
+        consumed.add(key)
+
+    if not groups:
+        return weights
+
+    logger.debug(
+        "Stacking per-expert MoE tensors at %d prefixes (%d tensors consumed)",
+        len(groups),
+        len(consumed),
+    )
+    new_weights = {k: v for k, v in weights.items() if k not in consumed}
+    for prefix, proj_to_experts in groups.items():
+        missing = {"gate_proj", "up_proj", "down_proj"} - proj_to_experts.keys()
+        if missing:
+            raise ValueError(
+                f"Incomplete per-expert MoE tensors at {prefix!r}: "
+                f"missing projections {sorted(missing)}."
+            )
+        expert_indices = sorted(proj_to_experts["gate_proj"].keys())
+        expected = list(range(len(expert_indices)))
+        if expert_indices != expected:
+            raise ValueError(
+                f"Non-contiguous per-expert MoE indices at {prefix!r}: "
+                f"got {expert_indices[:3]}{expert_indices[-3:]}, "
+                f"expected 0..{len(expected) - 1}."
+            )
+        for proj in ("up_proj", "down_proj"):
+            if sorted(proj_to_experts[proj].keys()) != expert_indices:
+                raise ValueError(
+                    f"Per-expert MoE index mismatch at {prefix!r}.{proj}."
+                )
+        gate = mx.stack([proj_to_experts["gate_proj"][i] for i in expert_indices])
+        up = mx.stack([proj_to_experts["up_proj"][i] for i in expert_indices])
+        down = mx.stack([proj_to_experts["down_proj"][i] for i in expert_indices])
+        new_weights[f"{prefix}.gate_up_proj"] = mx.concatenate([gate, up], axis=-2)
+        new_weights[f"{prefix}.down_proj"] = down
+    return new_weights
+
+
 def _patch_mlx_lm_qwen35_fp8_sanitize() -> None:
     """Teach mlx_lm's Qwen3.5 loaders to consume local FP8 ``weight_scale_inv``.
@@ -177,11 +250,16 @@ def _patch_mlx_lm_qwen35_fp8_sanitize() -> None:
         )
         return
 
+    def _transform(_self, weights):
+        weights = _dequantize_qwen35_fp8_weights(weights, mx)
+        weights = _stack_qwen36_moe_per_expert_weights(weights, mx)
+        return weights
+
     def _patch_model_sanitize(model_cls) -> bool:
         return _wrap_model_sanitize(
             model_cls,
             "_vllm_metal_qwen35_fp8_patch",
-            lambda _self, weights: _dequantize_qwen35_fp8_weights(weights, mx),
+            _transform,
         )
 
     patched_modules = []
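The wrap-and-guard pattern that `_patch_model_sanitize` relies on can be illustrated generically (a minimal sketch with hypothetical names; `_wrap_model_sanitize`'s real signature lives in vllm_metal/compat.py):

```python
def wrap_sanitize(model_cls, guard_attr, transform):
    """Run `transform` on the weights before the class's original sanitize.

    `guard_attr` marks the class so repeated patching is a no-op.
    """
    if getattr(model_cls, guard_attr, False):
        return False  # already patched
    original = model_cls.sanitize

    def sanitize(self, weights):
        # Transform first, then hand off to the untouched upstream sanitize.
        return original(self, transform(self, weights))

    model_cls.sanitize = sanitize
    setattr(model_cls, guard_attr, True)
    return True


# Demo with a stand-in model class: the transform lowercases keys, then the
# original sanitize drops rotary-embedding keys.
class Model:
    def sanitize(self, weights):
        return {k: v for k, v in weights.items() if not k.endswith(".rope")}

wrap_sanitize(Model, "_patched", lambda _self, w: {k.lower(): v for k, v in w.items()})
out = Model().sanitize({"A.weight": 1, "B.rope": 2})
print(out)  # {'a.weight': 1}
```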
