Commit 197215d
[Qwen3.6] Stack per-expert MoE tensors during mlx_lm sanitize (#312)
Tracking issue: #289

## Summary

`Qwen/Qwen3.6-35B-A3B-FP8` ships expert MLPs as one tensor per expert per projection:

```
model.language_model.layers.{L}.mlp.experts.{E}.{gate,up,down}_proj.weight
```

The bf16 master `Qwen/Qwen3.6-35B-A3B` is already pre-stacked (`experts.gate_up_proj` / `experts.down_proj`) and loads unchanged via the existing combined-format branch in `mlx_lm.qwen3_5_moe.Model.sanitize` — only the FP8 release lands per-expert, likely because Qwen's FP8 quantization pipeline runs per-expert and the artifact is not re-stacked.

On vllm-metal `main`, loading `Qwen/Qwen3.6-35B-A3B-FP8` fails strict `load_weights` with `Received 30720 parameters not in model` — these are per-expert MoE tensors that vllm-metal's existing FP8 dequant compat doesn't address. (For reference: the same checkpoint on vanilla mlx-lm fails with 61,690 keys; the difference is the 30,970 FP8 `weight_scale_inv` tensors that vllm-metal's `compat.py::_dequantize_qwen35_fp8_weights` already handles separately.)

## What this PR does

Adds `_stack_qwen36_moe_per_expert_weights`, chained after FP8 dequant in the MoE sanitize wrapper. It works in three steps (a toy numpy sketch of these steps follows just after this commit message):

1. **Scan** the weights dict for per-layer experts prefixes and their expert-index sets.
2. **Validate** that each prefix's index set is a contiguous `{0, 1, …, N-1}` (raise `ValueError` otherwise).
3. **Walk** the per-expert tensors in order, `mx.stack` along axis 0, `mx.concatenate` gate+up along the intermediate-dim axis, and emit the combined `experts.gate_up_proj` / `experts.down_proj` form upstream sanitize already handles.

Pre-stacked checkpoints are unaffected: the helper short-circuits when no per-expert keys are present.

The MoE-only nature of the stacking is made explicit by splitting the sanitize patch by model class:

- `mlx_lm.models.qwen3_5.Model` → wrapped with **FP8 dequant only** (`_transform_dense`)
- `mlx_lm.models.qwen3_5_moe.Model` → wrapped with **FP8 dequant + per-expert stacking** (`_transform_moe`)

Routing is driven by an explicit `transforms_by_module` map; future Qwen variants added without a corresponding entry are logged as `unpatchable` rather than silently inheriting one of the two transforms.

## Why a vllm-metal compat shim instead of waiting for upstream

mlx-community publishes pre-stacked redistributions of this checkpoint that already load on existing mlx-lm. This shim lets users load Qwen-org's canonical FP8 artifact directly, without a 35 GB → 70 GB bf16 intermediate conversion step that doesn't fit on memory-constrained Macs (≤64 GB).

This complements ml-explore/mlx-lm#1224, which adds the same per-expert stacking logic plus FP8 `weight_scale_inv` dequant for the qwen3_5 family inline in upstream sanitize. Once mlx-lm#1224 lands and vllm-metal's mlx-lm pin bumps past a release containing it, both this PR's per-expert stacking shim and the existing `_dequantize_qwen35_fp8_weights` shim in `compat.py` become removable in a follow-up cleanup.

## Files

- `vllm_metal/compat.py` — add the `_stack_qwen36_moe_per_expert_weights` helper; split the sanitize patches into per-class transforms via `transforms_by_module`.
- `docs/supported_models.md` — update the Qwen3.6 row note and link this PR.
- `tests/test_compat.py` — five new unit tests using the existing numpy-fake-mlx fixture (no real model weights, runs in milliseconds):
  - `test_per_expert_moe_tensors_stack_to_combined` — positive: per-expert input produces correctly stacked combined output, with content preserved per axis-0 slot.
  - `test_pre_stacked_moe_is_noop_for_per_expert_helper` — regression: pre-stacked input passes through unchanged (covers mlx-community redistributions and the Qwen3.6 bf16 master).
  - `test_non_contiguous_per_expert_indices_raise` — defensive: a malformed `{0, 1, 3}` checkpoint raises `ValueError`.
  - `test_missing_projection_family_raises` — defensive: a checkpoint missing an entire projection family, or with mismatched index sets across families, raises a `ValueError` naming the problem.
  - `test_per_expert_helper_does_not_run_on_dense_qwen35` — architecture invariant: the dense path doesn't run the MoE helper.

## Verification (per #289 pass bar)

| Checkpoint | Hardware | Status | Output |
| --- | --- | --- | --- |
| `Qwen/Qwen3.6-35B-A3B-FP8` | M3 Max / 128 GB | ✅ loads, generates | `"The capital of France is"` → `" Paris, a city renowned for its iconic landmarks such"` |
| `Qwen/Qwen3.6-35B-A3B` (bf16) | M3 Max / 128 GB | ✅ loads via the unchanged combined-format branch (both shims dormant — confirms the new branch is properly gated) | same output |

- Hybrid SDPA + GDN linear attention path on Apple Silicon Metal, paged KV cache.
- `pytest tests/test_compat.py`: **15 passed, 1 skipped** (the skip is the pre-existing `VLLM_METAL_RUN_REAL_MLX_FP8_TESTS=1`-gated test).
- Existing Qwen3.5 golden-token smoke (`test_qwen35_smoke.py`): 5/5 pass, unchanged.
- `bash scripts/lint.sh`: clean (shellcheck, ruff check, ruff format --check, mypy).

Rebased on latest `main` (3323d32) and re-validated against the bumped dep stack: `mlx-lm 0.31.3` (from #313), `vllm 0.20.0+cpu` (from #262), `transformers 5.7.0`.

---------

Signed-off-by: Shivendra Dayal <sdayal@gmail.com>
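To make the axis bookkeeping concrete, here is the toy sketch referenced in the step list above. It uses numpy in place of `mlx.core` (the same substitution the unit tests' fake-mlx fixture makes); the shapes and fill values are illustrative only, not taken from the checkpoint.

```python
import numpy as np

n_experts, d_inter, d_hidden = 2, 6, 4
prefix = "model.language_model.layers.0.mlp.experts"

# Per-expert layout as shipped by Qwen/Qwen3.6-35B-A3B-FP8, with toy values.
weights = {}
for e in range(n_experts):
    weights[f"{prefix}.{e}.gate_proj.weight"] = np.full((d_inter, d_hidden), 3.0 * e + 1)
    weights[f"{prefix}.{e}.up_proj.weight"] = np.full((d_inter, d_hidden), 3.0 * e + 2)
    weights[f"{prefix}.{e}.down_proj.weight"] = np.full((d_hidden, d_inter), 3.0 * e + 3)

# Walk in expert order: stack each projection family along a new axis 0
# (experts), then concatenate gate and up along axis -2, the intermediate
# dimension of the (experts, intermediate, hidden) stack.
gates = np.stack([weights[f"{prefix}.{e}.gate_proj.weight"] for e in range(n_experts)])
ups = np.stack([weights[f"{prefix}.{e}.up_proj.weight"] for e in range(n_experts)])
downs = np.stack([weights[f"{prefix}.{e}.down_proj.weight"] for e in range(n_experts)])
gate_up = np.concatenate([gates, ups], axis=-2)

assert gate_up.shape == (n_experts, 2 * d_inter, d_hidden)  # (2, 12, 4)
assert downs.shape == (n_experts, d_hidden, d_inter)  # (2, 4, 6)
# Expert 0's gate fills the first half of axis -2; its up fills the second.
assert (gate_up[0, :d_inter] == 1.0).all()
assert (gate_up[0, d_inter:] == 2.0).all()
```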
1 parent ac94ebb commit 197215d

3 files changed

Lines changed: 283 additions & 10 deletions


docs/supported_models.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -26,7 +26,7 @@ Metal. Qwen3 is explicitly covered by the paged prefix-cache e2e test.
 | --- | --- | --- | --- | --- | --- |
 | Qwen3 || GQA (paged) || [#232](https://github.com/vllm-project/vllm-metal/pull/232), [#237](https://github.com/vllm-project/vllm-metal/pull/237), [#283](https://github.com/vllm-project/vllm-metal/pull/283) | Validated by the paged prefix-cache e2e test |
 | Qwen3.5 || Hybrid SDPA + GDN linear || [#210](https://github.com/vllm-project/vllm-metal/pull/210), [#226](https://github.com/vllm-project/vllm-metal/pull/226), [#230](https://github.com/vllm-project/vllm-metal/pull/230), [#235](https://github.com/vllm-project/vllm-metal/pull/235), [#239](https://github.com/vllm-project/vllm-metal/pull/239), [#243](https://github.com/vllm-project/vllm-metal/pull/243), [#259](https://github.com/vllm-project/vllm-metal/pull/259), [#265](https://github.com/vllm-project/vllm-metal/pull/265), [#194](https://github.com/vllm-project/vllm-metal/issues/194) | Upstream keeps automatic prefix caching off for hybrid/Mamba models |
-| Qwen3.6 || Hybrid SDPA + GDN linear (MoE) || | Upstream keeps automatic prefix caching off for hybrid/Mamba models |
+| Qwen3.6 || Hybrid SDPA + GDN linear (MoE) || [#312](https://github.com/vllm-project/vllm-metal/pull/312) | Verified on `Qwen/Qwen3.6-35B-A3B-FP8`. Per-expert MoE tensors stacked at sanitize. Upstream keeps automatic prefix caching off for hybrid/Mamba models |
 | Qwen3-Next || Hybrid SDPA + GDN linear || [#240](https://github.com/vllm-project/vllm-metal/pull/240) | Upstream keeps automatic prefix caching off for hybrid/Mamba models |
 | Gemma 4 | 🔵 | GQA + per-layer sliding window + YOCO || [#251](https://github.com/vllm-project/vllm-metal/pull/251), [#260](https://github.com/vllm-project/vllm-metal/pull/260), [#269](https://github.com/vllm-project/vllm-metal/pull/269), [#275](https://github.com/vllm-project/vllm-metal/pull/275), [#277](https://github.com/vllm-project/vllm-metal/pull/277), [#278](https://github.com/vllm-project/vllm-metal/pull/278), [#282](https://github.com/vllm-project/vllm-metal/pull/282), [#276](https://github.com/vllm-project/vllm-metal/issues/276), [#279](https://github.com/vllm-project/vllm-metal/pull/279), [#281](https://github.com/vllm-project/vllm-metal/issues/281), [#283](https://github.com/vllm-project/vllm-metal/pull/283) | Default-on for non-hybrid paged models; overall model support remains experimental |
 | Gemma 3 || GQA (paged) || [#283](https://github.com/vllm-project/vllm-metal/pull/283) | tested on gemma-3-1b-it-qat-4bit; gemma-3-4b-it-4bit verified for text-only generation with VLM image inputs bypassed |
```

tests/test_compat.py

Lines changed: 157 additions & 0 deletions
```diff
@@ -21,6 +21,8 @@ def _install_fake_qwen35_modules(monkeypatch, *, include_moe: bool):
     mlx_core.bfloat16 = np.float32
     mlx_core.from_fp8 = lambda weight, dtype=None: np.asarray(weight, dtype=np.float32)
     mlx_core.pad = lambda weight, pad_width: np.pad(weight, pad_width)
+    mlx_core.stack = lambda arrays, axis=0: np.stack(arrays, axis=axis)
+    mlx_core.concatenate = lambda arrays, axis=0: np.concatenate(arrays, axis=axis)
     mlx_pkg.core = mlx_core
     monkeypatch.setitem(sys.modules, "mlx", mlx_pkg)
     monkeypatch.setitem(sys.modules, "mlx.core", mlx_core)
@@ -175,6 +177,161 @@ def test_patches_higher_rank_weights_for_moe(self, monkeypatch) -> None:
         assert f"{gate_up_proj_prefix}.activation_scale" not in sanitized
         assert sanitized[f"{gate_up_proj_prefix}.weight"].shape == (2, 256, 128)
 
+    def test_per_expert_moe_tensors_stack_to_combined(self, monkeypatch) -> None:
+        # Qwen/Qwen3.6-35B-A3B-FP8 ships expert MLPs per-expert. The MoE
+        # sanitize wrapper must stack them along axis 0 and concatenate
+        # gate+up along the intermediate-dim axis, producing the combined
+        # form upstream sanitize already handles.
+        _, moe_module = _install_fake_qwen35_modules(monkeypatch, include_moe=True)
+        prefix = "model.language_model.layers.0.mlp.experts"
+
+        compat._patch_mlx_lm_qwen35_fp8_sanitize()
+
+        per_expert = {
+            f"{prefix}.0.gate_proj.weight": np.full((6, 4), 1.0),
+            f"{prefix}.0.up_proj.weight": np.full((6, 4), 2.0),
+            f"{prefix}.0.down_proj.weight": np.full((4, 6), 3.0),
+            f"{prefix}.1.gate_proj.weight": np.full((6, 4), 4.0),
+            f"{prefix}.1.up_proj.weight": np.full((6, 4), 5.0),
+            f"{prefix}.1.down_proj.weight": np.full((4, 6), 6.0),
+        }
+        sanitized = moe_module.Model().sanitize(per_expert)
+
+        gate_up_key = f"{prefix}.gate_up_proj"
+        down_key = f"{prefix}.down_proj"
+        assert gate_up_key in sanitized
+        assert down_key in sanitized
+        # gate_up: (num_experts, 2*intermediate, hidden); down: (num_experts, hidden, intermediate)
+        assert sanitized[gate_up_key].shape == (2, 12, 4)
+        assert sanitized[down_key].shape == (2, 4, 6)
+        # Per-expert keys must not leak through after stacking.
+        assert all(".experts.0." not in k for k in sanitized)
+        assert all(".experts.1." not in k for k in sanitized)
+        # Stacking preserves per-expert content along axis 0; gate occupies
+        # the first half of axis -2, up occupies the second half.
+        np.testing.assert_array_equal(
+            sanitized[gate_up_key][0, :6, :], np.full((6, 4), 1.0)
+        )
+        np.testing.assert_array_equal(
+            sanitized[gate_up_key][0, 6:, :], np.full((6, 4), 2.0)
+        )
+        np.testing.assert_array_equal(
+            sanitized[down_key][1, :, :], np.full((4, 6), 6.0)
+        )
+
+    def test_pre_stacked_moe_is_noop_for_per_expert_helper(self, monkeypatch) -> None:
+        # Pre-stacked checkpoints (mlx-community redistributions, Qwen3.6 bf16
+        # master) ship `experts.gate_up_proj` / `experts.down_proj` already
+        # combined. The per-expert helper must short-circuit and pass them
+        # through unchanged, leaving the combined-format branch in upstream
+        # sanitize free to do its split.
+        _, moe_module = _install_fake_qwen35_modules(monkeypatch, include_moe=True)
+        prefix = "model.language_model.layers.0.mlp.experts"
+
+        compat._patch_mlx_lm_qwen35_fp8_sanitize()
+
+        gate_up = np.arange(2 * 12 * 4, dtype=np.float32).reshape(2, 12, 4)
+        down = np.arange(2 * 4 * 6, dtype=np.float32).reshape(2, 4, 6)
+        weights = {
+            f"{prefix}.gate_up_proj": gate_up,
+            f"{prefix}.down_proj": down,
+        }
+        sanitized = moe_module.Model().sanitize(weights)
+
+        # Helper is a no-op: combined keys present unchanged, no per-expert
+        # keys appear.
+        np.testing.assert_array_equal(sanitized[f"{prefix}.gate_up_proj"], gate_up)
+        np.testing.assert_array_equal(sanitized[f"{prefix}.down_proj"], down)
+        assert not any(f"{prefix}.0." in k for k in sanitized)
+
+    def test_non_contiguous_per_expert_indices_raise(self, monkeypatch) -> None:
+        # Defensive: a malformed checkpoint shipping experts {0, 1, 3} (skipping
+        # 2) would silently drop expert 3 if the stacker walked indices in
+        # order. Helper must raise loudly so the user diagnoses the missing
+        # tensor instead of getting subtly wrong output.
+        _, moe_module = _install_fake_qwen35_modules(monkeypatch, include_moe=True)
+        prefix = "model.language_model.layers.0.mlp.experts"
+
+        compat._patch_mlx_lm_qwen35_fp8_sanitize()
+
+        gapped = {
+            f"{prefix}.0.gate_proj.weight": np.zeros((6, 4)),
+            f"{prefix}.0.up_proj.weight": np.zeros((6, 4)),
+            f"{prefix}.0.down_proj.weight": np.zeros((4, 6)),
+            f"{prefix}.1.gate_proj.weight": np.zeros((6, 4)),
+            f"{prefix}.1.up_proj.weight": np.zeros((6, 4)),
+            f"{prefix}.1.down_proj.weight": np.zeros((4, 6)),
+            f"{prefix}.3.gate_proj.weight": np.zeros((6, 4)),
+            f"{prefix}.3.up_proj.weight": np.zeros((6, 4)),
+            f"{prefix}.3.down_proj.weight": np.zeros((4, 6)),
+        }
+
+        with pytest.raises(ValueError, match="non-contiguous"):
+            moe_module.Model().sanitize(gapped)
+
+    def test_missing_projection_family_raises(self, monkeypatch) -> None:
+        # Defensive: a malformed checkpoint missing one entire projection
+        # family (e.g., no down_proj at all) must surface as a clear
+        # ValueError naming the missing family, rather than a raw KeyError
+        # leaking from the walk step. The same path also covers the case
+        # where only some experts have a given projection (mismatched index
+        # sets across families).
+        _, moe_module = _install_fake_qwen35_modules(monkeypatch, include_moe=True)
+        prefix = "model.language_model.layers.0.mlp.experts"
+
+        compat._patch_mlx_lm_qwen35_fp8_sanitize()
+
+        # 1) Entire down_proj family absent.
+        no_down = {
+            f"{prefix}.0.gate_proj.weight": np.zeros((6, 4)),
+            f"{prefix}.0.up_proj.weight": np.zeros((6, 4)),
+            f"{prefix}.1.gate_proj.weight": np.zeros((6, 4)),
+            f"{prefix}.1.up_proj.weight": np.zeros((6, 4)),
+        }
+        with pytest.raises(ValueError, match="missing projection families"):
+            moe_module.Model().sanitize(no_down)
+
+        # 2) down_proj missing for one expert (mismatched index sets).
+        partial_down = {
+            f"{prefix}.0.gate_proj.weight": np.zeros((6, 4)),
+            f"{prefix}.0.up_proj.weight": np.zeros((6, 4)),
+            f"{prefix}.0.down_proj.weight": np.zeros((4, 6)),
+            f"{prefix}.1.gate_proj.weight": np.zeros((6, 4)),
+            f"{prefix}.1.up_proj.weight": np.zeros((6, 4)),
+            # missing f"{prefix}.1.down_proj.weight"
+        }
+        with pytest.raises(ValueError, match="mismatched down_proj"):
+            moe_module.Model().sanitize(partial_down)
+
+    def test_per_expert_helper_does_not_run_on_dense_qwen35(self, monkeypatch) -> None:
+        # The dense qwen3_5 patch wraps sanitize with FP8 dequant only — the
+        # per-expert stacking helper must NOT run on dense Qwen3.5/3.6
+        # checkpoints (no expert tensors exist in dense models, so even an
+        # accidental call would be a no-op, but the patch architecture
+        # makes the MoE-only nature explicit).
+        dense_module, _ = _install_fake_qwen35_modules(monkeypatch, include_moe=True)
+
+        compat._patch_mlx_lm_qwen35_fp8_sanitize()
+
+        # Dense weights with FP8 quant; no expert tensors anywhere.
+        sanitized = dense_module.Model().sanitize(
+            {
+                "model.language_model.layers.0.self_attn.q_proj.weight": np.ones(
+                    (128, 128)
+                ),
+                "model.language_model.layers.0.self_attn.q_proj.weight_scale_inv": np.ones(
+                    (1, 1)
+                ),
+            }
+        )
+        assert (
+            "model.language_model.layers.0.self_attn.q_proj.weight_scale_inv"
+            not in sanitized
+        )
+        assert sanitized[
+            "model.language_model.layers.0.self_attn.q_proj.weight"
+        ].shape == (128, 128)
+
 
 def _install_fake_gemma4_text_module(
     monkeypatch,
```

vllm_metal/compat.py

Lines changed: 125 additions & 9 deletions
```diff
@@ -131,6 +131,101 @@ def _dequantize_qwen35_fp8_weights(
     return new_weights
 
 
+def _stack_qwen36_moe_per_expert_weights(
+    weights: Mapping[str, Any], mx: Any
+) -> Mapping[str, Any]:
+    """Combine per-expert MoE tensors into the stacked layout mlx_lm expects.
+
+    ``Qwen/Qwen3.6-35B-A3B-FP8`` ships expert MLPs as one tensor per expert
+    per projection: ``...mlp.experts.{E}.{gate,up,down}_proj.weight``. The
+    bf16 master ``Qwen/Qwen3.6-35B-A3B`` is already pre-stacked and falls
+    through to the existing combined-format branch in
+    ``mlx_lm.qwen3_5_moe.sanitize`` unchanged. ``mlx_lm.qwen3_5_moe``'s
+    ``sanitize`` expects experts concatenated as
+    ``...mlp.experts.gate_up_proj`` (gate then up along the intermediate axis)
+    and ``...mlp.experts.down_proj``, both stacked along axis 0 over experts.
+
+    Mirrors the (scan -> validate -> walk) structure of upstream
+    ml-explore/mlx-lm#1224. Removable once vllm-metal's mlx-lm pin bumps
+    past that merge.
+
+    No-op when no per-expert keys are present (dense Qwen3.5/3.6 or already-
+    stacked MoE checkpoints).
+    """
+    experts_marker = ".mlp.experts."
+    proj_suffixes = (".gate_proj.weight", ".up_proj.weight", ".down_proj.weight")
+    # Scan: discover per-layer experts prefixes and per-projection index sets
+    # for all three projection families, so a checkpoint missing one family
+    # (or with a mismatched index set across families) fails validation
+    # cleanly instead of leaking a KeyError during the walk.
+    layer_proj_indices: dict[str, dict[str, set[int]]] = {}
+    for key in weights:
+        marker_pos = key.find(experts_marker)
+        if marker_pos == -1:
+            continue
+        suffix = next((s for s in proj_suffixes if key.endswith(s)), None)
+        if suffix is None:
+            continue
+        index_start = marker_pos + len(experts_marker)
+        index_end = len(key) - len(suffix)
+        tail = key[index_start:index_end]
+        if not tail.isdigit():
+            continue
+        prefix = key[: marker_pos + len(".mlp.experts")]
+        proj = suffix[1 : -len(".weight")]  # ".gate_proj.weight" -> "gate_proj"
+        layer_proj_indices.setdefault(prefix, {}).setdefault(proj, set()).add(int(tail))
+
+    if not layer_proj_indices:
+        return weights
+
+    logger.debug(
+        "Stacking per-expert MoE tensors at %d prefixes",
+        len(layer_proj_indices),
+    )
+    required_projs = ("gate_proj", "up_proj", "down_proj")
+    new_weights = dict(weights)
+    for prefix, proj_to_indices in layer_proj_indices.items():
+        # Validate: every prefix must have all three projection families, and
+        # all three must share the same contiguous {0..N-1} index set.
+        missing_projs = [p for p in required_projs if p not in proj_to_indices]
+        if missing_projs:
+            raise ValueError(
+                f"Per-expert MoE weights at {prefix!r} are missing "
+                f"projection families: {missing_projs}."
+            )
+        gate_indices = proj_to_indices["gate_proj"]
+        expected = set(range(len(gate_indices)))
+        if gate_indices != expected:
+            missing = sorted(expected - gate_indices)
+            extra = sorted(gate_indices - expected)
+            raise ValueError(
+                f"Per-expert MoE weights at {prefix!r} have "
+                f"non-contiguous gate_proj indices: missing={missing}, "
+                f"unexpected={extra}."
+            )
+        for proj in ("up_proj", "down_proj"):
+            if proj_to_indices[proj] != gate_indices:
+                missing = sorted(gate_indices - proj_to_indices[proj])
+                extra = sorted(proj_to_indices[proj] - gate_indices)
+                raise ValueError(
+                    f"Per-expert MoE weights at {prefix!r} have "
+                    f"mismatched {proj} indices vs gate_proj: "
+                    f"missing={missing}, unexpected={extra}."
+                )
+        # Walk: pop per-expert tensors in order, stack, and emit the combined
+        # form upstream sanitize already handles.
+        gates, ups, downs = [], [], []
+        for e in range(len(gate_indices)):
+            gates.append(new_weights.pop(f"{prefix}.{e}.gate_proj.weight"))
+            ups.append(new_weights.pop(f"{prefix}.{e}.up_proj.weight"))
+            downs.append(new_weights.pop(f"{prefix}.{e}.down_proj.weight"))
+        new_weights[f"{prefix}.gate_up_proj"] = mx.concatenate(
+            [mx.stack(gates), mx.stack(ups)], axis=-2
+        )
+        new_weights[f"{prefix}.down_proj"] = mx.stack(downs)
+    return new_weights
+
+
 def _patch_mlx_lm_qwen35_fp8_sanitize() -> None:
     """Teach mlx_lm's Qwen3.5 loaders to consume local FP8 ``weight_scale_inv``.
 
@@ -177,22 +272,43 @@ def _patch_mlx_lm_qwen35_fp8_sanitize() -> None:
         )
         return
 
-    def _patch_model_sanitize(model_cls) -> bool:
-        return _wrap_model_sanitize(
-            model_cls,
-            "_vllm_metal_qwen35_fp8_patch",
-            lambda _self, weights: _dequantize_qwen35_fp8_weights(weights, mx),
-        )
+    # qwen3_5 (dense) checkpoints only need FP8 dequant — they have no expert
+    # tensors to stack. Keep the dense patch narrow.
+    def _transform_dense(_self, weights):
+        return _dequantize_qwen35_fp8_weights(weights, mx)
+
+    # qwen3_5_moe (Qwen-org Qwen3.6-MoE FP8) needs FP8 dequant followed by
+    # per-expert stacking. The stacking step is the temporary downstream
+    # complement to ml-explore/mlx-lm#1224 and short-circuits when no
+    # per-expert keys are present.
+    def _transform_moe(_self, weights):
+        weights = _dequantize_qwen35_fp8_weights(weights, mx)
+        weights = _stack_qwen36_moe_per_expert_weights(weights, mx)
+        return weights
+
+    transforms_by_module: dict[str, Any] = {
+        "mlx_lm.models.qwen3_5": _transform_dense,
+        "mlx_lm.models.qwen3_5_moe": _transform_moe,
+    }
 
     patched_modules = []
     unpatchable_modules = []
     for module in model_modules:
+        short_name = module.__name__.rsplit(".", maxsplit=1)[-1]
         model_cls = getattr(module, "Model", None)
         if model_cls is None:
-            unpatchable_modules.append(module.__name__.rsplit(".", maxsplit=1)[-1])
+            unpatchable_modules.append(short_name)
             continue
-        if _patch_model_sanitize(model_cls):
-            patched_modules.append(module.__name__.rsplit(".", maxsplit=1)[-1])
+        transform = transforms_by_module.get(module.__name__)
+        if transform is None:
+            unpatchable_modules.append(short_name)
+            continue
+        if _wrap_model_sanitize(
+            model_cls,
+            "_vllm_metal_qwen35_fp8_patch",
+            transform,
+        ):
+            patched_modules.append(short_name)
     if patched_modules:
         logger.debug(
             "Patched mlx_lm %s FP8 sanitize compatibility",
```
