[Qwen3.6] Stack per-expert MoE tensors during mlx_lm sanitize #312
Merged
LxYuan0420 merged 2 commits into vllm-project:main on Apr 30, 2026
Conversation
903b1f2 to 89207df
LxYuan0420 reviewed Apr 29, 2026
LxYuan0420 reviewed Apr 29, 2026
LxYuan0420 reviewed Apr 29, 2026
Qwen/Qwen3.6-35B-A3B-FP8 ships expert MLPs per-expert
(model.language_model.layers.{L}.mlp.experts.{E}.{gate,up,down}_proj.weight),
which mlx_lm.qwen3_5_moe.sanitize doesn't recognize. Loading on unpatched
mlx-lm fails strict load_weights with 30720 unexpected keys (256 experts
x 40 layers x 3 projections).
The bf16 master Qwen/Qwen3.6-35B-A3B is already pre-stacked
(experts.gate_up_proj / experts.down_proj) and loads via the existing
combined-format branch unchanged; only the FP8 release lands per-expert,
likely because Qwen's FP8 quantization pipeline runs per-expert.
Add _stack_qwen36_moe_per_expert_weights mirroring the (scan -> validate
-> walk) structure of ml-explore/mlx-lm#1224. Split the sanitize patch
into per-class transforms so the MoE-only nature of the stacking is
self-evident:
- mlx_lm.models.qwen3_5.Model -> _transform_dense (FP8 dequant only)
- mlx_lm.models.qwen3_5_moe.Model -> _transform_moe (FP8 dequant + stack)
mlx-community publishes pre-stacked redistributions, but converting
Qwen-org's FP8 release ourselves needs a 35GB->70GB bf16 intermediate
that doesn't fit on Macs <=64 GB. This shim lets users load the
canonical artifact directly. Removable once vllm-metal's mlx-lm pin
bumps past mlx-lm#1224.
Files:
- vllm_metal/compat.py: add _stack_qwen36_moe_per_expert_weights helper,
split sanitize patches by model class via transforms_by_module map.
- docs/supported_models.md: update Qwen3.6 row note + link this PR.
- tests/test_compat.py: 4 new unit tests covering positive (per-expert
-> combined), regression (pre-stacked no-op), defensive (gap raises),
and architecture invariant (dense path doesn't run MoE helper).
Verified end-to-end on Apple Silicon Metal: Qwen/Qwen3.6-35B-A3B-FP8
generates correctly via the new branch, Qwen/Qwen3.6-35B-A3B (bf16) via
the unchanged combined-format branch (both shims dormant - confirms the
new branch is properly gated). Existing Qwen3.5 golden-token smoke
(test_qwen35_smoke.py) unchanged: 5/5 pass.
Signed-off-by: Shivendra Dayal <sdayal@gmail.com>
89207df to 1a9622e
Contributor
Author
Thanks for the review. Pushed
Pre-submit clean: End-to-end re-verified on
LxYuan0420 reviewed Apr 30, 2026
Scan for gate_proj/up_proj/down_proj index sets together and raise a named ValueError on missing-family or mismatched-index cases, instead of leaking a KeyError from dict.pop during the walk step. Adds a focused test covering both flavors. Signed-off-by: Shivendra Dayal <sdayal@gmail.com>
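For illustration, a minimal sketch of what that scan-and-validate step could look like; the regex, helper name, and return behaviour are assumptions inferred from this commit message and the key pattern quoted in the PR description, not the actual compat.py code.

```python
import re
from collections import defaultdict

# Hypothetical re-implementation of the scan/validate step described above.
_PER_EXPERT = re.compile(
    r"\.layers\.(\d+)\.mlp\.experts\.(\d+)\.(gate_proj|up_proj|down_proj)\.weight$"
)
_FAMILIES = {"gate_proj", "up_proj", "down_proj"}


def validate_per_expert_indices(weights: dict) -> None:
    """Raise a named ValueError up front instead of leaking KeyError from dict.pop later."""
    seen: dict[int, dict[str, set[int]]] = defaultdict(lambda: defaultdict(set))
    for key in weights:
        m = _PER_EXPERT.search(key)
        if m:
            layer, expert, family = int(m[1]), int(m[2]), m[3]
            seen[layer][family].add(expert)
    for layer, by_family in seen.items():
        missing = _FAMILIES - by_family.keys()
        if missing:
            raise ValueError(f"layer {layer}: missing projection families {sorted(missing)}")
        index_sets = {frozenset(v) for v in by_family.values()}
        expected = frozenset(range(len(by_family["gate_proj"])))
        if index_sets != {expected}:
            raise ValueError(f"layer {layer}: expert indices are not a matching, contiguous 0..N-1 set")
```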
6caa176 to 9e721c2
LxYuan0420 approved these changes Apr 30, 2026
Collaborator LxYuan0420 left a comment
LGTM; TODO: remove it once mlx-lm#1224 lands in the pinned upstream version
Tracking issue: #289
Summary
Qwen/Qwen3.6-35B-A3B-FP8 ships expert MLPs as one tensor per expert per projection
(model.language_model.layers.{L}.mlp.experts.{E}.{gate,up,down}_proj.weight).
The bf16 master Qwen/Qwen3.6-35B-A3B is already pre-stacked (experts.gate_up_proj / experts.down_proj) and loads unchanged via the existing combined-format branch in mlx_lm.qwen3_5_moe.Model.sanitize; only the FP8 release lands per-expert, likely because Qwen's FP8 quantization pipeline runs per-expert and the artifact is not re-stacked.
On vllm-metal main, loading Qwen/Qwen3.6-35B-A3B-FP8 fails strict load_weights with "Received 30720 parameters not in model"; these are per-expert MoE tensors that vllm-metal's existing FP8 dequant compat doesn't address. (For reference: the same checkpoint on vanilla mlx-lm fails with 61,690 keys, the difference being 30,970 FP8 weight_scale_inv tensors that vllm-metal's compat.py::_dequantize_qwen35_fp8_weights already handles separately.)
What this PR does
Add _stack_qwen36_moe_per_expert_weights, chained after FP8 dequant in the MoE sanitize wrapper:
- Scan the per-expert keys and validate that every gate_proj/up_proj/down_proj family is present and that its expert indices form a contiguous {0, 1, …, N-1} set (raises ValueError otherwise).
- Walk the validated keys: mx.stack each family along axis 0, mx.concatenate gate+up along the intermediate-dim axis, and emit the combined experts.gate_up_proj / experts.down_proj form upstream sanitize already handles (sketched below).
Pre-stacked checkpoints are unaffected (the helper short-circuits when no per-expert keys are present).
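As a reference point, here is a minimal sketch of that walk step for a single layer, assuming plain mlx arrays, abbreviated key names, and a pre-validated contiguous index set; the actual helper in compat.py may differ in structure and in the exact combined key names it emits.

```python
import mlx.core as mx


def stack_layer_experts(weights: dict, layer: int, num_experts: int) -> None:
    """Illustrative walk step: per-expert tensors for one layer -> combined form, in place.

    Key prefix is abbreviated here; the real checkpoint keys carry the full
    model.language_model.layers.{L}.mlp.experts.{E}... path.
    """
    prefix = f"layers.{layer}.mlp.experts"

    def take(family: str) -> mx.array:
        # Pop per-expert tensors in index order and stack them along a new axis 0.
        parts = [weights.pop(f"{prefix}.{e}.{family}.weight") for e in range(num_experts)]
        return mx.stack(parts, axis=0)  # gate/up: (E, intermediate, hidden)

    gate, up, down = take("gate_proj"), take("up_proj"), take("down_proj")
    # Concatenate gate and up along the intermediate-dim axis so the result matches
    # the combined experts.gate_up_proj form that upstream sanitize already handles.
    weights[f"{prefix}.gate_up_proj.weight"] = mx.concatenate([gate, up], axis=1)
    weights[f"{prefix}.down_proj.weight"] = down
```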
The MoE-only nature of the stacking is made explicit by splitting the sanitize patch by model class:
- mlx_lm.models.qwen3_5.Model → wrapped with FP8 dequant only (_transform_dense)
- mlx_lm.models.qwen3_5_moe.Model → wrapped with FP8 dequant + per-expert stacking (_transform_moe)
Routing is driven by an explicit transforms_by_module map; future Qwen variants added without a corresponding entry get logged as unpatchable rather than silently inheriting one of the two transforms (see the sketch below).
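A rough sketch of how that routing could be wired, assuming each Model class's sanitize is monkeypatched; the transform names, module paths, and map name come from this PR, but the plumbing and the placeholder bodies below are illustrative only.

```python
import logging

logger = logging.getLogger(__name__)


def _dequantize_fp8(weights: dict) -> dict:
    return weights  # placeholder for the existing FP8 weight_scale_inv dequant shim


def _stack_per_expert(weights: dict) -> dict:
    return weights  # placeholder for _stack_qwen36_moe_per_expert_weights


def _transform_dense(weights: dict) -> dict:
    return _dequantize_fp8(weights)  # FP8 dequant only


def _transform_moe(weights: dict) -> dict:
    return _stack_per_expert(_dequantize_fp8(weights))  # FP8 dequant, then stacking


transforms_by_module = {
    "mlx_lm.models.qwen3_5.Model": _transform_dense,
    "mlx_lm.models.qwen3_5_moe.Model": _transform_moe,
}


def patch_sanitize(model_cls) -> None:
    path = f"{model_cls.__module__}.{model_cls.__qualname__}"
    transform = transforms_by_module.get(path)
    if transform is None:
        # No entry: log rather than silently inheriting one of the two transforms.
        logger.warning("unpatchable: no sanitize transform registered for %s", path)
        return
    original = model_cls.sanitize

    def sanitize(self, weights):
        # Run the compat transform first so upstream sanitize sees the combined form.
        return original(self, transform(dict(weights)))

    model_cls.sanitize = sanitize
```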
Why a vllm-metal compat shim instead of waiting for upstream
mlx-community publishes pre-stacked redistributions of this checkpoint that already load on existing mlx-lm. This shim lets users load Qwen-org's canonical FP8 artifact directly, without a 35GB→70GB bf16 intermediate conversion step that doesn't fit on memory-constrained Macs (≤64 GB).
This complements ml-explore/mlx-lm#1224, which adds the same per-expert stacking logic plus FP8 weight_scale_inv dequant for the qwen3_5 family inline in upstream sanitize. Once mlx-lm#1224 lands and vllm-metal's mlx-lm pin bumps past a release containing it, both this PR's per-expert stacking shim and the existing _dequantize_qwen35_fp8_weights shim in compat.py become removable in a follow-up cleanup.
Files
- vllm_metal/compat.py: add _stack_qwen36_moe_per_expert_weights helper; split sanitize patches into per-class transforms via transforms_by_module.
- docs/supported_models.md: update Qwen3.6 row note + link this PR.
- tests/test_compat.py: four new unit tests using the existing numpy-fake-mlx fixture (no real model weights, runs in milliseconds); one is sketched below:
  - test_per_expert_moe_tensors_stack_to_combined: positive, per-expert input produces correctly stacked combined output, content preserved per axis-0 slot.
  - test_pre_stacked_moe_is_noop_for_per_expert_helper: regression, pre-stacked input passes through unchanged (covers mlx-community redistributions and the Qwen3.6 bf16 master).
  - test_non_contiguous_per_expert_indices_raise: defensive, a malformed {0, 1, 3} checkpoint raises ValueError.
  - test_per_expert_helper_does_not_run_on_dense_qwen35: architecture invariant, the dense path doesn't run the MoE helper.
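To make the test shape concrete, here is a sketch of the defensive case, assuming the helper is importable from vllm_metal.compat and accepts a plain key-to-array dict; the real test uses the repo's numpy-fake-mlx fixture rather than this inline construction.

```python
import numpy as np
import pytest

from vllm_metal.compat import _stack_qwen36_moe_per_expert_weights  # assumed signature


def test_non_contiguous_per_expert_indices_raise():
    weights = {}
    for expert in (0, 1, 3):  # expert 2 deliberately missing -> non-contiguous index set
        for family in ("gate_proj", "up_proj", "down_proj"):
            key = f"model.language_model.layers.0.mlp.experts.{expert}.{family}.weight"
            weights[key] = np.zeros((4, 8), dtype=np.float32)
    with pytest.raises(ValueError):
        _stack_qwen36_moe_per_expert_weights(weights)
```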
Verification (per #289 pass bar)
- Qwen/Qwen3.6-35B-A3B-FP8: "The capital of France is" → " Paris, a city renowned for its iconic landmarks such"
- Qwen/Qwen3.6-35B-A3B (bf16): generates correctly via the unchanged combined-format branch.
- pytest tests/test_compat.py: 15 passed, 1 skipped (the skip is the pre-existing VLLM_METAL_RUN_REAL_MLX_FP8_TESTS=1-gated test).
- Existing Qwen3.5 golden-token smoke (test_qwen35_smoke.py): 5/5 pass, unchanged.
- bash scripts/lint.sh: clean (shellcheck, ruff check, ruff format --check, mypy).
Rebased on latest main (3323d32) and re-validated against the bumped dep stack: mlx-lm 0.31.3 (from #313), vllm 0.20.0+cpu (from #262), transformers 5.7.0.