
Drop redundant lm_head AWQ quant triple in load_model#1247

Open
scyyh11 wants to merge 1 commit into ml-explore:main from scyyh11:fix-awq-tied-lm-head-drop

Conversation

@scyyh11 scyyh11 commented May 6, 2026

Problem

load_model feeds an AWQ/GPTQ checkpoint's *lm_head.{qweight,qzeros,scales} keys directly into _transform_awq_weights, which produces *lm_head.{weight,scales,biases}. When the source model is tied (lm_head is not a real parameter), AutoAWQ may still ship a redundant quant triple, and model.load_weights(strict=True) then rejects with "Received parameters not in model".
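
A minimal sketch of the failing flow (key names as above; Ellipsis placeholders stand in for real tensors):

```python
# Hypothetical AWQ checkpoint for a tied-embedding model: AutoAWQ shipped
# a quant triple for lm_head even though lm_head aliases embed_tokens and
# the constructed model has no lm_head parameter.
weights = {
    "model.embed_tokens.weight": ...,  # placeholder tensor
    "lm_head.qweight": ...,            # redundant triple
    "lm_head.qzeros": ...,
    "lm_head.scales": ...,
}
# _transform_awq_weights turns the triple into lm_head.{weight,scales,biases};
# model.load_weights(list(weights.items()), strict=True) then raises a
# ValueError along the lines of "Received parameters not in model".
```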

The same failure surfaces under sanitizer prefix patterns:

  • Top-level lm_head.qweight becomes language_model.lm_head.qweight (Qwen2-VL/Qwen3-VL prefix every key under language_model.).
  • lm_head.qweight becomes model.lm_head.qweight (gpt2/gpt_neox re-namespace every key under model.).
  • model.language_model.lm_head.qweight becomes language_model.model.lm_head.qweight (gemma4/qwen3_5 nested text wrapper).

Fix

Add _maybe_drop_redundant_lm_head_awq_triple and call it in load_model between model.sanitize and the AWQ/GPTQ transform. Walk the constructed model's parameter tree to find prefixes at which an *lm_head.weight parameter target exists, and for any input *lm_head.{qweight,qzeros,scales} whose .weight target is absent, decide drop vs preserve via two model-level questions:

  1. Walk the candidate's outer prefix (with one level of sanitizer-renamed fallback via _owner_aliases: language_model.model → language_model, model → "") to find the governing submodel. The extra model. segment is a sanitize-level renaming, not a new tying boundary.
  2. At each level, consult the config fragment that drove the submodel's constructor (top-level config for ""/model, config["text_config"] for language_model*); fall back to the submodel's ModelArgs.tie_word_embeddings only when config is silent. The field's actual bool value is honoured, so a defaulted-untied field such as qwen3_5.TextModelArgs.tie_word_embeddings = False is not treated as an authoritative tied signal just because it exists.

Drop only on an authoritative tied signal; otherwise preserve and let the strict load fail loudly. Architectures whose sanitize decides tying from weight content (gemma3_text, recurrent_gemma), or whose ModelArgs lacks the field (gpt2, gpt_neox), leave a missing target that is ambiguous between a tied checkpoint with a redundant triple and an untied checkpoint with a real quantized output head. Without a config signal we cannot disambiguate, and silently dropping the latter would produce wrong logits.
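
A hedged sketch of that decision, assuming a submodels mapping from owner prefix to constructed submodule (all names here are illustrative; the PR's helper is _maybe_drop_redundant_lm_head_awq_triple and may be shaped differently):

```python
# Illustrative sketch only, not the code as merged.
_OWNER_ALIASES = {"language_model.model": "language_model", "model": ""}

def _config_fragment(owner: str, config: dict) -> dict:
    # The fragment that drove this submodel's constructor: text_config
    # for language_model* owners, the top-level config otherwise.
    if owner.startswith("language_model"):
        return config.get("text_config", {})
    return config

def _should_drop(owner: str, config: dict, submodels: dict) -> bool:
    """Return True only on an authoritative tied signal for `owner`."""
    for candidate in (owner, _OWNER_ALIASES.get(owner)):
        if candidate is None or candidate not in submodels:
            continue
        fragment = _config_fragment(candidate, config)
        if "tie_word_embeddings" in fragment:
            return bool(fragment["tie_word_embeddings"])
        # Config silent: fall back to the submodel's ModelArgs, honouring
        # the field's actual value (a defaulted False, as in qwen3_5's
        # TextModelArgs, is an untied signal, not a tied one).
        args = getattr(submodels[candidate], "args", None)
        value = getattr(args, "tie_word_embeddings", None)
        if value is not None:
            return bool(value)
    # No signal at any level (gpt2/gpt_neox, gemma3_text,
    # recurrent_gemma): preserve and let the strict load fail loudly.
    return False
```

With this shape, the Qwen2-VL conflict case (wrapper tie_word_embeddings: false, text_config.tie_word_embeddings: true) resolves at the text_config level and drops, while a silent gpt2 config walks both levels, finds no field anywhere, and preserves.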

This is architecture-agnostic and avoids touching every model's sanitize (29 currently). No publicly distributed Qwen/Llama/Mistral AWQ release on HF currently quantizes lm_head; this is defense-in-depth for third-party AWQ producers, and matches the failure mode that surfaced when integrating AutoAWQ checkpoints in vllm-metal.

Tests

16 unit tests in tests/test_utils.py covering:

  • Tied/untied Qwen2 with explicit and defaulted tie_word_embeddings.
  • Tied/untied Qwen2-VL with language_model.-prefixed keys; wrapper-level tie_word_embeddings: false + text_config.tie_word_embeddings: true resolves correctly to drop.
  • gpt2 with model.-prefixed keys: explicit-tied drops, silent config preserves (loud fail).
  • Tied/untied gemma4 with language_model.model.-prefixed keys.
  • Default-untied qwen3_5 (TextModelArgs.tie_word_embeddings = False default) with silent config: preserves, does not silently drop a real quantized output head.
  • gemma3_text and recurrent_gemma across explicit-tied / silent / explicit-untied configs.

python -m unittest discover -s tests -p "test_utils.py" → 22 passed.
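
For flavour, a self-contained miniature in the style of the tied-Qwen2 case above; find_redundant_triple is a hypothetical stand-in for the detection step, not the PR's helper:

```python
import unittest

def find_redundant_triple(target_prefixes, weights):
    # Stand-in for the detection step of
    # _maybe_drop_redundant_lm_head_awq_triple (hypothetical shape).
    return sorted(
        key for key in weights
        if key.rpartition(".")[2] in ("qweight", "qzeros", "scales")
        and key.rpartition(".")[0].endswith("lm_head")
        and key.rpartition(".")[0][: -len("lm_head")] not in target_prefixes
    )

class TestTiedLmHeadDrop(unittest.TestCase):
    def test_tied_qwen2_style_checkpoint_flags_triple(self):
        weights = {
            "model.embed_tokens.weight": ...,
            "lm_head.qweight": ..., "lm_head.qzeros": ..., "lm_head.scales": ...,
        }
        # Tied model: no lm_head.weight parameter exists at any prefix.
        self.assertEqual(
            find_redundant_triple(set(), weights),
            ["lm_head.qweight", "lm_head.qzeros", "lm_head.scales"],
        )

    def test_untied_model_keeps_its_real_quantized_head(self):
        weights = {"lm_head.qweight": ..., "lm_head.qzeros": ..., "lm_head.scales": ...}
        self.assertEqual(find_redundant_triple({""}, weights), [])

if __name__ == "__main__":
    unittest.main()
```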

AutoAWQ checkpoints can quantize lm_head even on tied-embedding models,
where lm_head aliases embed_tokens and is not a real parameter. Each
model's `sanitize` drops `lm_head.weight` for tied embeddings but is
unaware of the AWQ quant triple, so `_transform_awq_weights` rounds it
into `lm_head.{weight,scales,biases}` and `model.load_weights(...,
strict=True)` rejects with "Received parameters not in model".

Add `_maybe_drop_redundant_lm_head_awq_triple` and call it in
`load_model` between sanitize and the AWQ/GPTQ transform. Detection
walks the constructed model's parameter tree to find prefixes at
which an `*lm_head.weight` parameter target exists, and considers any
input `*lm_head.{qweight,qzeros,scales}` whose prefix is absent as a
candidate for dropping.
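
A hedged sketch of that detection walk (illustrative names, not the
merged code):

```python
from mlx.utils import tree_flatten

def _lm_head_weight_prefixes(model):
    # Prefixes at which the constructed model has a real lm_head.weight.
    return {
        key[: -len("lm_head.weight")]
        for key, _ in tree_flatten(model.parameters())
        if key.endswith("lm_head.weight")
    }

def _candidate_triples(model, weights):
    targets = _lm_head_weight_prefixes(model)
    for key in weights:
        stem, _, leaf = key.rpartition(".")
        if leaf in ("qweight", "qzeros", "scales") and (
            stem == "lm_head" or stem.endswith(".lm_head")
        ):
            prefix = stem[: -len("lm_head")]
            if prefix not in targets:
                yield key  # dropping is still gated on the tying check
```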

The drop decision asks two model-level questions instead of branching
on each known sanitizer prefix:

1. Walk the candidate's outer prefix (and one level of sanitizer-
   renamed fallback via `_owner_aliases`) to find the governing
   submodel. `language_model.model` falls back to `language_model`
   (gemma4-style nested wrapper); `model` falls back to `""`
   (gpt2/gpt_neox transparent wrapper). The extra `model.` segment
   is a sanitize-level renaming, not a new tying boundary.
2. At each level, consult the config fragment that drove the
   submodel's constructor (top-level for `""`/`model`,
   `config["text_config"]` for `language_model`/
   `language_model.model`); fall back to the submodel's
   `ModelArgs.tie_word_embeddings` only when config is silent. The
   field's actual value is honoured (True or False), not just its
   presence — a defaulted-False field (e.g.,
   `qwen3_5.TextModelArgs.tie_word_embeddings = False`) does not
   get treated as authoritative tied just because the field exists.

Drop only on an authoritative tied signal; otherwise preserve the
triple. Architectures whose `sanitize`
decides tying from weight content (gemma3_text, recurrent_gemma) or
whose `ModelArgs` lacks the field (gpt2, gpt_neox) leave a missing
`lm_head.weight` target that is ambiguous between a tied checkpoint
with a redundant triple and an untied checkpoint with a real
quantized output head; without a config signal we cannot
distinguish, and silently dropping the latter would produce wrong
logits — strict load failing loudly is preferable.

Grounding the check in the parameter tree (rather than raw config
keys alone) covers two cases the config alone cannot:

* `tie_word_embeddings` defaulted to True by `ModelArgs` (Llama,
  Qwen2, etc.) when the checkpoint config omits the field.
* Multimodal sanitizers that prefix text weights to `language_model.*`
  before this code runs, turning a top-level `lm_head.qweight` into
  `language_model.lm_head.qweight` (or
  `language_model.model.lm_head.qweight` for gemma4/qwen3_5).

This is architecture-agnostic and avoids touching every model's
`sanitize` (29 currently). No publicly distributed Qwen/Llama/Mistral
AWQ release on HF currently quantizes `lm_head`; this is defense in
depth for third-party AWQ producers and matches the failure mode that
surfaced when integrating AutoAWQ checkpoints in vllm-metal.

Tests cover: tied/untied Qwen2 with explicit and defaulted
`tie_word_embeddings`; tied/untied Qwen2-VL with prefixed keys, plus
the wrapper-vs-text_config conflict; gpt2 with `model.`-prefixed keys
under explicit-tied / silent configs; tied/untied gemma4 with
`language_model.model.`-prefixed keys; default-untied qwen3_5 with
`language_model.model.`-prefixed keys under silent config (must
preserve, not silently drop a real output head); gemma3_text +
recurrent_gemma across explicit-tied / silent / explicit-untied
configs.

Signed-off-by: Bvicii <yizhanhuang2002@gmail.com>
