
Drop redundant lm_head AWQ quant triple in load_model#1247

Open
scyyh11 wants to merge 1 commit into ml-explore:main from scyyh11:fix-awq-tied-lm-head-drop

Conversation

@scyyh11 scyyh11 commented May 6, 2026

Problem

load_model feeds an AWQ/GPTQ checkpoint's *lm_head.{qweight,qzeros,scales} keys directly into _transform_awq_weights, which produces *lm_head.{weight,scales,biases}. When the source model is tied (lm_head is not a real parameter), AutoAWQ may still ship a redundant quant triple, and model.load_weights(strict=True) then rejects with "Received parameters not in model".
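
A minimal sketch of the failing flow (key names as above; Ellipsis placeholders stand in for real tensors):

```python
# Hypothetical AWQ checkpoint for a tied-embedding model: AutoAWQ shipped
# a quant triple for lm_head even though lm_head aliases embed_tokens and
# the constructed model has no lm_head parameter.
weights = {
    "model.embed_tokens.weight": ...,  # placeholder tensor
    "lm_head.qweight": ...,            # redundant triple
    "lm_head.qzeros": ...,
    "lm_head.scales": ...,
}
# _transform_awq_weights turns the triple into lm_head.{weight,scales,biases};
# model.load_weights(list(weights.items()), strict=True) then raises a
# ValueError along the lines of "Received parameters not in model".
```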

The same failure surfaces under sanitizer prefix patterns:

  • Top-level lm_head.qweight becomes language_model.lm_head.qweight (Qwen2-VL/Qwen3-VL prefix every key under language_model.).
  • lm_head.qweight becomes model.lm_head.qweight (gpt2/gpt_neox re-namespace every key under model.).
  • model.language_model.lm_head.qweight becomes language_model.model.lm_head.qweight (gemma4/qwen3_5 nested text wrapper).

Fix

Add _maybe_drop_redundant_lm_head_awq_triple and call it in load_model between model.sanitize and the AWQ/GPTQ transform. Walk the constructed model's parameter tree to find prefixes at which an *lm_head.weight parameter target exists, and for any input *lm_head.{qweight,qzeros,scales} whose .weight target is absent, decide drop vs preserve via two model-level questions:

  1. Walk the candidate's outer prefix (with one level of sanitizer-renamed fallback via _owner_aliases: language_model.model → language_model, model → "") to find the governing submodel. The extra model. segment is a sanitize-level renaming, not a new tying boundary.
  2. At each level, consult the config fragment that drove the submodel's constructor (top-level config for ""/model, config["text_config"] for language_model*); fall back to the submodel's ModelArgs.tie_word_embeddings only when config is silent. The field's actual bool value is honoured, so a defaulted-untied field such as qwen3_5.TextModelArgs.tie_word_embeddings = False is not treated as an authoritative tied signal just because it exists.

Drop only on an authoritative tied signal; otherwise preserve and let the strict load fail loudly. Architectures whose sanitize decides tying from weight content (gemma3_text, recurrent_gemma), or whose ModelArgs lacks the field (gpt2, gpt_neox), leave a missing target that is ambiguous between a tied checkpoint with a redundant triple and an untied checkpoint with a real quantized output head. Without a config signal we cannot disambiguate, and silently dropping the latter would produce wrong logits.
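
A hedged sketch of that decision, assuming a submodels mapping from owner prefix to constructed submodule (all names here are illustrative; the PR's helper is _maybe_drop_redundant_lm_head_awq_triple and may be shaped differently):

```python
# Illustrative sketch only, not the code as merged.
_OWNER_ALIASES = {"language_model.model": "language_model", "model": ""}

def _config_fragment(owner: str, config: dict) -> dict:
    # The fragment that drove this submodel's constructor: text_config
    # for language_model* owners, the top-level config otherwise.
    if owner.startswith("language_model"):
        return config.get("text_config", {})
    return config

def _should_drop(owner: str, config: dict, submodels: dict) -> bool:
    """Return True only on an authoritative tied signal for `owner`."""
    for candidate in (owner, _OWNER_ALIASES.get(owner)):
        if candidate is None or candidate not in submodels:
            continue
        fragment = _config_fragment(candidate, config)
        if "tie_word_embeddings" in fragment:
            return bool(fragment["tie_word_embeddings"])
        # Config silent: fall back to the submodel's ModelArgs, honouring
        # the field's actual value (a defaulted False, as in qwen3_5's
        # TextModelArgs, is an untied signal, not a tied one).
        args = getattr(submodels[candidate], "args", None)
        value = getattr(args, "tie_word_embeddings", None)
        if value is not None:
            return bool(value)
    # No signal at any level (gpt2/gpt_neox, gemma3_text,
    # recurrent_gemma): preserve and let the strict load fail loudly.
    return False
```

With this shape, the Qwen2-VL conflict case (wrapper tie_word_embeddings: false, text_config.tie_word_embeddings: true) resolves at the text_config level and drops, while a silent gpt2 config walks both levels, finds no field anywhere, and preserves.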

This is architecture-agnostic and avoids touching every model's sanitize (29 currently). No publicly distributed Qwen/Llama/Mistral AWQ release on HF currently quantizes lm_head; this is defense-in-depth for third-party AWQ producers, and matches the failure mode that surfaced when integrating AutoAWQ checkpoints in vllm-metal.

Tests

16 unit tests in tests/test_utils.py covering:

  • Tied/untied Qwen2 with explicit and defaulted tie_word_embeddings.
  • Tied/untied Qwen2-VL with language_model.-prefixed keys; wrapper-level tie_word_embeddings: false + text_config.tie_word_embeddings: true resolves correctly to drop.
  • gpt2 with model.-prefixed keys: explicit-tied drops, silent config preserves (loud fail).
  • Tied/untied gemma4 with language_model.model.-prefixed keys.
  • Default-untied qwen3_5 (TextModelArgs.tie_word_embeddings = False default) with silent config: preserves, does not silently drop a real quantized output head.
  • gemma3_text and recurrent_gemma across explicit-tied / silent / explicit-untied configs.

python -m unittest discover -s tests -p "test_utils.py" → 22 passed.
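
For flavour, a self-contained miniature in the style of the tied-Qwen2 case above; find_redundant_triple is a hypothetical stand-in for the detection step, not the PR's helper:

```python
import unittest

def find_redundant_triple(target_prefixes, weights):
    # Stand-in for the detection step of
    # _maybe_drop_redundant_lm_head_awq_triple (hypothetical shape).
    return sorted(
        key for key in weights
        if key.rpartition(".")[2] in ("qweight", "qzeros", "scales")
        and key.rpartition(".")[0].endswith("lm_head")
        and key.rpartition(".")[0][: -len("lm_head")] not in target_prefixes
    )

class TestTiedLmHeadDrop(unittest.TestCase):
    def test_tied_qwen2_style_checkpoint_flags_triple(self):
        weights = {
            "model.embed_tokens.weight": ...,
            "lm_head.qweight": ..., "lm_head.qzeros": ..., "lm_head.scales": ...,
        }
        # Tied model: no lm_head.weight parameter exists at any prefix.
        self.assertEqual(
            find_redundant_triple(set(), weights),
            ["lm_head.qweight", "lm_head.qzeros", "lm_head.scales"],
        )

    def test_untied_model_keeps_its_real_quantized_head(self):
        weights = {"lm_head.qweight": ..., "lm_head.qzeros": ..., "lm_head.scales": ...}
        self.assertEqual(find_redundant_triple({""}, weights), [])

if __name__ == "__main__":
    unittest.main()
```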

AutoAWQ checkpoints can quantize lm_head even on tied-embedding models,
where lm_head aliases embed_tokens and is not a real parameter. Each
model's `sanitize` drops `lm_head.weight` for tied embeddings but is
unaware of the AWQ quant triple, so `_transform_awq_weights` rounds it
into `lm_head.{weight,scales,biases}` and `model.load_weights(...,
strict=True)` rejects with "Received parameters not in model".

Add `_maybe_drop_redundant_lm_head_awq_triple` and call it in
`load_model` between sanitize and the AWQ/GPTQ transform. Detection
walks the constructed model's parameter tree to find prefixes at
which an `*lm_head.weight` parameter target exists, and considers any
input `*lm_head.{qweight,qzeros,scales}` whose prefix is absent as a
candidate for dropping.
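
A hedged sketch of that detection walk (illustrative names, not the
merged code):

```python
from mlx.utils import tree_flatten

def _lm_head_weight_prefixes(model):
    # Prefixes at which the constructed model has a real lm_head.weight.
    return {
        key[: -len("lm_head.weight")]
        for key, _ in tree_flatten(model.parameters())
        if key.endswith("lm_head.weight")
    }

def _candidate_triples(model, weights):
    targets = _lm_head_weight_prefixes(model)
    for key in weights:
        stem, _, leaf = key.rpartition(".")
        if leaf in ("qweight", "qzeros", "scales") and (
            stem == "lm_head" or stem.endswith(".lm_head")
        ):
            prefix = stem[: -len("lm_head")]
            if prefix not in targets:
                yield key  # dropping is still gated on the tying check
```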

The drop decision asks two model-level questions instead of branching
on each known sanitizer prefix:

1. Walk the candidate's outer prefix (and one level of sanitizer-
   renamed fallback via `_owner_aliases`) to find the governing
   submodel. `language_model.model` falls back to `language_model`
   (gemma4-style nested wrapper); `model` falls back to `""`
   (gpt2/gpt_neox transparent wrapper). The extra `model.` segment
   is a sanitize-level renaming, not a new tying boundary.
2. At each level, consult the config fragment that drove the
   submodel's constructor (top-level for `""`/`model`,
   `config["text_config"]` for `language_model`/
   `language_model.model`); fall back to the submodel's
   `ModelArgs.tie_word_embeddings` only when config is silent. The
   field's actual value is honoured (True or False), not just its
   presence — a defaulted-False field (e.g.,
   `qwen3_5.TextModelArgs.tie_word_embeddings = False`) does not
   get treated as authoritative tied just because the field exists.

Drop only on an authoritative tied signal; otherwise preserve the
triple. Architectures whose `sanitize`
decides tying from weight content (gemma3_text, recurrent_gemma) or
whose `ModelArgs` lacks the field (gpt2, gpt_neox) leave a missing
`lm_head.weight` target that is ambiguous between a tied checkpoint
with a redundant triple and an untied checkpoint with a real
quantized output head; without a config signal we cannot
distinguish, and silently dropping the latter would produce wrong
logits — strict load failing loudly is preferable.

Grounding the check in the parameter tree (rather than raw config
keys alone) covers two cases the config alone cannot:

* `tie_word_embeddings` defaulted to True by `ModelArgs` (Llama,
  Qwen2, etc.) when the checkpoint config omits the field.
* Multimodal sanitizers that prefix text weights to `language_model.*`
  before this code runs, turning a top-level `lm_head.qweight` into
  `language_model.lm_head.qweight` (or
  `language_model.model.lm_head.qweight` for gemma4/qwen3_5).

This is architecture-agnostic and avoids touching every model's
`sanitize` (29 currently). No publicly distributed Qwen/Llama/Mistral
AWQ release on HF currently quantizes `lm_head`; this is defense in
depth for third-party AWQ producers and matches the failure mode that
surfaced when integrating AutoAWQ checkpoints in vllm-metal.

Tests cover: tied/untied Qwen2 with explicit and defaulted
`tie_word_embeddings`; tied/untied Qwen2-VL with prefixed keys, plus
the wrapper-vs-text_config conflict; gpt2 with `model.`-prefixed keys
under explicit-tied / silent configs; tied/untied gemma4 with
`language_model.model.`-prefixed keys; default-untied qwen3_5 with
`language_model.model.`-prefixed keys under silent config (must
preserve, not silently drop a real output head); gemma3_text +
recurrent_gemma across explicit-tied / silent / explicit-untied
configs.

Signed-off-by: Bvicii <yizhanhuang2002@gmail.com>
