Drop redundant lm_head AWQ quant triple in load_model #1247
Open
scyyh11 wants to merge 1 commit into ml-explore:main from
Conversation
AutoAWQ checkpoints can quantize lm_head even on tied-embedding models,
where lm_head aliases embed_tokens and is not a real parameter. Each
model's `sanitize` drops `lm_head.weight` for tied embeddings but is
unaware of the AWQ quant triple, so `_transform_awq_weights` rounds it
into `lm_head.{weight,scales,biases}` and `model.load_weights(...,
strict=True)` rejects with "Received parameters not in model".
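For concreteness, a minimal illustration of the key names involved (illustration only, not mlx-lm code):

```python
# Illustration only: the redundant quant triple a third-party AutoAWQ
# export may ship for a tied-embedding model ...
redundant_triple = ["lm_head.qweight", "lm_head.qzeros", "lm_head.scales"]

# ... and the parameter names the AWQ transform turns it into. A tied
# model has no lm_head parameters at all (sanitize dropped lm_head.weight),
# so strict loading rejects these with "Received parameters not in model".
transformed = ["lm_head.weight", "lm_head.scales", "lm_head.biases"]
```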
Add `_maybe_drop_redundant_lm_head_awq_triple` and call it in
`load_model` between sanitize and the AWQ/GPTQ transform. Detection
walks the constructed model's parameter tree to find prefixes at
which an `*lm_head.weight` parameter target exists, and considers any
input `*lm_head.{qweight,qzeros,scales}` whose prefix has no such
target as a candidate for dropping.
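A simplified sketch of that detection, operating on flat name lists rather than the real module tree; the helper name and signature here are stand-ins, not the ones in the patch:

```python
AWQ_LM_HEAD_SUFFIXES = ("lm_head.qweight", "lm_head.qzeros", "lm_head.scales")

def candidate_lm_head_prefixes(model_param_names, checkpoint_keys):
    """Prefixes whose lm_head AWQ triple has no matching lm_head.weight
    parameter in the constructed model (illustrative stand-in only)."""
    # Prefixes at which the model really owns an lm_head.weight parameter.
    real_heads = {
        name[: -len("lm_head.weight")].rstrip(".")
        for name in model_param_names
        if name.endswith("lm_head.weight")
    }
    candidates = set()
    for key in checkpoint_keys:
        for suffix in AWQ_LM_HEAD_SUFFIXES:
            if key.endswith(suffix):
                prefix = key[: -len(suffix)].rstrip(".")
                if prefix not in real_heads:
                    candidates.add(prefix)
    return candidates

# Example: a tied model exposes no lm_head.weight, so a top-level triple
# (prefix "") is flagged; an untied model is left alone.
assert candidate_lm_head_prefixes(
    ["model.embed_tokens.weight"], ["lm_head.qweight"]) == {""}
assert candidate_lm_head_prefixes(
    ["lm_head.weight"], ["lm_head.qweight"]) == set()
```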
The drop decision asks two model-level questions instead of branching
on each known sanitizer prefix (sketched below):
1. Walk the candidate's outer prefix (and one level of sanitizer-
renamed fallback via `_owner_aliases`) to find the governing
submodel. `language_model.model` falls back to `language_model`
(gemma4-style nested wrapper); `model` falls back to `""`
(gpt2/gpt_neox transparent wrapper). The extra `model.` segment
is a sanitize-level renaming, not a new tying boundary.
2. At each level, consult the config fragment that drove the
submodel's constructor (top-level for `""`/`model`,
`config["text_config"]` for `language_model`/
`language_model.model`); fall back to the submodel's
`ModelArgs.tie_word_embeddings` only when config is silent. The
field's actual value is honoured (True or False), not just its
presence — a defaulted-False field (e.g.,
`qwen3_5.TextModelArgs.tie_word_embeddings = False`) does not
get treated as authoritative tied just because the field exists.
Otherwise preserve the triple. Architectures whose `sanitize`
decides tying from weight content (gemma3_text, recurrent_gemma) or
whose `ModelArgs` lacks the field (gpt2, gpt_neox) leave a missing
`lm_head.weight` target that is ambiguous between a tied checkpoint
with a redundant triple and an untied checkpoint with a real
quantized output head; without a config signal we cannot
distinguish, and silently dropping the latter would produce wrong
logits — strict load failing loudly is preferable.
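A minimal sketch of that decision over plain dicts. `_owner_aliases` exists in the patch, but the function shape below and the way the ModelArgs default is passed in are assumptions made for illustration:

```python
# Illustrative sketch, not the patch: decide drop vs preserve for a
# candidate prefix using only authoritative "tied" signals.
OWNER_ALIASES = {"language_model.model": "language_model", "model": ""}

def config_fragment(owner, config):
    # Config fragment that drove the governing submodel's constructor.
    if owner in ("language_model", "language_model.model"):
        return config.get("text_config", {})
    return config  # top-level config for "" and "model"

def should_drop_triple(prefix, config, model_args_tie=None):
    """True only on an authoritative tied signal; otherwise preserve the
    triple and let the strict load fail loudly (assumed helper shape)."""
    owners = [prefix]
    if prefix in OWNER_ALIASES:
        owners.append(OWNER_ALIASES[prefix])  # one level of renamed fallback
    for owner in owners:
        fragment = config_fragment(owner, config)
        if "tie_word_embeddings" in fragment:
            # Honour the actual value: an explicit False means a real,
            # untied output head, so the triple must stay.
            return bool(fragment["tie_word_embeddings"])
    if model_args_tie is not None:
        # The ModelArgs default is consulted only when config is silent.
        return bool(model_args_tie)
    return False  # ambiguous: preserve; strict load will surface it

# Tied Qwen2-VL: text_config carries the signal under a language_model prefix.
assert should_drop_triple("language_model",
                          {"text_config": {"tie_word_embeddings": True}})
# Silent config + qwen3_5-style defaulted-False ModelArgs: preserve.
assert not should_drop_triple("language_model.model", {}, model_args_tie=False)
```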
Grounding the check in the parameter tree (rather than raw config
keys alone) covers two cases the config alone cannot (the second is
illustrated in the sketch below):
* `tie_word_embeddings` defaulted to True by `ModelArgs` (Llama,
Qwen2, etc.) when the checkpoint config omits the field.
* Multimodal sanitizers that prefix text weights to `language_model.*`
before this code runs, turning a top-level `lm_head.qweight` into
`language_model.lm_head.qweight`
(or `language_model.model.lm_head.qweight` for gemma4/qwen3_5).
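A tiny illustration of the second case; the prefixing below mimics what a multimodal sanitize pass does to checkpoint keys and is not the sanitizer code itself:

```python
# Illustration only: a VLM sanitize pass re-namespaces text weights under
# language_model.*, so the redundant triple is no longer top-level.
raw_keys = ["lm_head.qweight", "lm_head.qzeros", "lm_head.scales"]
sanitized_keys = [f"language_model.{k}" for k in raw_keys]
# A check keyed on the literal top-level name, or on top-level config alone,
# would miss these; the parameter-tree walk still recovers the prefix.
assert sanitized_keys[0] == "language_model.lm_head.qweight"
```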
This is architecture-agnostic and avoids touching every model's
`sanitize` (29 currently). No publicly distributed Qwen/Llama/Mistral
AWQ release on HF currently quantizes `lm_head`; this is defense in
depth for third-party AWQ producers and matches the failure mode that
surfaced when integrating AutoAWQ checkpoints in vllm-metal.
Tests cover: tied/untied Qwen2 with explicit and defaulted
`tie_word_embeddings`; tied/untied Qwen2-VL with prefixed keys, plus
the wrapper-vs-text_config conflict; gpt2 with `model.`-prefixed keys
under explicit-tied / silent configs; tied/untied gemma4 with
`language_model.model.`-prefixed keys; default-untied qwen3_5 with
`language_model.model.`-prefixed keys under silent config (must
preserve, not silently drop a real output head); gemma3_text +
recurrent_gemma across explicit-tied / silent / explicit-untied
configs.
Signed-off-by: Bvicii <yizhanhuang2002@gmail.com>
Problem
`load_model` feeds AWQ/GPTQ checkpoints' `*lm_head.{qweight,qzeros,scales}` keys directly into `_transform_awq_weights`, which produces `*lm_head.{weight,scales,biases}`. When the source model is tied (`lm_head` is not a real parameter), AutoAWQ may still ship a redundant quant triple, and `model.load_weights(strict=True)` then rejects with `Received parameters not in model`. The same failure surfaces under sanitizer prefix patterns:

* `lm_head.qweight` becomes `language_model.lm_head.qweight` (Qwen2-VL/Qwen3-VL prefix every key under `language_model.`).
* `lm_head.qweight` becomes `model.lm_head.qweight` (gpt2/gpt_neox re-namespace every key under `model.`).
* `model.language_model.lm_head.qweight` becomes `language_model.model.lm_head.qweight` (gemma4/qwen3_5 nested text wrapper).

Fix
Add `_maybe_drop_redundant_lm_head_awq_triple` and call it in `load_model` between `model.sanitize` and the AWQ/GPTQ transform. Walk the constructed model's parameter tree to find prefixes at which an `*lm_head.weight` parameter target exists, and for any input `*lm_head.{qweight,qzeros,scales}` whose `.weight` target is absent, decide drop vs preserve via two model-level questions:

1. Walk the candidate's outer prefix, with one level of sanitizer-renamed fallback (`_owner_aliases`: `language_model.model` → `language_model`, `model` → `""`), to find the governing submodel. The extra `model.` segment is a sanitize-level renaming, not a new tying boundary.
2. At each level, consult the config fragment that drove the submodel's constructor (top-level `config` for `""`/`model`, `config["text_config"]` for `language_model*`); fall back to the submodel's `ModelArgs.tie_word_embeddings` only when config is silent. The field's actual bool value is honoured (so `qwen3_5.TextModelArgs.tie_word_embeddings = False` defaulted-untied is not treated as authoritative tied just because the field exists).

Drop only on an authoritative tied signal; otherwise preserve and let strict load fail loudly. Architectures whose `sanitize` decides tying from weight content (gemma3_text, recurrent_gemma) or whose `ModelArgs` lacks the field (gpt2, gpt_neox) leave a missing target that is ambiguous between a tied checkpoint with a redundant triple and an untied checkpoint with a real quantized output head; without a config signal we cannot disambiguate, and silently dropping the latter would produce wrong logits.

This is architecture-agnostic and avoids touching every model's `sanitize` (29 currently). No publicly distributed Qwen/Llama/Mistral AWQ release on HF currently quantizes `lm_head`; this is defense-in-depth for third-party AWQ producers, and matches the failure mode that surfaced when integrating AutoAWQ checkpoints in vllm-metal.
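For orientation, the intended ordering inside `load_model`, paraphrased as a sketch; `drop_triple` and `transform_awq` stand in for `_maybe_drop_redundant_lm_head_awq_triple` and `_transform_awq_weights`, whose real signatures are not reproduced here:

```python
# Paraphrased shape of the load_model call site, not a verbatim excerpt.
def load_model_sketch(model, config, weights, drop_triple, transform_awq):
    if hasattr(model, "sanitize"):
        weights = model.sanitize(weights)          # per-model key cleanup
    weights = drop_triple(model, config, weights)  # new: drop redundant lm_head triple
    weights = transform_awq(model, config, weights)  # AWQ/GPTQ transform
    model.load_weights(list(weights.items()), strict=True)
    return model
```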
Tests

16 unit tests in `tests/test_utils.py` covering:

* Tied/untied Qwen2 with explicit and defaulted `tie_word_embeddings`.
* Tied/untied Qwen2-VL with `language_model.`-prefixed keys; a wrapper-level `tie_word_embeddings: false` + `text_config.tie_word_embeddings: true` conflict resolves correctly to drop.
* gpt2 with `model.`-prefixed keys: explicit-tied drops, silent config preserves (loud fail).
* Tied/untied gemma4 with `language_model.model.`-prefixed keys.
* Default-untied qwen3_5 (`TextModelArgs.tie_word_embeddings = False` default) with silent config: preserves, does not silently drop a real quantized output head.
* gemma3_text + recurrent_gemma across explicit-tied / silent / explicit-untied configs.

`python -m unittest discover -s tests -p "test_utils.py"` → 22 passed.