fix(embed): mark all tokens as output to suppress llama.cpp 'overriding' INFO (#2208) #2212
Open
Anai-Guo wants to merge 1 commit into abetlen:main
Conversation
…ng" INFO Force logits_all=True in Llama.embed() so per-token batch.logits[i] flags are all set, regardless of pooling type. Previously, when pooling != NONE, add_sequence flipped most tokens to logits[i]=False, and llama.cpp printed init: embeddings required but some input tokens were not marked as outputs -> overriding once per embed input and silently overrode the flags. Pooling type only changes how per-token outputs are read back in decode_batch (llama_get_embeddings vs llama_get_embeddings_seq), not whether they are produced — so this aligns the per-token flags with what llama.cpp already needed and removes the noisy per-input override message. Fixes abetlen#2208.
Summary
Fixes #2208.
In Llama.embed() (llama_cpp/llama.py:1043), logits_all was set from the pooling type, so it was True only for LLAMA_POOLING_TYPE_NONE.

Since LlamaBatch.add_sequence writes that value into the per-token batch.logits[i] flags, with pooling_type != NONE (the common case for sentence embeddings: MEAN, CLS) most tokens land with logits[i] = False and only the last is True. llama.cpp upstream then prints "init: embeddings required but some input tokens were not marked as outputs -> overriding" once per embedding input (because every Python-side embed() call decodes one sequence at a time through decode_batch) and silently overrides the flags to True.
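To make that flag layout concrete, a small self-contained illustration (a plain Python list stands in for the C-level batch.logits array; nothing here is copied from the library):

```python
# Illustration only: the per-token output-flag pattern described above.
# The real flags live in the C-level llama_batch; a plain list is used here
# just to show which positions end up marked.
def output_flags(n_tokens: int, logits_all: bool) -> list:
    # add_sequence marks every token when logits_all is True,
    # otherwise only the last token of the sequence.
    return [logits_all or i == n_tokens - 1 for i in range(n_tokens)]

print(output_flags(5, logits_all=False))  # [False, False, False, False, True]
print(output_flags(5, logits_all=True))   # [True, True, True, True, True]
```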
Fix

Force logits_all = True in Llama.embed() and add a comment explaining why.

Pooling type only affects how per-token outputs are read back in decode_batch (llama_get_embeddings for NONE vs llama_get_embeddings_seq for pooled types); it does not affect whether the per-token outputs need to be produced. llama.cpp already required all tokens to be marked as outputs in embedding mode and was overriding the flags anyway, so this just aligns the Python-side flags with what llama.cpp needs and removes the override message (printed once per embed() input).

Related upstream context: ollama/ollama#12381, mentioned in the bug report.
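A rough sketch of that read-back branch, assuming the low-level llama_cpp bindings (llama_get_embeddings, llama_get_embeddings_seq); the helper itself is hypothetical and paraphrases decode_batch rather than quoting it:

```python
import llama_cpp

def read_embeddings(ctx, n_tokens: int, n_embd: int, seq_id: int, pooling_type: int):
    # Hypothetical helper paraphrasing the branch above; ctx is a decoded
    # llama_context handle from the low-level bindings.
    if pooling_type == llama_cpp.LLAMA_POOLING_TYPE_NONE:
        # NONE pooling: one embedding per token, read from the per-token buffer.
        embd = llama_cpp.llama_get_embeddings(ctx)
        return [embd[i * n_embd:(i + 1) * n_embd] for i in range(n_tokens)]
    # Pooled types (MEAN, CLS, ...): one embedding per sequence.
    embd = llama_cpp.llama_get_embeddings_seq(ctx, seq_id)
    return embd[:n_embd]
```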
What changed
llama_cpp/llama.py: replace one line in Llama.embed() with a constant plus a 6-line comment justifying it. No other call sites of add_sequence change.

Test plan
- With Llama(..., embedding=True, pooling_type=LLAMA_POOLING_TYPE_MEAN), model.embed(["a", "b", "c"]) no longer prints "init: embeddings required ... -> overriding" per input (reproduction sketch below).
- Embedding output is unchanged for both LLAMA_POOLING_TYPE_NONE and LLAMA_POOLING_TYPE_MEAN (llama.cpp was already overriding internally).

🤖 Generated with Claude Code
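A hedged reproduction sketch of the first test-plan item; the model path is a placeholder, and the constructor arguments follow the PR text:

```python
from llama_cpp import Llama, LLAMA_POOLING_TYPE_MEAN

# Placeholder path: any embedding-capable GGUF model.
model = Llama(
    model_path="path/to/embedding-model.gguf",
    embedding=True,
    pooling_type=LLAMA_POOLING_TYPE_MEAN,
)

# With this change, llama.cpp should no longer log
# "init: embeddings required but some input tokens were not marked as outputs -> overriding"
# once per input.
vectors = model.embed(["a", "b", "c"])
print(len(vectors), len(vectors[0]))
```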