
fix(embed): mark all tokens as output to suppress llama.cpp 'overriding' INFO (#2208) #2212

Open

Anai-Guo wants to merge 1 commit into abetlen:main from Anai-Guo:fix/embed-mark-all-tokens-as-output

Conversation

@Anai-Guo
Contributor

Summary

Fixes #2208.

In Llama.embed() (llama_cpp/llama.py:1043), logits_all was set based on pooling type:

logits_all = pooling_type == llama_cpp.LLAMA_POOLING_TYPE_NONE

Since LlamaBatch.add_sequence writes that value into the per-token batch.logits[i] flags, any pooling_type != NONE (the common case for sentence embeddings, i.e. MEAN or CLS) leaves most tokens with logits[i] = False and only the last token set to True. llama.cpp upstream then prints

init: embeddings required but some input tokens were not marked as outputs -> overriding

once per embedding input (because every Python-side embed() call decodes one sequence at a time through decode_batch) and silently overrides the flags to True.
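To make the flag pattern concrete, here is a small stand-alone illustration. It is not the library's actual add_sequence code, just a sketch of the per-token flags it ended up producing under the old logits_all value:

```python
# Sketch only: mimics how per-token output flags end up when a sequence is added.
# In the old code, logits_all was False for pooled embeddings, so only the last
# token of each input was marked as an output.
def token_output_flags(n_tokens: int, logits_all: bool) -> list[bool]:
    return [logits_all] * (n_tokens - 1) + [True]

print(token_output_flags(5, logits_all=False))  # [False, False, False, False, True]
print(token_output_flags(5, logits_all=True))   # [True, True, True, True, True]
```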

Fix

Force logits_all = True in Llama.embed() and add a comment explaining why.
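In code, the change amounts to replacing the conditional with a constant. A minimal sketch (the real patch also carries a longer explanatory comment):

```python
# Before (simplified):
#   logits_all = pooling_type == llama_cpp.LLAMA_POOLING_TYPE_NONE
# After: llama.cpp needs every input token marked as an output in embedding
# mode regardless of pooling type, so always set the flag.
logits_all = True
```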

Pooling type only affects how per-token outputs are read back in decode_batch (llama_get_embeddings for NONE vs llama_get_embeddings_seq for pooled types, sketched after the list below); it does not affect whether those per-token outputs need to be produced. llama.cpp already requires all tokens to be marked as outputs in embedding mode and was overriding the flags anyway, so this change:

  • Makes the per-token flags match what llama.cpp needs in both pooling modes.
  • Removes the noisy override INFO message (one per embed() input).
  • Is behavior-preserving — embeddings produced for the same input remain identical.
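For reference, the readback branching looks roughly like this. llama_get_embeddings and llama_get_embeddings_seq are the real low-level bindings, but the surrounding function is a simplified stand-in for decode_batch, not the library's actual code:

```python
import llama_cpp

def read_embeddings(ctx, pooling_type, seq_id, n_tokens, n_embd):
    if pooling_type == llama_cpp.LLAMA_POOLING_TYPE_NONE:
        # NONE: one embedding vector per token
        ptr = llama_cpp.llama_get_embeddings(ctx)
        return [ptr[i * n_embd:(i + 1) * n_embd] for i in range(n_tokens)]
    # MEAN/CLS/etc.: one pooled vector for the whole sequence
    ptr = llama_cpp.llama_get_embeddings_seq(ctx, seq_id)
    return ptr[:n_embd]
```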

Related upstream context: ollama/ollama#12381 mentioned in the bug report.

What changed

  • llama_cpp/llama.py: replace one line in Llama.embed() with a constant + a 6-line comment justifying it. No other call sites of add_sequence change.

Test plan

  • Construct Llama(..., embedding=True, pooling_type=LLAMA_POOLING_TYPE_MEAN).
  • Confirm model.embed(["a","b","c"]) no longer prints init: embeddings required ... -> overriding per input.
  • Confirm returned vectors are byte-identical to before for both LLAMA_POOLING_TYPE_NONE and LLAMA_POOLING_TYPE_MEAN (llama.cpp was already overriding internally).
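A minimal script for this check might look like the following; the model path is a placeholder, and the constructor arguments mirror the ones listed above:

```python
import llama_cpp

model = llama_cpp.Llama(
    model_path="./models/embedding-model.gguf",  # placeholder path
    embedding=True,
    pooling_type=llama_cpp.LLAMA_POOLING_TYPE_MEAN,
)

# With the fix, no "embeddings required but some input tokens were not marked
# as outputs -> overriding" INFO line should be printed per input.
vectors = model.embed(["a", "b", "c"])
print(len(vectors), len(vectors[0]))
```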

🤖 Generated with Claude Code

fix(embed): mark all tokens as output to suppress llama.cpp "overriding" INFO

Force logits_all=True in Llama.embed() so per-token batch.logits[i] flags are
all set, regardless of pooling type. Previously, when pooling != NONE,
add_sequence flipped most tokens to logits[i]=False, and llama.cpp printed

  init: embeddings required but some input tokens were not marked as outputs -> overriding

once per embed input and silently overrode the flags.

Pooling type only changes how per-token outputs are read back in decode_batch
(llama_get_embeddings vs llama_get_embeddings_seq), not whether they are
produced — so this aligns the per-token flags with what llama.cpp already
needed and removes the noisy per-input override message.

Fixes abetlen#2208.


Development

Successfully merging this pull request may close these issues.

Called model.embed generates INFO messages for each input
