
fix(embed): mark all tokens as output to suppress llama.cpp 'overriding' INFO (#2208) #2212

Open

Anai-Guo wants to merge 1 commit into abetlen:main from Anai-Guo:fix/embed-mark-all-tokens-as-output

Conversation

@Anai-Guo
Contributor

Summary

Fixes #2208.

In Llama.embed() (llama_cpp/llama.py:1043), logits_all was set based on pooling type:

logits_all = pooling_type == llama_cpp.LLAMA_POOLING_TYPE_NONE

Since LlamaBatch.add_sequence writes that value into the per-token batch.logits[i] flags, any pooling_type != NONE (the common case for sentence embeddings, i.e. MEAN or CLS) leaves most tokens with logits[i] = False and only the last token set to True. llama.cpp upstream then prints

init: embeddings required but some input tokens were not marked as outputs -> overriding

once per embedding input (because every Python-side embed() call decodes one sequence at a time through decode_batch) and silently overrides the flags to True.
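To make the flag pattern concrete, here is a small stand-alone illustration. It is not the library's actual add_sequence code, just a sketch of the per-token flags it ended up producing under the old logits_all value:

```python
# Sketch only: mimics how per-token output flags end up when a sequence is added.
# In the old code, logits_all was False for pooled embeddings, so only the last
# token of each input was marked as an output.
def token_output_flags(n_tokens: int, logits_all: bool) -> list[bool]:
    return [logits_all] * (n_tokens - 1) + [True]

print(token_output_flags(5, logits_all=False))  # [False, False, False, False, True]
print(token_output_flags(5, logits_all=True))   # [True, True, True, True, True]
```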

Fix

Force logits_all = True in Llama.embed() and add a comment explaining why.
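In code, the change amounts to replacing the conditional with a constant. A minimal sketch (the real patch also carries a longer explanatory comment):

```python
# Before (simplified):
#   logits_all = pooling_type == llama_cpp.LLAMA_POOLING_TYPE_NONE
# After: llama.cpp needs every input token marked as an output in embedding
# mode regardless of pooling type, so always set the flag.
logits_all = True
```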

Pooling type only affects how per-token outputs are read back in decode_batch (llama_get_embeddings for NONE vs llama_get_embeddings_seq for pooled types, sketched after the list below); it does not affect whether those per-token outputs need to be produced. llama.cpp already requires all tokens to be marked as outputs in embedding mode and was overriding the flags anyway, so this change:

  • Makes the per-token flags match what llama.cpp needs in both pooling modes.
  • Removes the noisy override INFO message (one per embed() input).
  • Is behavior-preserving — embeddings produced for the same input remain identical.
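For reference, the readback branching looks roughly like this. llama_get_embeddings and llama_get_embeddings_seq are the real low-level bindings, but the surrounding function is a simplified stand-in for decode_batch, not the library's actual code:

```python
import llama_cpp

def read_embeddings(ctx, pooling_type, seq_id, n_tokens, n_embd):
    if pooling_type == llama_cpp.LLAMA_POOLING_TYPE_NONE:
        # NONE: one embedding vector per token
        ptr = llama_cpp.llama_get_embeddings(ctx)
        return [ptr[i * n_embd:(i + 1) * n_embd] for i in range(n_tokens)]
    # MEAN/CLS/etc.: one pooled vector for the whole sequence
    ptr = llama_cpp.llama_get_embeddings_seq(ctx, seq_id)
    return ptr[:n_embd]
```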

Related upstream context: ollama/ollama#12381 mentioned in the bug report.

What changed

  • llama_cpp/llama.py: replace one line in Llama.embed() with a constant + a 6-line comment justifying it. No other call sites of add_sequence change.

Test plan

  • Construct Llama(..., embedding=True, pooling_type=LLAMA_POOLING_TYPE_MEAN).
  • Confirm model.embed(["a","b","c"]) no longer prints init: embeddings required ... -> overriding per input.
  • Confirm returned vectors are byte-identical to before for both LLAMA_POOLING_TYPE_NONE and LLAMA_POOLING_TYPE_MEAN (llama.cpp was already overriding internally).
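A minimal script for this check might look like the following; the model path is a placeholder, and the constructor arguments mirror the ones listed above:

```python
import llama_cpp

model = llama_cpp.Llama(
    model_path="./models/embedding-model.gguf",  # placeholder path
    embedding=True,
    pooling_type=llama_cpp.LLAMA_POOLING_TYPE_MEAN,
)

# With the fix, no "embeddings required but some input tokens were not marked
# as outputs -> overriding" INFO line should be printed per input.
vectors = model.embed(["a", "b", "c"])
print(len(vectors), len(vectors[0]))
```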

🤖 Generated with Claude Code

fix(embed): mark all tokens as output to suppress llama.cpp "overriding" INFO

Force logits_all=True in Llama.embed() so per-token batch.logits[i] flags are
all set, regardless of pooling type. Previously, when pooling != NONE,
add_sequence flipped most tokens to logits[i]=False, and llama.cpp printed

  init: embeddings required but some input tokens were not marked as outputs -> overriding

once per embed input and silently overrode the flags.

Pooling type only changes how per-token outputs are read back in decode_batch
(llama_get_embeddings vs llama_get_embeddings_seq), not whether they are
produced — so this aligns the per-token flags with what llama.cpp already
needed and removes the noisy per-input override message.

Fixes abetlen#2208.


Development

Successfully merging this pull request may close these issues.

Called model.embed generates INFO messages for each input
