
[llm_bench] Add hooks for transformers v5 #3696

Draft
sbalandi wants to merge 5 commits into openvinotoolkit:master from sbalandi:hook_tr5

Conversation

@sbalandi
Contributor

Description

CVS-###

Fixes #(issue)

Checklist:

  • This PR follows GenAI Contributing guidelines.
  • Tests have been updated or added to cover the new code.
  • This PR fully addresses the ticket.
  • I have made corresponding changes to the documentation.

@sbalandi sbalandi requested a review from akashchi as a code owner April 13, 2026 12:50
Copilot AI review requested due to automatic review settings April 13, 2026 12:50
@github-actions bot added labels category: llm_bench (Label for tool/llm_bench folder) and category: GHA (CI based on Github actions) on Apr 13, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR updates llm_bench’s Transformers-generation hooks to cover Transformers v5 while reorganizing existing v4.x hook implementations and extending CI to exercise WWB tests against newer Transformers versions.

Changes:

  • Replaced v4.x hook files named *_v43/v45/v51/v52/v55 with version-explicit *_v4_43/.../v4_57 modules, and added new v5 hook modules (hook_sample_v5.py, hook_sample_v5_3.py).
  • Updated llm_bench hook dispatch logic to select hook implementations based on transformers.__version__.
  • Updated Linux CI workflow to run WWB tests with Transformers 5.2.0 and 5.5.0.

Reviewed changes

Copilot reviewed 25 out of 27 changed files in this pull request and generated 8 comments.

File Description
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v55.py Removed legacy v4.55 sampling hook module (old naming).
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v52.py Removed legacy v4.52 sampling hook module (old naming).
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v51.py Removed legacy v4.51 sampling hook module (old naming).
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v45.py Removed legacy v4.45 sampling hook module (old naming).
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v43.py Removed legacy v4.43 sampling hook module (old naming).
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v4_57.py Added v4.57 sampling hook module under version-explicit naming.
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v4_55.py Added v4.55 sampling hook module under version-explicit naming.
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v4_52.py Added v4.52 sampling hook module under version-explicit naming.
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v4_51.py Added v4.51 sampling hook module under version-explicit naming.
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v4_45.py Added v4.45 sampling hook module under version-explicit naming.
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v4_43.py Added v4.43 sampling hook module under version-explicit naming.
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v5.py Added Transformers v5.0–v5.2 sampling hook module.
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v5_3.py Added Transformers v5.3+ sampling hook module.
tools/llm_bench/llm_bench_utils/llm_hook_beam_search/hook_beam_search_v55.py Removed legacy v4.55 beam-search hook module (old naming).
tools/llm_bench/llm_bench_utils/llm_hook_beam_search/hook_beam_search_v52.py Removed legacy v4.52 beam-search hook module (old naming).
tools/llm_bench/llm_bench_utils/llm_hook_beam_search/hook_beam_search_v51.py Removed legacy v4.51 beam-search hook module (old naming).
tools/llm_bench/llm_bench_utils/llm_hook_beam_search/hook_beam_search_v40.py Removed legacy v4.40 beam-search hook module (old naming).
tools/llm_bench/llm_bench_utils/llm_hook_beam_search/hook_beam_search_v4_57.py Added v4.57 beam-search hook module under version-explicit naming.
tools/llm_bench/llm_bench_utils/llm_hook_beam_search/hook_beam_search_v4_55.py Added v4.55 beam-search hook module under version-explicit naming.
tools/llm_bench/llm_bench_utils/llm_hook_beam_search/hook_beam_search_v4_52.py Added v4.52 beam-search hook module under version-explicit naming.
tools/llm_bench/llm_bench_utils/llm_hook_beam_search/hook_beam_search_v4_51.py Added v4.51 beam-search hook module under version-explicit naming.
tools/llm_bench/llm_bench_utils/llm_hook_beam_search/hook_beam_search_v4_40.py Added v4.40 beam-search hook module under version-explicit naming.
tools/llm_bench/llm_bench_utils/hook_greedy_search.py Updated version-based hook selection to include Transformers v5 hooks.
tools/llm_bench/llm_bench_utils/hook_beam_search.py Updated version-based beam-search hook selection to include Transformers v5 hooks.
.github/workflows/linux.yml Added WWB CI runs with Transformers 5.2.0 and 5.5.0.


-type(model)._sample = hook_sample_v57.new_sample
+type(model)._sample = hook_sample_v5_3.new_sample
 if trans_version >= version.parse("5.0"):

Copilot AI Apr 13, 2026


Both if trans_version >= 5.3.0 and if trans_version >= 5.0 will run for Transformers >=5.3, so type(model)._sample gets overwritten and the v5.3+ hook is effectively never used. Change the second condition to elif (or otherwise ensure mutually exclusive branching) so Transformers >=5.3 uses hook_sample_v5_3.new_sample as intended.

Suggested change
if trans_version >= version.parse("5.0"):
elif trans_version >= version.parse("5.0"):
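The dispatch problem described above can be reproduced in isolation. A minimal sketch of a mutually exclusive version chain, with stand-in hook names (the function and return strings are illustrative, not part of the PR):

```python
from packaging import version

def select_hook(transformers_version: str) -> str:
    """Pick exactly one hook per version: elif makes the branches exclusive,
    so a 5.3+ install selects the v5.3 hook and is never overwritten."""
    trans_version = version.parse(transformers_version)
    if trans_version >= version.parse("5.3.0"):
        return "hook_sample_v5_3"
    elif trans_version >= version.parse("5.0"):
        return "hook_sample_v5"
    elif trans_version >= version.parse("4.57.0"):
        return "hook_sample_v4_57"
    else:
        return "hook_sample_legacy"

print(select_hook("5.5.0"))   # hook_sample_v5_3
print(select_hook("5.2.0"))   # hook_sample_v5
print(select_hook("4.57.1"))  # hook_sample_v4_57
```

With the second branch left as a plain `if`, both branches would run for 5.3+ and the later assignment would win, which is exactly the overwrite the review flags.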

Comment on lines +390 to 394
elif trans_version >= version.parse("4.57.0"):
import tools.llm_bench.llm_bench_utils.llm_hook_sample.hook_sample_v4_57 as hook_sample_v4_57

type(model)._sample = hook_sample_v4_57.new_sample
elif trans_version >= version.parse("4.55.0"):

Copilot AI Apr 13, 2026


These branches import hook modules via tools.llm_bench..., while the rest of llm_bench imports use the llm_bench_utils... package (and the README examples run python benchmark.py from tools/llm_bench). Using the tools. prefix can raise ModuleNotFoundError depending on the working directory/PYTHONPATH. Please align these imports with the existing llm_bench_utils.llm_hook_sample... style for consistent runtime behavior.

Comment on lines +15 to +18
if version.parse(transformers.__version__) >= version.parse("5.3.0"):
import tools.llm_bench.llm_bench_utils.llm_hook_beam_search.hook_beam_search_v5_3 as hook_beam_search_v5_3

-new_beam_search = hook_beam_search_v57.new_beam_search_v57
+new_beam_search = hook_beam_search_v5_3.new_beam_search

Copilot AI Apr 13, 2026


hook_beam_search.py now imports version-specific implementations via tools.llm_bench.... Most of llm_bench is imported/run as the llm_bench_utils... package (e.g., from tools/llm_bench/benchmark.py), so this can break execution when the repo root isn’t on PYTHONPATH. Prefer the existing llm_bench_utils.llm_hook_beam_search... import style to keep the hook usable in the standard invocation flow.

Comment on lines +142 to +146
while self._has_unfinished_sequences(this_peer_finished, synced_gpus, device=input_ids.device):
if prefill_consumed:
model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
with self._optimize_model_for_decode():
outputs = model_forward(**model_inputs, return_dict=True)

Copilot AI Apr 13, 2026


This Transformers v5 sampling hook is described as collecting latency, but it never records per-iteration total time or forward/inference time into hook_greedy.tm_list / hook_greedy.tm_infer_list (unlike the v4.x hooks). As a result, llm_bench latency metrics will be empty/incorrect for Transformers v5. Please add timing collection around the forward call and the loop body, consistent with the existing v4.x hook implementations.
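A minimal sketch of the kind of per-iteration instrumentation the v4.x hooks collect, assuming the `tm_list`/`tm_infer_list` collectors from `hook_greedy_search` (the forward call is a stub here; names mirror the source but the loop body is illustrative):

```python
import time

tm_list = []        # whole-iteration latency, as hook_greedy.tm_list collects
tm_infer_list = []  # forward-pass-only latency, as hook_greedy.tm_infer_list

def model_forward_stub():
    # Stand-in for `model_forward(**model_inputs, return_dict=True)`.
    time.sleep(0.001)

for _ in range(3):  # stand-in for `while self._has_unfinished_sequences(...)`
    tic = time.perf_counter()
    tic_infer = time.perf_counter()
    model_forward_stub()
    tm_infer_list.append(time.perf_counter() - tic_infer)
    # ... logits processing, token selection, model_kwargs update ...
    tm_list.append(time.perf_counter() - tic)
```

After generation, llm_bench can then report per-token latency from `tm_list` and pure inference time from `tm_infer_list`, as it does for the v4.x paths.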

Comment on lines +144 to +148
model_inputs = self.prepare_inputs_for_generation(
input_ids, next_sequence_length=next_sequence_length, **model_kwargs
)
with self._optimize_model_for_decode():
outputs = model_forward(**model_inputs, return_dict=True)

Copilot AI Apr 13, 2026


This Transformers v5.3+ sampling hook doesn’t append any timings to hook_greedy.tm_list / hook_greedy.tm_infer_list, even though the hook’s purpose is latency collection. That will produce empty/incorrect benchmark latency outputs for Transformers v5.3+. Please add timing collection around the decode forward call and the full iteration, consistent with the v4.x hooks.

Comment on lines +124 to +128
if self.config._attn_implementation == "flash_attention_2" and getattr(
model_kwargs.get("past_key_values"), "is_compileable", False
):
if generation_config.compile_config is None:
generation_config.compile_config = CompileConfig(fullgraph=False)

Copilot AI Apr 13, 2026


CompileConfig is referenced when generation_config.compile_config is None, but it is not imported in this module. If auto-compile is enabled, this will raise a NameError. Import CompileConfig from the appropriate Transformers module (matching the upstream version you copied from) or avoid constructing it here.
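If the construction is kept, one defensive option is to resolve the class lazily so a missing import degrades gracefully instead of raising `NameError`. A hypothetical sketch (the helper is not part of the PR; stdlib names are used in the demo so it runs anywhere):

```python
def optional_import(module: str, attr: str):
    """Return module.attr if importable, else None."""
    try:
        mod = __import__(module, fromlist=[attr])
        return getattr(mod, attr)
    except (ImportError, AttributeError):
        return None

# In the hook this would be:
#   CompileConfig = optional_import("transformers", "CompileConfig")
# and auto-compile setup would be skipped when it resolves to None.

# Stdlib demo:
print(optional_import("json", "dumps") is not None)  # True
print(optional_import("no_such_mod", "X") is None)   # True
```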

Comment thread .github/workflows/linux.yml Outdated
Comment on lines +620 to +624
- name: 'WWB tests'
cmd: 'python -m pytest -v ./tools/who_what_benchmark/tests -m "not nanollava"'
cmd: |
python -m pip install transformers==5.2.0
python -m pip show transformers
python -m pytest -v ./tools/who_what_benchmark/tests -m "not nanollava"

Copilot AI Apr 13, 2026


There are now two matrix entries with the identical name WWB tests, which will show up as duplicate job names in CI and make failures harder to triage. Consider renaming them to include the Transformers version (e.g., WWB tests (transformers 5.2.0) / WWB tests (transformers 5.5.0)).

@@ -618,7 +618,17 @@ jobs:
run_condition: ${{ fromJSON(needs.smart_ci.outputs.affected_components).RAG.test }}
timeout: 30
- name: 'WWB tests'

Copilot AI Apr 13, 2026


PR description still contains placeholders (e.g., CVS-###, Fixes #(issue)) and the checklist is not filled out. Please update the PR description to match the repository PR template/checklist before merging so reviewers can validate scope, tests, and docs updates.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 29 out of 31 changed files in this pull request and generated 7 comments.

Comment on lines +391 to +413
import tools.llm_bench.llm_bench_utils.llm_hook_sample.hook_sample_v4_57 as hook_sample_v4_57

type(model)._sample = hook_sample_v4_57.new_sample
elif trans_version >= version.parse("4.55.0"):
import llm_bench_utils.llm_hook_sample.hook_sample_v55 as hook_sample_v55
model._sample = hook_sample_v55.new_sample.__get__(model, model.__class__)
import tools.llm_bench.llm_bench_utils.llm_hook_sample.hook_sample_v4_55 as hook_sample_v4_55

model._sample = hook_sample_v4_55.new_sample.__get__(model, model.__class__)
elif trans_version >= version.parse('4.52.0'):
import llm_bench_utils.llm_hook_sample.hook_sample_v52 as hook_sample_v52
model._sample = hook_sample_v52.new_sample.__get__(model, model.__class__)
import tools.llm_bench.llm_bench_utils.llm_hook_sample.hook_sample_v4_52 as hook_sample_v4_52

model._sample = hook_sample_v4_52.new_sample.__get__(model, model.__class__)
elif trans_version >= version.parse('4.51.0'):
import llm_bench_utils.llm_hook_sample.hook_sample_v51 as hook_sample_v51
model._sample = hook_sample_v51.new_sample.__get__(model, model.__class__)
import tools.llm_bench.llm_bench_utils.llm_hook_sample.hook_sample_v4_51 as hook_sample_v4_51

model._sample = hook_sample_v4_51.new_sample.__get__(model, model.__class__)
elif trans_version >= version.parse('4.45.0'):
import llm_bench_utils.llm_hook_sample.hook_sample_v45 as hook_sample_v45
model._sample = hook_sample_v45.new_sample.__get__(model, model.__class__)
import tools.llm_bench.llm_bench_utils.llm_hook_sample.hook_sample_v4_45 as hook_sample_v4_45

model._sample = hook_sample_v4_45.new_sample.__get__(model, model.__class__)
elif trans_version >= version.parse('4.43.0'):
import llm_bench_utils.llm_hook_sample.hook_sample_v43 as hook_sample_v43
model._sample = hook_sample_v43.new_sample.__get__(model, model.__class__)
import tools.llm_bench.llm_bench_utils.llm_hook_sample.hook_sample_v4_43 as hook_sample_v4_43

model._sample = hook_sample_v4_43.new_sample.__get__(model, model.__class__)

Copilot AI Apr 14, 2026


These imports use the fully-qualified module path starting with tools.llm_bench..., but tools/llm_bench/README.md instructs running the tool as python benchmark.py from inside tools/llm_bench/. In that execution mode, the repo root (and therefore the tools package) is not on sys.path, so these imports will fail. Prefer importing from llm_bench_utils... consistently (as is done elsewhere in this file) or explicitly ensure the repo root is on PYTHONPATH before using tools.* imports.

Comment on lines 2 to 3
transformers[sentencepiece]>=4.35.2,<5.4.0
sentence-transformers>=2.2.2,<=5.3.0

Copilot AI Apr 14, 2026


This file still pins transformers[sentencepiece] to <5.4.0, but CI is now explicitly testing Transformers 5.5 for WWB. Please align the declared dependency range with the versions you intend to support/test (either relax the upper bound or drop the 5.5 CI run).
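The mismatch is easy to check mechanically with `packaging`; a small sketch (the range is copied from the requirements line quoted above):

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

spec = SpecifierSet(">=4.35.2,<5.4.0")  # declared transformers range
for v in ("5.2.0", "5.5.0"):            # versions CI now installs for WWB
    print(v, Version(v) in spec)
# 5.2.0 True
# 5.5.0 False  <- CI exercises a version outside the declared bound
```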

Comment on lines +141 to +160
while self._has_unfinished_sequences(this_peer_finished, synced_gpus, device=input_ids.device):
if prefill_consumed:
next_sequence_length = 1 if model_kwargs["use_cache"] else None
model_inputs = self.prepare_inputs_for_generation(
input_ids, next_sequence_length=next_sequence_length, **model_kwargs
)
with self._optimize_model_for_decode():
outputs = model_forward(**model_inputs, return_dict=True)
prefill_consumed = True
model_kwargs = self._update_model_kwargs_for_generation(
outputs,
model_kwargs,
is_encoder_decoder=self.config.is_encoder_decoder,
)
if synced_gpus and this_peer_finished:
continue

# Copy is needed to avoid keeping a hanging ref to outputs.logits which may be very large for first iteration
# (the clone itself is always small)
next_token_logits = outputs.logits[:, -1, :].to(copy=True, dtype=torch.float32, device=input_ids.device)

Copilot AI Apr 14, 2026


Same as hook_sample_v5: this v5.3+ hook imports time and llm_bench_utils.hook_greedy_search but does not record any timings. If llm_bench relies on _sample timing for do_sample runs, please add timing collection (total step + forward-pass time) consistent with the v4.x hook implementations.

Comment on lines +382 to +390
if trans_version >= version.parse("5.3.0"):
import llm_bench_utils.llm_hook_sample.hook_sample_v5_3 as hook_sample_v5_3

type(model)._sample = hook_sample_v57.new_sample
type(model)._sample = hook_sample_v5_3.new_sample
if trans_version >= version.parse("5.0"):
import llm_bench_utils.llm_hook_sample.hook_sample_v5 as hook_sample_v5

type(model)._sample = hook_sample_v5.new_sample
elif trans_version >= version.parse("4.57.0"):

Copilot AI Apr 14, 2026


For Transformers >= 5.3.0 this code sets _sample twice because the second condition is an if, not an elif. As written, 5.3+ will first bind hook_sample_v5_3.new_sample and then immediately overwrite it with hook_sample_v5.new_sample, so the v5.3-specific hook is never used. Convert this to a single if/elif/... chain (or match) so only one implementation is selected per version.

Comment on lines +15 to +42
if version.parse(transformers.__version__) >= version.parse("5.3.0"):
import tools.llm_bench.llm_bench_utils.llm_hook_beam_search.hook_beam_search_v5_3 as hook_beam_search_v5_3

new_beam_search = hook_beam_search_v57.new_beam_search_v57
new_beam_search = hook_beam_search_v5_3.new_beam_search
elif version.parse(transformers.__version__) >= version.parse("5.0"):
import tools.llm_bench.llm_bench_utils.llm_hook_beam_search.hook_beam_search_v5 as hook_beam_search_v5

new_beam_search = hook_beam_search_v5.new_beam_search
elif version.parse(transformers.__version__) >= version.parse("4.57.0"):
import tools.llm_bench.llm_bench_utils.llm_hook_beam_search.hook_beam_search_v4_57 as hook_beam_search_v4_57

new_beam_search = hook_beam_search_v4_57.new_beam_search_v57
elif version.parse(transformers.__version__) >= version.parse("4.55.0"):
import llm_bench_utils.llm_hook_beam_search.hook_beam_search_v55 as hook_beam_search_v55
new_beam_search = hook_beam_search_v55.new_beam_search_v55
import tools.llm_bench.llm_bench_utils.llm_hook_beam_search.hook_beam_search_v4_55 as hook_beam_search_v4_55

new_beam_search = hook_beam_search_v4_55.new_beam_search_v55
elif version.parse(transformers.__version__) >= version.parse("4.52.0"):
import llm_bench_utils.llm_hook_beam_search.hook_beam_search_v52 as hook_beam_search_v52
new_beam_search = hook_beam_search_v52.new_beam_search_v52
import tools.llm_bench.llm_bench_utils.llm_hook_beam_search.hook_beam_search_v4_52 as hook_beam_search_v4_52

new_beam_search = hook_beam_search_v4_52.new_beam_search_v52
elif version.parse(transformers.__version__) >= version.parse("4.51.0"):
import llm_bench_utils.llm_hook_beam_search.hook_beam_search_v51 as hook_beam_search_v51
new_beam_search = hook_beam_search_v51.new_beam_search_v51
import tools.llm_bench.llm_bench_utils.llm_hook_beam_search.hook_beam_search_v4_51 as hook_beam_search_v4_51

new_beam_search = hook_beam_search_v4_51.new_beam_search_v51
else:
import llm_bench_utils.llm_hook_beam_search.hook_beam_search_v40 as hook_beam_search_v40
new_beam_search = hook_beam_search_v40.new_beam_search_v40
import tools.llm_bench.llm_bench_utils.llm_hook_beam_search.hook_beam_search_v4_40 as hook_beam_search_v4_40

new_beam_search = hook_beam_search_v4_40.new_beam_search_v40

Copilot AI Apr 14, 2026


All of these imports use tools.llm_bench... module paths. However, tools/llm_bench/README.md documents running the benchmark as python benchmark.py from within tools/llm_bench/, which does not put the repo root on sys.path. In that common invocation, import tools... will raise ModuleNotFoundError. Use llm_bench_utils.llm_hook_beam_search... imports (consistent with the rest of llm_bench) or adjust the entrypoint so the repo root is on PYTHONPATH.

@@ -10,7 +10,8 @@ torch>=2.1.0,<=2.11
transformers[sentencepiece]>=4.40.0,<5.4.0

Copilot AI Apr 14, 2026


This file still constrains transformers[sentencepiece] to <5.4.0, but the PR adds hooks/CI for Transformers v5.5 (see workflow changes) and the new hook module comments claim support beyond 5.4. If the intent is to support 5.5, update the version constraint accordingly; otherwise, CI should not test versions outside the declared range.

Suggested change
transformers[sentencepiece]>=4.40.0,<5.4.0
transformers[sentencepiece]>=4.40.0,<5.6.0

Comment on lines +142 to +150
while self._has_unfinished_sequences(this_peer_finished, synced_gpus, device=input_ids.device):
if prefill_consumed:
model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
with self._optimize_model_for_decode():
outputs = model_forward(**model_inputs, return_dict=True)
prefill_consumed = True
model_kwargs = self._update_model_kwargs_for_generation(
outputs,
model_kwargs,

Copilot AI Apr 14, 2026


These new _sample hooks claim to add latency collection, but there is no timing instrumentation (no time.perf_counter() usage and no writes to hook_greedy.tm_list/tm_infer_list). This looks like a regression compared to the previous v4.x hook implementations where per-step and infer timings were recorded. If sampling latency is still needed for llm_bench metrics, add the same timing updates inside the generation loop and around the model forward call.

Copilot AI review requested due to automatic review settings April 15, 2026 11:10
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 29 out of 31 changed files in this pull request and generated 6 comments.

Comment thread tests/python_tests/requirements.txt Outdated
--extra-index-url https://download.pytorch.org/whl/cpu
diffusers==0.37.1
https://github.com/huggingface/optimum-intel/archive/2c48d6430c265ac259c1b264f3e2c4025cdd7b76.tar.gz#egg=optimum-intel
optimum-intel @ https://github.com/huggingface/optimum-intel.git@transformers-v5

Copilot AI Apr 15, 2026


Using an unpinned VCS dependency (optimum-intel @ ...@transformers-v5) makes CI/test installs non-reproducible and is a supply-chain risk because the branch can change over time. Prefer pinning to an immutable commit SHA or a released tag/version.

Suggested change
optimum-intel @ https://github.com/huggingface/optimum-intel.git@transformers-v5
optimum-intel==1.25.2

Comment thread samples/export-requirements.txt Outdated
Comment on lines +4 to +5
# https://github.com/huggingface/optimum-intel/archive/2c48d6430c265ac259c1b264f3e2c4025cdd7b76.tar.gz#egg=optimum-intel
optimum-intel @ https://github.com/huggingface/optimum-intel.git@transformers-v5

Copilot AI Apr 15, 2026


Using an unpinned VCS dependency (optimum-intel @ ...@transformers-v5) makes installs non-reproducible and is a supply-chain risk because the branch can change over time. Prefer pinning to an immutable commit SHA or a released tag/version.

Suggested change
# https://github.com/huggingface/optimum-intel/archive/2c48d6430c265ac259c1b264f3e2c4025cdd7b76.tar.gz#egg=optimum-intel
optimum-intel @ https://github.com/huggingface/optimum-intel.git@transformers-v5
optimum-intel @ https://github.com/huggingface/optimum-intel/archive/2c48d6430c265ac259c1b264f3e2c4025cdd7b76.tar.gz#egg=optimum-intel

Comment on lines +5 to +6
# optimum-intel[nncf]>=1.19.0,<=1.27.0
optimum-intel @ https://github.com/huggingface/optimum-intel.git@transformers-v5

Copilot AI Apr 15, 2026


Using an unpinned VCS dependency (optimum-intel @ ...@transformers-v5) makes installs non-reproducible and is a supply-chain risk because the branch can change over time. Prefer pinning to an immutable commit SHA or a released tag/version (as is already done elsewhere in this workflow for optimum-intel).

Comment on lines 620 to 632
- name: 'WWB tests'
cmd: 'python -m pytest -v ./tools/who_what_benchmark/tests -m "not nanollava"'
cmd: |
python -m pip install transformers==5.2.0 git+https://github.com/huggingface/optimum-intel.git@transformers-v5
python -m pip show transformers
python -m pytest -v ./tools/who_what_benchmark/tests -m "not nanollava"
run_condition: ${{ fromJSON(needs.smart_ci.outputs.affected_components).WWB.test }}
timeout: 120
- name: 'WWB tests'
cmd: |
python -m pip install transformers==5.5.0 git+https://github.com/huggingface/optimum-intel.git@transformers-v5
python -m pip show transformers
python -m pytest -v ./tools/who_what_benchmark/tests -m "not nanollava"
run_condition: ${{ fromJSON(needs.smart_ci.outputs.affected_components).WWB.test }}

Copilot AI Apr 15, 2026


The matrix now has two entries both named 'WWB tests', which makes CI logs ambiguous. Consider naming them distinctly (e.g., include the transformers version). Also, this installs transformers==5.5.0 while the WWB requirements file currently constrains transformers to <5.4.0; please align the declared constraints with what CI is testing (or vice versa) to avoid confusion about supported versions.

Comment on lines +55 to +149
# Transformers version: v5.3.0, v5.4.0, v5.5.0, v5.5.1, v5.5.2, v5.5.3
# https://github.com/huggingface/transformers/blob/v5.3.0/src/transformers/generation/utils.py#L3068
# Add the function of collecting latency
def new_sample(
self,
input_ids: torch.LongTensor,
logits_processor: LogitsProcessorList,
stopping_criteria: StoppingCriteriaList,
generation_config: GenerationConfig,
synced_gpus: bool = False,
streamer: Optional["BaseStreamer"] = None,
**model_kwargs,
) -> GenerateNonBeamOutput | torch.LongTensor:
r"""
Generates sequences of token ids for models with a language modeling head using **multinomial sampling** and
can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.

Parameters:
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
The sequence used as a prompt for the generation.
logits_processor (`LogitsProcessorList`):
An instance of [`LogitsProcessorList`]. List of instances of class derived from [`LogitsProcessor`]
used to modify the prediction scores of the language modeling head applied at each generation step.
stopping_criteria (`StoppingCriteriaList`):
An instance of [`StoppingCriteriaList`]. List of instances of class derived from [`StoppingCriteria`]
used to tell if the generation loop should stop.
generation_config ([`~generation.GenerationConfig`]):
The generation configuration to be used as parametrization of the decoding method.
synced_gpus (`bool`):
Whether to continue running the while loop until max_length (needed to avoid deadlocking with
`FullyShardedDataParallel` and DeepSpeed ZeRO Stage 3).
streamer (`BaseStreamer`, *optional*):
Streamer object that will be used to stream the generated sequences. Generated tokens are passed
through `streamer.put(token_ids)` and the streamer is responsible for any further processing.
model_kwargs:
Additional model specific kwargs will be forwarded to the `forward` function of the model. If model is
an encoder-decoder model the kwargs should include `encoder_outputs`.

Return:
[`~generation.GenerateDecoderOnlyOutput`], [`~generation.GenerateEncoderDecoderOutput`] or `torch.LongTensor`:
A `torch.LongTensor` containing the generated tokens (default behaviour) or a
[`~generation.GenerateDecoderOnlyOutput`] if `model.config.is_encoder_decoder=False` and
`return_dict_in_generate=True` or a [`~generation.GenerateEncoderDecoderOutput`] if
`model.config.is_encoder_decoder=True`.
"""
# init values
pad_token_id = generation_config._pad_token_tensor
output_attentions = generation_config.output_attentions
output_hidden_states = generation_config.output_hidden_states
output_scores = generation_config.output_scores
output_logits = generation_config.output_logits
return_dict_in_generate = generation_config.return_dict_in_generate
has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria)
do_sample = generation_config.do_sample

# init attention / hidden states / scores tuples
scores = () if (return_dict_in_generate and output_scores) else None
raw_logits = () if (return_dict_in_generate and output_logits) else None
decoder_attentions = () if (return_dict_in_generate and output_attentions) else None
cross_attentions = () if (return_dict_in_generate and output_attentions) else None
decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None

# if model is an encoder-decoder, retrieve encoder attention weights and hidden states
if return_dict_in_generate and self.config.is_encoder_decoder:
encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None
encoder_hidden_states = model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None

# keep track of which sequences are already finished
batch_size = input_ids.shape[0]
this_peer_finished = False
unfinished_sequences = torch.ones(batch_size, dtype=torch.long, device=input_ids.device)

model_forward = (
self.get_compiled_call(generation_config.compile_config)
if self._valid_auto_compile_criteria(model_kwargs, generation_config)
else self.__call__
)

prefill_consumed = False
outputs = self._prefill(
input_ids,
generation_config,
model_kwargs,
is_first_iteration=not generation_config.is_assistant,
)

while self._has_unfinished_sequences(this_peer_finished, synced_gpus, device=input_ids.device):
if prefill_consumed:
next_sequence_length = 1 if model_kwargs["use_cache"] else None
model_inputs = self.prepare_inputs_for_generation(
input_ids, next_sequence_length=next_sequence_length, **model_kwargs
)
with self._optimize_model_for_decode():
outputs = model_forward(**model_inputs, return_dict=True)
prefill_consumed = True

Copilot AI Apr 15, 2026


This hook is documented as adding latency collection, but unlike the existing v4.x hooks it never records per-iteration timing (e.g., updating hook_greedy.tm_infer_list / hook_greedy.tm_list). As a result, llm_bench will not report token/inference latency for transformers v5.3+ paths. Please add the same timing instrumentation used in hook_sample_v4_57.py around the model forward pass and the full loop iteration.

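The timing pattern the review asks to restore can be sketched as follows. This is a minimal stand-alone illustration, not the actual hook code: `model_forward` is a placeholder, and the list names mirror the `hook_greedy.tm_list` / `hook_greedy.tm_infer_list` attributes mentioned in the comment.

```python
import time

# Stand-ins for the per-iteration latency lists the v4.x hooks update
# (names taken from the review comment; hypothetical here).
tm_list = []        # wall time of each full decode iteration
tm_infer_list = []  # wall time of each model forward call


def model_forward(**model_inputs):
    # Placeholder for the real compiled/plain forward call.
    return {"logits": None}


def decode_step(model_inputs):
    # Pattern from the v4.x hooks: time the forward pass and the whole
    # iteration separately, appending to the two lists on every step.
    iter_start = time.perf_counter()
    infer_start = time.perf_counter()
    outputs = model_forward(**model_inputs)
    tm_infer_list.append(time.perf_counter() - infer_start)
    # ... token sampling and model_kwargs update would happen here ...
    tm_list.append(time.perf_counter() - iter_start)
    return outputs


for _ in range(3):
    decode_step({})
```

Wrapping both spans separately is what lets llm_bench report forward-pass latency and whole-iteration latency as distinct metrics.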
Comment thread tools/llm_bench/requirements.txt Outdated
# optimum is in dependency list of optimum-intel
optimum-intel[nncf]>=1.25.0,<=1.27.0
# optimum-intel[nncf]>=1.25.0,<=1.27.0
optimum-intel @ https://github.com/huggingface/optimum-intel.git@transformers-v5
Copilot AI Apr 15, 2026

Using an unpinned VCS dependency (optimum-intel @ ...@transformers-v5) makes installs non-reproducible and is a supply-chain risk because the branch can change over time. Prefer pinning to an immutable commit SHA or a released tag/version.

Suggested change:
- optimum-intel @ https://github.com/huggingface/optimum-intel.git@transformers-v5
+ optimum-intel==1.27.0

Copilot AI review requested due to automatic review settings April 15, 2026 12:17

Copilot AI left a comment


Pull request overview

Copilot reviewed 29 out of 31 changed files in this pull request and generated 1 comment.

Comment on lines +127 to +156
model_forward = (
    self.get_compiled_call(generation_config.compile_config)
    if self._valid_auto_compile_criteria(model_kwargs, generation_config)
    else self.__call__
)

prefill_consumed = False
outputs = self._prefill(
    input_ids,
    generation_config,
    model_kwargs,
    is_first_iteration=not generation_config.is_assistant,
)

while self._has_unfinished_sequences(this_peer_finished, synced_gpus, device=input_ids.device):
    if prefill_consumed:
        next_sequence_length = 1 if model_kwargs["use_cache"] else None
        model_inputs = self.prepare_inputs_for_generation(
            input_ids, next_sequence_length=next_sequence_length, **model_kwargs
        )
        with self._optimize_model_for_decode():
            outputs = model_forward(**model_inputs, return_dict=True)
    prefill_consumed = True
    model_kwargs = self._update_model_kwargs_for_generation(
        outputs,
        model_kwargs,
        is_encoder_decoder=self.config.is_encoder_decoder,
    )
    if synced_gpus and this_peer_finished:
        continue
Copilot AI Apr 15, 2026

The v5.3+ sampling hook is supposed to collect per-step latency (as done in v4.x hooks), but this implementation never appends to hook_greedy.tm_infer_list / hook_greedy.tm_list and therefore breaks latency reporting for do_sample in llm_bench. Add the same timing instrumentation around the decode forward pass and per-iteration loop used in the v4.x hook implementations.

Copilot AI review requested due to automatic review settings April 20, 2026 16:43
@github-actions github-actions Bot added the category: whisper Whisper pipeline label Apr 20, 2026

Copilot AI left a comment


Pull request overview

Copilot reviewed 37 out of 39 changed files in this pull request and generated 3 comments.

Comment thread .github/workflows/linux.yml Outdated
Comment on lines +1071 to +1075
- name: Install dependencies
  uses: ./src/.github/actions/install_wheel
  with:
    packages: openvino;openvino_tokenizers[transformers]
    requirements_files: "${{ env.SRC_DIR }}/tests/python_tests/requirements.txt"
    local_wheel_dir: ${{ env.INSTALL_DIR }}/wheels
  run: |
    python -m pip install -r ${{ env.SRC_DIR }}/tests/python_tests/requirements.txt
    python -m pip install openvino openvino_tokenizers[transformers] --extra-index-url=https://storage.openvinotoolkit.org/simple/wheels/nightly
Copilot AI Apr 20, 2026

This step defines both uses: and run: in the same step, which is invalid GitHub Actions YAML and will fail workflow parsing. Convert this step to either a composite action step (uses + with) or a shell step (run) but not both.

  timeout: 60
- name: 'VLM (qwen3-vl)'
  cmd: |
    python -m pip install transformers==4.57.0
Copilot AI Apr 20, 2026

The qwen3-vl job installs transformers==4.57.0 twice (first alone, then again with the optimum-intel git dependency). Drop the redundant install to reduce time and avoid potential resolver churn.

Suggested change:
- python -m pip install transformers==4.57.0

Comment on lines +482 to 485
@pytest.mark.transformers_lower_v5
@pytest.mark.parametrize("model_descr", get_whisper_models_list(tiny_only=True))
@pytest.mark.parametrize("sample_from_dataset", [{"language": "en", "sample_id": 1}], indirect=True)
@pytest.mark.xfail(condition=(sys.platform == "darwin"), reason="Ticket - 173169")
Copilot AI Apr 20, 2026

@pytest.mark.transformers_lower_v5 is newly introduced but is not registered in tests/python_tests/pytest.ini markers list. Add it there (or reuse an existing marker) to avoid unknown-marker warnings/errors in CI (especially if --strict-markers is enabled).

Copilot AI review requested due to automatic review settings April 21, 2026 18:00
@github-actions github-actions Bot added category: visual language Visual language pipeline category: tokenizers Tokenizer class or submodule update category: LLM samples GenAI LLM samples no-match-files category: GH Pages Docs Github Pages documentation category: VLM samples GenAI VLM samples category: Video generation samples labels Apr 21, 2026

Copilot AI left a comment


Pull request overview

Copilot reviewed 60 out of 62 changed files in this pull request and generated 8 comments.

Comment on lines +185 to 190
if hf_generation_config.num_return_sequences is None:
    num_return_sequences = 1

for idx, hf_encoded_out in enumerate(hf_encoded_outputs.sequences):
-   prompt_idx = idx // hf_generation_config.num_return_sequences
+   prompt_idx = idx // num_return_sequences
    prompt_len = 0 if generation_configs.echo else input_ids[prompt_idx].numel()
Copilot AI Apr 21, 2026

num_return_sequences is only set when hf_generation_config.num_return_sequences is None, but it’s used unconditionally in the loop below. When num_return_sequences is not None, this will raise UnboundLocalError. Initialize it to hf_generation_config.num_return_sequences or 1 before the loop.

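A minimal repro of the `UnboundLocalError` and one possible fix: bind `num_return_sequences` unconditionally before the loop. `Cfg` below is a hypothetical stand-in for `transformers.GenerationConfig`, modeling only the field relevant to the bug.

```python
class Cfg:
    # Hypothetical stand-in for transformers.GenerationConfig.
    num_return_sequences = None  # may also be set to an int


def prompt_index(idx, cfg):
    # Fix: always bind num_return_sequences, falling back to 1 when unset,
    # instead of assigning it only inside an `if ... is None` branch.
    num_return_sequences = cfg.num_return_sequences or 1
    return idx // num_return_sequences


cfg = Cfg()
print(prompt_index(5, cfg))   # unset -> divisor defaults to 1
cfg.num_return_sequences = 2
print(prompt_index(5, cfg))
```

With the original pattern, any config that does set `num_return_sequences` would skip the assignment and hit the unbound local at first use.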
Comment on lines +382 to +390
if trans_version >= version.parse("5.3.0"):
    import llm_bench_utils.llm_hook_sample.hook_sample_v5_3 as hook_sample_v5_3

-   type(model)._sample = hook_sample_v57.new_sample
+   type(model)._sample = hook_sample_v5_3.new_sample
if trans_version >= version.parse("5.0"):
    import llm_bench_utils.llm_hook_sample.hook_sample_v5 as hook_sample_v5

    type(model)._sample = hook_sample_v5.new_sample
elif trans_version >= version.parse("4.57.0"):
Copilot AI Apr 21, 2026

The version checks are not mutually exclusive: for transformers >= 5.3, the first if sets _sample to v5_3, but the following if trans_version >= 5.0 then overwrites it with v5. This makes the v5.3 hook unreachable. Change the second if to elif (or otherwise ensure a single branch runs).

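The fixed dispatch can be sketched as a single `if`/`elif` chain. The hook names mirror the modules in this PR; the tuple-based `_parse` is a simplified stand-in for `packaging.version.parse`, used only to keep the example dependency-free.

```python
def _parse(v: str):
    # Simplified numeric stand-in for packaging.version.parse
    # (assumption: plain dotted release versions only).
    return tuple(int(p) for p in v.split("."))


def pick_sample_hook(transformers_version: str):
    # Mutually exclusive branches: `elif` keeps the most specific hook
    # instead of letting a later `if` overwrite it.
    v = _parse(transformers_version)
    if v >= _parse("5.3.0"):
        return "hook_sample_v5_3"
    elif v >= _parse("5.0"):
        return "hook_sample_v5"
    elif v >= _parse("4.57.0"):
        return "hook_sample_v4_57"
    return None


print(pick_sample_hook("5.4.0"))  # hook_sample_v5_3, no longer overwritten
```

With two independent `if` statements, `pick_sample_hook("5.4.0")` would fall through the first branch's assignment and end up on the v5 hook.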
Comment thread tests/python_tests/test_vlm_pipeline.py Outdated
if is_transformers_version("<", "5.0"):
    processor.audio_tokenizer = None
else:
    processor.pop("audio_tokenizer", None)
Copilot AI Apr 21, 2026

processor is a transformers ProcessorMixin instance, not a dict, so processor.pop("audio_tokenizer", None) will raise AttributeError. If you need to remove this field for transformers>=5, use delattr(processor, "audio_tokenizer") (guarded with hasattr) or set it to None consistently.

Suggested change:
- processor.pop("audio_tokenizer", None)
+ if hasattr(processor, "audio_tokenizer"):
+     delattr(processor, "audio_tokenizer")

Comment on lines +218 to +220
if Version(transformers_version) < "5.0.0":
    VISUAL_VIDEO_TEXT_MODELS += (("optimum-intel-internal-testing/tiny-random-llava-next-video", "visual-video-text"),)

Copilot AI Apr 21, 2026

Version(transformers_version) < "5.0.0" compares a Version object to a string, which raises TypeError in packaging. Compare against Version("5.0.0") (or use packaging.version.parse) so the conditional works at runtime.

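A small demonstration of the failure mode and the correct form (assumes the `packaging` library is installed, as it is wherever transformers runs):

```python
from packaging.version import Version

transformers_version = "4.57.0"  # illustrative value

# Version only compares against other Version objects; comparing against a
# raw string returns NotImplemented and surfaces as TypeError.
try:
    Version(transformers_version) < "5.0.0"
    raised = False
except TypeError:
    raised = True
print("string comparison raised:", raised)

# Correct form: wrap both sides of the comparison.
print(Version(transformers_version) < Version("5.0.0"))
```

Note that the broken comparison only fails when the conditional actually executes, which is why it can slip through a review but break at runtime.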
Comment on lines +140 to +146
def get_hf_cli_command() -> str:
    # huggingface_hub < 1.0 supports huggingface-cli, huggingface_hub >= 1.0 use hf
    version = importlib.metadata.version("huggingface_hub")
    major = int(version.split(".", 1)[0])
    if major >= 1:
        return "hf"
    return "huggingface-cli"
Copilot AI Apr 21, 2026

importlib.metadata isn’t imported here (import importlib alone doesn’t load the metadata submodule), so importlib.metadata.version(...) will raise AttributeError. Use the already-imported metadata (metadata.version(...)) or import importlib.metadata / from importlib import metadata.

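A self-contained sketch of the fix using the explicit submodule import; the `PackageNotFoundError` fallback is an addition so the example runs even where `huggingface_hub` is absent.

```python
from importlib import metadata  # explicit submodule import; `import importlib` alone is not reliable


def get_hf_cli_command(dist: str = "huggingface_hub") -> str:
    # huggingface_hub < 1.0 ships `huggingface-cli`; >= 1.0 ships `hf`.
    try:
        ver = metadata.version(dist)
    except metadata.PackageNotFoundError:
        # Fallback added for this standalone example only.
        return "huggingface-cli"
    major = int(ver.split(".", 1)[0])
    return "hf" if major >= 1 else "huggingface-cli"


print(get_hf_cli_command("surely-not-an-installed-distribution"))  # huggingface-cli
```

Either `from importlib import metadata` or `import importlib.metadata` works; what fails is attribute access on a bare `import importlib`, which only happens to succeed when something else has already imported the submodule.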
Comment on lines +37 to +43
def get_hf_cli_command() -> str:
    # huggingface_hub < 1.0 supports huggingface-cli, huggingface_hub >= 1.0 use hf
    version = importlib.metadata.version("huggingface_hub")
    major = int(version.split(".", 1)[0])
    if major >= 1:
        return "hf"
    return "huggingface-cli"
Copilot AI Apr 21, 2026

importlib.metadata is not imported (only import importlib), so importlib.metadata.version(...) will fail at runtime. Import importlib.metadata / from importlib import metadata, or use metadata.version(...).

Comment thread tests/python_tests/utils/constants.py Outdated
Comment on lines +27 to +30
if (
    is_transformers_version(">=", "5.0")
    and generation_config is not None
    and generation_config.get("num_beam_groups", 1) > 1
Copilot AI Apr 21, 2026

extra_generate_kwargs() treats generation_config as a dict (generation_config.get(...)), but current call sites pass a transformers.GenerationConfig instance. This will raise AttributeError. Either accept HFGenerationConfig and use getattr(generation_config, "num_beam_groups", 1), or pass generation_config.to_dict() from the caller.

Suggested change:
- if (
-     is_transformers_version(">=", "5.0")
-     and generation_config is not None
-     and generation_config.get("num_beam_groups", 1) > 1
+ num_beam_groups = 1
+ if generation_config is not None:
+     if isinstance(generation_config, dict):
+         num_beam_groups = generation_config.get("num_beam_groups", 1)
+     else:
+         num_beam_groups = getattr(generation_config, "num_beam_groups", 1)
+ if (
+     is_transformers_version(">=", "5.0")
+     and generation_config is not None
+     and num_beam_groups > 1
Comment on lines 134 to 140
generate_outputs = opt_model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    generation_config=hf_generation_config,
    tokenizer=hf_tokenizer,
-   **extra_generate_kwargs(),
+   **extra_generate_kwargs(hf_generation_config),
)
Copilot AI Apr 21, 2026

extra_generate_kwargs(hf_generation_config) passes a transformers.GenerationConfig object, but extra_generate_kwargs currently expects a dict-like object (uses .get). This will raise at runtime; either pass hf_generation_config.to_dict() here or change extra_generate_kwargs to handle GenerationConfig.

Copilot AI review requested due to automatic review settings April 21, 2026 22:52

Copilot AI left a comment


Pull request overview

Copilot reviewed 61 out of 63 changed files in this pull request and generated 7 comments.

Comment thread .github/workflows/linux.yml Outdated
Comment on lines +700 to +705
- name: Install GenAI Wheels
  if: ${{ matrix.test.run_condition }}
  uses: ./src/.github/actions/install_wheel
  with:
    packages: "openvino;openvino_tokenizers[transformers];openvino_genai;whowhatbench"
    requirements_files: "${{ env.SRC_DIR }}/tests/python_tests/requirements.txt"
    local_wheel_dir: ${{ env.INSTALL_DIR }}/wheels

  run: |
    python -m pip install ${{ env.INSTALL_DIR }}/wheels/whowhatbench-*.whl --extra-index-url https://download.pytorch.org/whl/cpu
    python -m pip install -r ${{ env.SRC_DIR }}/tests/python_tests/requirements.txt
    python -m pip install openvino openvino_tokenizers[transformers] openvino_genai[testing] --extra-index-url=https://storage.openvinotoolkit.org/simple/wheels/nightly
Copilot AI Apr 21, 2026

PR description still contains placeholders (CVS-###, Fixes #(issue)) and the checklist is entirely unchecked. Please update the description to match the PR template (real ticket/issue links, completed checklist, and a brief summary of test coverage for the transformers v5 hooks).

Comment on lines +683 to 684
assert False
llm_pipe = openvino_genai.VLMPipeline(model_path, device.upper(), **ov_config)
Copilot AI Apr 21, 2026

assert False will unconditionally abort VLM pipeline creation, so llm_bench cannot benchmark GenAI VLMs. Please remove this debug assertion (or gate it behind an explicit debug flag) so VLMPipeline(...) can be constructed.

Comment on lines +382 to +386
if trans_version >= version.parse("5.3.0"):
    import llm_bench_utils.llm_hook_sample.hook_sample_v5_3 as hook_sample_v5_3

-   type(model)._sample = hook_sample_v57.new_sample
+   type(model)._sample = hook_sample_v5_3.new_sample
if trans_version >= version.parse("5.0"):
Copilot AI Apr 21, 2026

The version branching uses two independent if statements for >=5.3.0 and >=5.0, so on transformers 5.3+ the 5.3 hook is immediately overwritten by the 5.0 hook. Change the second check to elif (or reorder the conditions) so the most specific hook stays in effect.

Comment on lines +390 to +393
elif trans_version >= version.parse("4.57.0"):
    import tools.llm_bench.llm_bench_utils.llm_hook_sample.hook_sample_v4_57 as hook_sample_v4_57

    type(model)._sample = hook_sample_v4_57.new_sample
Copilot AI Apr 21, 2026

These imports use tools.llm_bench..., but tools/llm_bench is not a Python package (no tools/llm_bench/__init__.py), so this branch will raise ModuleNotFoundError at runtime. Use the same import style as the v5 branches (e.g., llm_bench_utils....) or make tools.llm_bench importable by adding __init__.py files.

Comment on lines +16 to +20
    import tools.llm_bench.llm_bench_utils.llm_hook_beam_search.hook_beam_search_v5_3 as hook_beam_search_v5_3

-   new_beam_search = hook_beam_search_v57.new_beam_search_v57
+   new_beam_search = hook_beam_search_v5_3.new_beam_search
elif version.parse(transformers.__version__) >= version.parse("5.0"):
    import tools.llm_bench.llm_bench_utils.llm_hook_beam_search.hook_beam_search_v5 as hook_beam_search_v5
Copilot AI Apr 21, 2026

All these imports reference tools.llm_bench..., but tools/llm_bench is not a Python package (missing tools/llm_bench/__init__.py), so importing this module will fail. Please switch these imports back to the importable path (likely llm_bench_utils.llm_hook_beam_search...) or add the necessary __init__.py files to make tools.llm_bench a real package.

Comment on lines +721 to +723
cmd: |

    tests/python_tests/samples
Copilot AI Apr 21, 2026

matrix.test.cmd is defined as a block scalar with blank lines, which will embed newlines into the value. Since this is interpolated into python -m pytest ... ${{ matrix.test.cmd }}, the shell will treat it as multiple commands and likely fail. Use a single-line string like cmd: 'tests/python_tests/samples' (consistent with the other sample entries).

Suggested change:
- cmd: |
-     tests/python_tests/samples
+ cmd: 'tests/python_tests/samples'

  timeout: 60
- name: 'VLM (qwen3-vl)'
  cmd: |
    python -m pip install transformers==4.57.0
Copilot AI Apr 21, 2026

This job installs transformers==4.57.0 twice (once alone, then again together with optimum-intel). The first install is redundant and increases CI time; keep only the combined install line.

Suggested change:
- python -m pip install transformers==4.57.0

Copilot AI review requested due to automatic review settings April 22, 2026 12:58

Copilot AI left a comment


Pull request overview

Copilot reviewed 61 out of 63 changed files in this pull request and generated 5 comments.

Comment on lines +133 to +161
# Assisted generation completes the prefill stage in candidate generator so that
# we don't have several `prefill` calls in one generation loop. Skip `_prefill` for assistants
if not generation_config.is_assistant:
    outputs = self._prefill(input_ids, generation_config, model_kwargs)
    prefill_consumed = False
else:
    model_kwargs = self._get_initial_cache_position(input_ids.shape[1], input_ids.device, model_kwargs)
    prefill_consumed = True

while self._has_unfinished_sequences(this_peer_finished, synced_gpus, device=input_ids.device):
    if prefill_consumed:
        model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
        with self._optimize_model_for_decode():
            outputs = model_forward(**model_inputs, return_dict=True)
    prefill_consumed = True
    model_kwargs = self._update_model_kwargs_for_generation(
        outputs,
        model_kwargs,
        is_encoder_decoder=self.config.is_encoder_decoder,
    )
    if synced_gpus and this_peer_finished:
        continue

    # Copy is needed to avoid keeping a hanging ref to outputs.logits which may be very large for first iteration
    # (the clone itself is always small)
    next_token_logits = outputs.logits[:, -1, :].to(copy=True, dtype=torch.float32, device=input_ids.device)

    # pre-process distribution
    next_token_scores = logits_processor(input_ids, next_token_logits)
Copilot AI Apr 22, 2026

This transformers v5 _sample hook no longer records per-token latency/infer timing into hook_greedy.tm_list / hook_greedy.tm_infer_list (unlike the v4.x hook implementations). As a result, llm_bench metrics that depend on these lists will be empty/incorrect for sampling on transformers v5. Please add the same timing collection used in hook_sample_v4_57.py (or adjust the benchmark to not expect these lists for v5).

Comment on lines +133 to +162
prefill_consumed = False
outputs = self._prefill(
    input_ids,
    generation_config,
    model_kwargs,
    is_first_iteration=not generation_config.is_assistant,
)

while self._has_unfinished_sequences(this_peer_finished, synced_gpus, device=input_ids.device):
    if prefill_consumed:
        next_sequence_length = 1 if model_kwargs["use_cache"] else None
        model_inputs = self.prepare_inputs_for_generation(
            input_ids, next_sequence_length=next_sequence_length, **model_kwargs
        )
        with self._optimize_model_for_decode():
            outputs = model_forward(**model_inputs, return_dict=True)
    prefill_consumed = True
    model_kwargs = self._update_model_kwargs_for_generation(
        outputs,
        model_kwargs,
        is_encoder_decoder=self.config.is_encoder_decoder,
    )
    if synced_gpus and this_peer_finished:
        continue

    # Copy is needed to avoid keeping a hanging ref to outputs.logits which may be very large for first iteration
    # (the clone itself is always small)
    next_token_logits = outputs.logits[:, -1, :].to(copy=True, dtype=torch.float32, device=input_ids.device)

    # pre-process distribution

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as hook_sample_v5.py: this transformers v5.3+ _sample hook does not append step timings to hook_greedy.tm_list / hook_greedy.tm_infer_list, so llm_bench token latency metrics for sampling will be missing/incorrect on transformers v5.3+.

Comment on lines +714 to +716
cmd: |

    tests/python_tests/samples
Copilot AI Apr 22, 2026

The multiline cmd: | value includes a leading blank line, so matrix.test.cmd will contain a newline. When interpolated into the python -m pytest ... ${{ matrix.test.cmd }} command, this can split the shell command and fail the job. Use a single-line string like the other entries (e.g. cmd: 'tests/python_tests/samples').

Suggested change:
- cmd: |
-     tests/python_tests/samples
+ cmd: 'tests/python_tests/samples'

Comment on lines +13 to +16
# optimum-intel[nncf]>=1.25.0,<=1.27.0
# git+https://github.com/huggingface/optimum-intel.git@423b4237df4d51efe077675eb1c531ffe4c82c83#egg=optimum-intel
git+https://github.com/huggingface/optimum-intel.git@refs/pull/1692/head#egg=optimum-intel
packaging>=20.0,<=26.0
Copilot AI Apr 22, 2026

Using a moving git ref (refs/pull/1692/head) makes builds non-reproducible and can break CI at any time if the PR branch changes or is deleted. Prefer pinning to an immutable commit SHA (or a released version) for optimum-intel.

sentence-transformers>=2.2.2,<=5.3.0
openvino-genai
optimum-intel[nncf]>=1.19.0,<=1.27.0
optimum-intel @ git+https://github.com/huggingface/optimum-intel.git@refs/pull/1692/head
Copilot AI Apr 22, 2026

This requirements entry pins optimum-intel to a mutable PR ref (refs/pull/1692/head). For stable/reproducible installs (especially in CI), pin to a specific commit SHA or a released version instead.
