
[llm_bench] Add hooks for transformers v5 #3696

Draft
sbalandi wants to merge 5 commits into openvinotoolkit:master from sbalandi:hook_tr5

Conversation

@sbalandi
Contributor

Description

CVS-###

Fixes #(issue)

Checklist:

  • This PR follows GenAI Contributing guidelines.
  • Tests have been updated or added to cover the new code.
  • This PR fully addresses the ticket.
  • I have made corresponding changes to the documentation.

@sbalandi sbalandi requested a review from akashchi as a code owner April 13, 2026 12:50
Copilot AI review requested due to automatic review settings April 13, 2026 12:50
@github-actions bot added labels category: llm_bench (Label for tool/llm_bench folder) and category: GHA (CI based on Github actions) on Apr 13, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR updates llm_bench’s Transformers-generation hooks to cover Transformers v5 while reorganizing existing v4.x hook implementations and extending CI to exercise WWB tests against newer Transformers versions.

Changes:

  • Replaced v4.x hook files named *_v43/v45/v51/v52/v55 with version-explicit *_v4_43/.../v4_57 modules, and added new v5 hook modules (hook_sample_v5.py, hook_sample_v5_3.py).
  • Updated llm_bench hook dispatch logic to select hook implementations based on transformers.__version__.
  • Updated Linux CI workflow to run WWB tests with Transformers 5.2.0 and 5.5.0.

Reviewed changes

Copilot reviewed 25 out of 27 changed files in this pull request and generated 8 comments.

File Description
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v55.py Removed legacy v4.55 sampling hook module (old naming).
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v52.py Removed legacy v4.52 sampling hook module (old naming).
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v51.py Removed legacy v4.51 sampling hook module (old naming).
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v45.py Removed legacy v4.45 sampling hook module (old naming).
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v43.py Removed legacy v4.43 sampling hook module (old naming).
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v4_57.py Added v4.57 sampling hook module under version-explicit naming.
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v4_55.py Added v4.55 sampling hook module under version-explicit naming.
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v4_52.py Added v4.52 sampling hook module under version-explicit naming.
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v4_51.py Added v4.51 sampling hook module under version-explicit naming.
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v4_45.py Added v4.45 sampling hook module under version-explicit naming.
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v4_43.py Added v4.43 sampling hook module under version-explicit naming.
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v5.py Added Transformers v5.0–v5.2 sampling hook module.
tools/llm_bench/llm_bench_utils/llm_hook_sample/hook_sample_v5_3.py Added Transformers v5.3+ sampling hook module.
tools/llm_bench/llm_bench_utils/llm_hook_beam_search/hook_beam_search_v55.py Removed legacy v4.55 beam-search hook module (old naming).
tools/llm_bench/llm_bench_utils/llm_hook_beam_search/hook_beam_search_v52.py Removed legacy v4.52 beam-search hook module (old naming).
tools/llm_bench/llm_bench_utils/llm_hook_beam_search/hook_beam_search_v51.py Removed legacy v4.51 beam-search hook module (old naming).
tools/llm_bench/llm_bench_utils/llm_hook_beam_search/hook_beam_search_v40.py Removed legacy v4.40 beam-search hook module (old naming).
tools/llm_bench/llm_bench_utils/llm_hook_beam_search/hook_beam_search_v4_57.py Added v4.57 beam-search hook module under version-explicit naming.
tools/llm_bench/llm_bench_utils/llm_hook_beam_search/hook_beam_search_v4_55.py Added v4.55 beam-search hook module under version-explicit naming.
tools/llm_bench/llm_bench_utils/llm_hook_beam_search/hook_beam_search_v4_52.py Added v4.52 beam-search hook module under version-explicit naming.
tools/llm_bench/llm_bench_utils/llm_hook_beam_search/hook_beam_search_v4_51.py Added v4.51 beam-search hook module under version-explicit naming.
tools/llm_bench/llm_bench_utils/llm_hook_beam_search/hook_beam_search_v4_40.py Added v4.40 beam-search hook module under version-explicit naming.
tools/llm_bench/llm_bench_utils/hook_greedy_search.py Updated version-based hook selection to include Transformers v5 hooks.
tools/llm_bench/llm_bench_utils/hook_beam_search.py Updated version-based beam-search hook selection to include Transformers v5 hooks.
.github/workflows/linux.yml Added WWB CI runs with Transformers 5.2.0 and 5.5.0.


-type(model)._sample = hook_sample_v57.new_sample
+type(model)._sample = hook_sample_v5_3.new_sample
 if trans_version >= version.parse("5.0"):

Copilot AI Apr 13, 2026


Both if trans_version >= 5.3.0 and if trans_version >= 5.0 will run for Transformers >=5.3, so type(model)._sample gets overwritten and the v5.3+ hook is effectively never used. Change the second condition to elif (or otherwise ensure mutually exclusive branching) so Transformers >=5.3 uses hook_sample_v5_3.new_sample as intended.

Suggested change
if trans_version >= version.parse("5.0"):
elif trans_version >= version.parse("5.0"):
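The dispatch problem described above can be reproduced in isolation. A minimal sketch of a mutually exclusive version chain, with stand-in hook names (the function and return strings are illustrative, not part of the PR):

```python
from packaging import version

def select_hook(transformers_version: str) -> str:
    """Pick exactly one hook per version: elif makes the branches exclusive,
    so a 5.3+ install selects the v5.3 hook and is never overwritten."""
    trans_version = version.parse(transformers_version)
    if trans_version >= version.parse("5.3.0"):
        return "hook_sample_v5_3"
    elif trans_version >= version.parse("5.0"):
        return "hook_sample_v5"
    elif trans_version >= version.parse("4.57.0"):
        return "hook_sample_v4_57"
    else:
        return "hook_sample_legacy"

print(select_hook("5.5.0"))   # hook_sample_v5_3
print(select_hook("5.2.0"))   # hook_sample_v5
print(select_hook("4.57.1"))  # hook_sample_v4_57
```

With the second branch left as a plain `if`, both branches would run for 5.3+ and the later assignment would win, which is exactly the overwrite the review flags.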

Comment on lines +390 to 394
elif trans_version >= version.parse("4.57.0"):
import tools.llm_bench.llm_bench_utils.llm_hook_sample.hook_sample_v4_57 as hook_sample_v4_57

type(model)._sample = hook_sample_v4_57.new_sample
elif trans_version >= version.parse("4.55.0"):

Copilot AI Apr 13, 2026


These branches import hook modules via tools.llm_bench..., while the rest of llm_bench imports use the llm_bench_utils... package (and the README examples run python benchmark.py from tools/llm_bench). Using the tools. prefix can raise ModuleNotFoundError depending on the working directory/PYTHONPATH. Please align these imports with the existing llm_bench_utils.llm_hook_sample... style for consistent runtime behavior.

Comment on lines +15 to +18
if version.parse(transformers.__version__) >= version.parse("5.3.0"):
import tools.llm_bench.llm_bench_utils.llm_hook_beam_search.hook_beam_search_v5_3 as hook_beam_search_v5_3

-new_beam_search = hook_beam_search_v57.new_beam_search_v57
+new_beam_search = hook_beam_search_v5_3.new_beam_search

Copilot AI Apr 13, 2026


hook_beam_search.py now imports version-specific implementations via tools.llm_bench.... Most of llm_bench is imported/run as the llm_bench_utils... package (e.g., from tools/llm_bench/benchmark.py), so this can break execution when the repo root isn’t on PYTHONPATH. Prefer the existing llm_bench_utils.llm_hook_beam_search... import style to keep the hook usable in the standard invocation flow.

Comment on lines +142 to +146
while self._has_unfinished_sequences(this_peer_finished, synced_gpus, device=input_ids.device):
if prefill_consumed:
model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
with self._optimize_model_for_decode():
outputs = model_forward(**model_inputs, return_dict=True)

Copilot AI Apr 13, 2026


This Transformers v5 sampling hook is described as collecting latency, but it never records per-iteration total time or forward/inference time into hook_greedy.tm_list / hook_greedy.tm_infer_list (unlike the v4.x hooks). As a result, llm_bench latency metrics will be empty/incorrect for Transformers v5. Please add timing collection around the forward call and the loop body, consistent with the existing v4.x hook implementations.
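A minimal sketch of the kind of per-iteration instrumentation the v4.x hooks collect, assuming the `tm_list`/`tm_infer_list` collectors from `hook_greedy_search` (the forward call is a stub here; names mirror the source but the loop body is illustrative):

```python
import time

tm_list = []        # whole-iteration latency, as hook_greedy.tm_list collects
tm_infer_list = []  # forward-pass-only latency, as hook_greedy.tm_infer_list

def model_forward_stub():
    # Stand-in for `model_forward(**model_inputs, return_dict=True)`.
    time.sleep(0.001)

for _ in range(3):  # stand-in for `while self._has_unfinished_sequences(...)`
    tic = time.perf_counter()
    tic_infer = time.perf_counter()
    model_forward_stub()
    tm_infer_list.append(time.perf_counter() - tic_infer)
    # ... logits processing, token selection, model_kwargs update ...
    tm_list.append(time.perf_counter() - tic)
```

After generation, llm_bench can then report per-token latency from `tm_list` and pure inference time from `tm_infer_list`, as it does for the v4.x paths.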

Comment on lines +144 to +148
model_inputs = self.prepare_inputs_for_generation(
input_ids, next_sequence_length=next_sequence_length, **model_kwargs
)
with self._optimize_model_for_decode():
outputs = model_forward(**model_inputs, return_dict=True)

Copilot AI Apr 13, 2026


This Transformers v5.3+ sampling hook doesn’t append any timings to hook_greedy.tm_list / hook_greedy.tm_infer_list, even though the hook’s purpose is latency collection. That will produce empty/incorrect benchmark latency outputs for Transformers v5.3+. Please add timing collection around the decode forward call and the full iteration, consistent with the v4.x hooks.

Comment on lines +124 to +128
if self.config._attn_implementation == "flash_attention_2" and getattr(
model_kwargs.get("past_key_values"), "is_compileable", False
):
if generation_config.compile_config is None:
generation_config.compile_config = CompileConfig(fullgraph=False)

Copilot AI Apr 13, 2026


CompileConfig is referenced when generation_config.compile_config is None, but it is not imported in this module. If auto-compile is enabled, this will raise a NameError. Import CompileConfig from the appropriate Transformers module (matching the upstream version you copied from) or avoid constructing it here.
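If the construction is kept, one defensive option is to resolve the class lazily so a missing import degrades gracefully instead of raising `NameError`. A hypothetical sketch (the helper is not part of the PR; stdlib names are used in the demo so it runs anywhere):

```python
def optional_import(module: str, attr: str):
    """Return module.attr if importable, else None."""
    try:
        mod = __import__(module, fromlist=[attr])
        return getattr(mod, attr)
    except (ImportError, AttributeError):
        return None

# In the hook this would be:
#   CompileConfig = optional_import("transformers", "CompileConfig")
# and auto-compile setup would be skipped when it resolves to None.

# Stdlib demo:
print(optional_import("json", "dumps") is not None)  # True
print(optional_import("no_such_mod", "X") is None)   # True
```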

Comment thread .github/workflows/linux.yml Outdated
Comment on lines +620 to +624
- name: 'WWB tests'
cmd: 'python -m pytest -v ./tools/who_what_benchmark/tests -m "not nanollava"'
cmd: |
python -m pip install transformers==5.2.0
python -m pip show transformers
python -m pytest -v ./tools/who_what_benchmark/tests -m "not nanollava"

Copilot AI Apr 13, 2026


There are now two matrix entries with the identical name WWB tests, which will show up as duplicate job names in CI and make failures harder to triage. Consider renaming them to include the Transformers version (e.g., WWB tests (transformers 5.2.0) / WWB tests (transformers 5.5.0)).

@@ -618,7 +618,17 @@ jobs:
run_condition: ${{ fromJSON(needs.smart_ci.outputs.affected_components).RAG.test }}
timeout: 30
- name: 'WWB tests'

Copilot AI Apr 13, 2026


PR description still contains placeholders (e.g., CVS-###, Fixes #(issue)) and the checklist is not filled out. Please update the PR description to match the repository PR template/checklist before merging so reviewers can validate scope, tests, and docs updates.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 29 out of 31 changed files in this pull request and generated 7 comments.

Comment on lines +391 to +413
import tools.llm_bench.llm_bench_utils.llm_hook_sample.hook_sample_v4_57 as hook_sample_v4_57

type(model)._sample = hook_sample_v4_57.new_sample
elif trans_version >= version.parse("4.55.0"):
import llm_bench_utils.llm_hook_sample.hook_sample_v55 as hook_sample_v55
model._sample = hook_sample_v55.new_sample.__get__(model, model.__class__)
import tools.llm_bench.llm_bench_utils.llm_hook_sample.hook_sample_v4_55 as hook_sample_v4_55

model._sample = hook_sample_v4_55.new_sample.__get__(model, model.__class__)
elif trans_version >= version.parse('4.52.0'):
import llm_bench_utils.llm_hook_sample.hook_sample_v52 as hook_sample_v52
model._sample = hook_sample_v52.new_sample.__get__(model, model.__class__)
import tools.llm_bench.llm_bench_utils.llm_hook_sample.hook_sample_v4_52 as hook_sample_v4_52

model._sample = hook_sample_v4_52.new_sample.__get__(model, model.__class__)
elif trans_version >= version.parse('4.51.0'):
import llm_bench_utils.llm_hook_sample.hook_sample_v51 as hook_sample_v51
model._sample = hook_sample_v51.new_sample.__get__(model, model.__class__)
import tools.llm_bench.llm_bench_utils.llm_hook_sample.hook_sample_v4_51 as hook_sample_v4_51

model._sample = hook_sample_v4_51.new_sample.__get__(model, model.__class__)
elif trans_version >= version.parse('4.45.0'):
import llm_bench_utils.llm_hook_sample.hook_sample_v45 as hook_sample_v45
model._sample = hook_sample_v45.new_sample.__get__(model, model.__class__)
import tools.llm_bench.llm_bench_utils.llm_hook_sample.hook_sample_v4_45 as hook_sample_v4_45

model._sample = hook_sample_v4_45.new_sample.__get__(model, model.__class__)
elif trans_version >= version.parse('4.43.0'):
import llm_bench_utils.llm_hook_sample.hook_sample_v43 as hook_sample_v43
model._sample = hook_sample_v43.new_sample.__get__(model, model.__class__)
import tools.llm_bench.llm_bench_utils.llm_hook_sample.hook_sample_v4_43 as hook_sample_v4_43

model._sample = hook_sample_v4_43.new_sample.__get__(model, model.__class__)

Copilot AI Apr 14, 2026


These imports use the fully-qualified module path starting with tools.llm_bench..., but tools/llm_bench/README.md instructs running the tool as python benchmark.py from inside tools/llm_bench/. In that execution mode, the repo root (and therefore the tools package) is not on sys.path, so these imports will fail. Prefer importing from llm_bench_utils... consistently (as is done elsewhere in this file) or explicitly ensure the repo root is on PYTHONPATH before using tools.* imports.

Comment on lines 2 to 3
transformers[sentencepiece]>=4.35.2,<5.4.0
sentence-transformers>=2.2.2,<=5.3.0

Copilot AI Apr 14, 2026


This file still pins transformers[sentencepiece] to <5.4.0, but CI is now explicitly testing Transformers 5.5 for WWB. Please align the declared dependency range with the versions you intend to support/test (either relax the upper bound or drop the 5.5 CI run).
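The mismatch is easy to check mechanically with `packaging`; a small sketch (the range is copied from the requirements line quoted above):

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

spec = SpecifierSet(">=4.35.2,<5.4.0")  # declared transformers range
for v in ("5.2.0", "5.5.0"):            # versions CI now installs for WWB
    print(v, Version(v) in spec)
# 5.2.0 True
# 5.5.0 False  <- CI exercises a version outside the declared bound
```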

Comment on lines +141 to +160
while self._has_unfinished_sequences(this_peer_finished, synced_gpus, device=input_ids.device):
if prefill_consumed:
next_sequence_length = 1 if model_kwargs["use_cache"] else None
model_inputs = self.prepare_inputs_for_generation(
input_ids, next_sequence_length=next_sequence_length, **model_kwargs
)
with self._optimize_model_for_decode():
outputs = model_forward(**model_inputs, return_dict=True)
prefill_consumed = True
model_kwargs = self._update_model_kwargs_for_generation(
outputs,
model_kwargs,
is_encoder_decoder=self.config.is_encoder_decoder,
)
if synced_gpus and this_peer_finished:
continue

# Copy is needed to avoid keeping a hanging ref to outputs.logits which may be very large for first iteration
# (the clone itself is always small)
next_token_logits = outputs.logits[:, -1, :].to(copy=True, dtype=torch.float32, device=input_ids.device)

Copilot AI Apr 14, 2026


Same as hook_sample_v5: this v5.3+ hook imports time and llm_bench_utils.hook_greedy_search but does not record any timings. If llm_bench relies on _sample timing for do_sample runs, please add timing collection (total step + forward-pass time) consistent with the v4.x hook implementations.

Comment on lines +382 to +390
if trans_version >= version.parse("5.3.0"):
import llm_bench_utils.llm_hook_sample.hook_sample_v5_3 as hook_sample_v5_3

type(model)._sample = hook_sample_v57.new_sample
type(model)._sample = hook_sample_v5_3.new_sample
if trans_version >= version.parse("5.0"):
import llm_bench_utils.llm_hook_sample.hook_sample_v5 as hook_sample_v5

type(model)._sample = hook_sample_v5.new_sample
elif trans_version >= version.parse("4.57.0"):

Copilot AI Apr 14, 2026


For Transformers >= 5.3.0 this code sets _sample twice because the second condition is an if, not an elif. As written, 5.3+ will first bind hook_sample_v5_3.new_sample and then immediately overwrite it with hook_sample_v5.new_sample, so the v5.3-specific hook is never used. Convert this to a single if/elif/... chain (or match) so only one implementation is selected per version.

Comment on lines +15 to +42
if version.parse(transformers.__version__) >= version.parse("5.3.0"):
import tools.llm_bench.llm_bench_utils.llm_hook_beam_search.hook_beam_search_v5_3 as hook_beam_search_v5_3

new_beam_search = hook_beam_search_v57.new_beam_search_v57
new_beam_search = hook_beam_search_v5_3.new_beam_search
elif version.parse(transformers.__version__) >= version.parse("5.0"):
import tools.llm_bench.llm_bench_utils.llm_hook_beam_search.hook_beam_search_v5 as hook_beam_search_v5

new_beam_search = hook_beam_search_v5.new_beam_search
elif version.parse(transformers.__version__) >= version.parse("4.57.0"):
import tools.llm_bench.llm_bench_utils.llm_hook_beam_search.hook_beam_search_v4_57 as hook_beam_search_v4_57

new_beam_search = hook_beam_search_v4_57.new_beam_search_v57
elif version.parse(transformers.__version__) >= version.parse("4.55.0"):
import llm_bench_utils.llm_hook_beam_search.hook_beam_search_v55 as hook_beam_search_v55
new_beam_search = hook_beam_search_v55.new_beam_search_v55
import tools.llm_bench.llm_bench_utils.llm_hook_beam_search.hook_beam_search_v4_55 as hook_beam_search_v4_55

new_beam_search = hook_beam_search_v4_55.new_beam_search_v55
elif version.parse(transformers.__version__) >= version.parse("4.52.0"):
import llm_bench_utils.llm_hook_beam_search.hook_beam_search_v52 as hook_beam_search_v52
new_beam_search = hook_beam_search_v52.new_beam_search_v52
import tools.llm_bench.llm_bench_utils.llm_hook_beam_search.hook_beam_search_v4_52 as hook_beam_search_v4_52

new_beam_search = hook_beam_search_v4_52.new_beam_search_v52
elif version.parse(transformers.__version__) >= version.parse("4.51.0"):
import llm_bench_utils.llm_hook_beam_search.hook_beam_search_v51 as hook_beam_search_v51
new_beam_search = hook_beam_search_v51.new_beam_search_v51
import tools.llm_bench.llm_bench_utils.llm_hook_beam_search.hook_beam_search_v4_51 as hook_beam_search_v4_51

new_beam_search = hook_beam_search_v4_51.new_beam_search_v51
else:
import llm_bench_utils.llm_hook_beam_search.hook_beam_search_v40 as hook_beam_search_v40
new_beam_search = hook_beam_search_v40.new_beam_search_v40
import tools.llm_bench.llm_bench_utils.llm_hook_beam_search.hook_beam_search_v4_40 as hook_beam_search_v4_40

new_beam_search = hook_beam_search_v4_40.new_beam_search_v40

Copilot AI Apr 14, 2026


All of these imports use tools.llm_bench... module paths. However, tools/llm_bench/README.md documents running the benchmark as python benchmark.py from within tools/llm_bench/, which does not put the repo root on sys.path. In that common invocation, import tools... will raise ModuleNotFoundError. Use llm_bench_utils.llm_hook_beam_search... imports (consistent with the rest of llm_bench) or adjust the entrypoint so the repo root is on PYTHONPATH.

@@ -10,7 +10,8 @@ torch>=2.1.0,<=2.11
transformers[sentencepiece]>=4.40.0,<5.4.0

Copilot AI Apr 14, 2026


This file still constrains transformers[sentencepiece] to <5.4.0, but the PR adds hooks/CI for Transformers v5.5 (see workflow changes) and the new hook module comments claim support beyond 5.4. If the intent is to support 5.5, update the version constraint accordingly; otherwise, CI should not test versions outside the declared range.

Suggested change
transformers[sentencepiece]>=4.40.0,<5.4.0
transformers[sentencepiece]>=4.40.0,<5.6.0

Comment on lines +142 to +150
while self._has_unfinished_sequences(this_peer_finished, synced_gpus, device=input_ids.device):
if prefill_consumed:
model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
with self._optimize_model_for_decode():
outputs = model_forward(**model_inputs, return_dict=True)
prefill_consumed = True
model_kwargs = self._update_model_kwargs_for_generation(
outputs,
model_kwargs,

Copilot AI Apr 14, 2026


These new _sample hooks claim to add latency collection, but there is no timing instrumentation (no time.perf_counter() usage and no writes to hook_greedy.tm_list/tm_infer_list). This looks like a regression compared to the previous v4.x hook implementations where per-step and infer timings were recorded. If sampling latency is still needed for llm_bench metrics, add the same timing updates inside the generation loop and around the model forward call.

Copilot AI review requested due to automatic review settings April 15, 2026 11:10
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 29 out of 31 changed files in this pull request and generated 6 comments.

Comment thread tests/python_tests/requirements.txt Outdated
--extra-index-url https://download.pytorch.org/whl/cpu
diffusers==0.37.1
https://github.com/huggingface/optimum-intel/archive/2c48d6430c265ac259c1b264f3e2c4025cdd7b76.tar.gz#egg=optimum-intel
optimum-intel @ https://github.com/huggingface/optimum-intel.git@transformers-v5

Copilot AI Apr 15, 2026


Using an unpinned VCS dependency (optimum-intel @ ...@transformers-v5) makes CI/test installs non-reproducible and is a supply-chain risk because the branch can change over time. Prefer pinning to an immutable commit SHA or a released tag/version.

Suggested change
optimum-intel @ https://github.com/huggingface/optimum-intel.git@transformers-v5
optimum-intel==1.25.2

Comment thread samples/export-requirements.txt Outdated
Comment on lines +4 to +5
# https://github.com/huggingface/optimum-intel/archive/2c48d6430c265ac259c1b264f3e2c4025cdd7b76.tar.gz#egg=optimum-intel
optimum-intel @ https://github.com/huggingface/optimum-intel.git@transformers-v5

Copilot AI Apr 15, 2026


Using an unpinned VCS dependency (optimum-intel @ ...@transformers-v5) makes installs non-reproducible and is a supply-chain risk because the branch can change over time. Prefer pinning to an immutable commit SHA or a released tag/version.

Suggested change
# https://github.com/huggingface/optimum-intel/archive/2c48d6430c265ac259c1b264f3e2c4025cdd7b76.tar.gz#egg=optimum-intel
optimum-intel @ https://github.com/huggingface/optimum-intel.git@transformers-v5
optimum-intel @ https://github.com/huggingface/optimum-intel/archive/2c48d6430c265ac259c1b264f3e2c4025cdd7b76.tar.gz#egg=optimum-intel

Comment on lines +5 to +6
# optimum-intel[nncf]>=1.19.0,<=1.27.0
optimum-intel @ https://github.com/huggingface/optimum-intel.git@transformers-v5

Copilot AI Apr 15, 2026


Using an unpinned VCS dependency (optimum-intel @ ...@transformers-v5) makes installs non-reproducible and is a supply-chain risk because the branch can change over time. Prefer pinning to an immutable commit SHA or a released tag/version (as is already done elsewhere in this workflow for optimum-intel).

Comment on lines 620 to 632
- name: 'WWB tests'
cmd: 'python -m pytest -v ./tools/who_what_benchmark/tests -m "not nanollava"'
cmd: |
python -m pip install transformers==5.2.0 git+https://github.com/huggingface/optimum-intel.git@transformers-v5
python -m pip show transformers
python -m pytest -v ./tools/who_what_benchmark/tests -m "not nanollava"
run_condition: ${{ fromJSON(needs.smart_ci.outputs.affected_components).WWB.test }}
timeout: 120
- name: 'WWB tests'
cmd: |
python -m pip install transformers==5.5.0 git+https://github.com/huggingface/optimum-intel.git@transformers-v5
python -m pip show transformers
python -m pytest -v ./tools/who_what_benchmark/tests -m "not nanollava"
run_condition: ${{ fromJSON(needs.smart_ci.outputs.affected_components).WWB.test }}

Copilot AI Apr 15, 2026


The matrix now has two entries both named 'WWB tests', which makes CI logs ambiguous. Consider naming them distinctly (e.g., include the transformers version). Also, this installs transformers==5.5.0 while the WWB requirements file currently constrains transformers to <5.4.0; please align the declared constraints with what CI is testing (or vice versa) to avoid confusion about supported versions.

Comment on lines +55 to +149
# Transformers version: v5.3.0, v5.4.0, v5.5.0, v5.5.1, v5.5.2, v5.5.3
# https://github.com/huggingface/transformers/blob/v5.3.0/src/transformers/generation/utils.py#L3068
# Add the function of collecting latency
def new_sample(
self,
input_ids: torch.LongTensor,
logits_processor: LogitsProcessorList,
stopping_criteria: StoppingCriteriaList,
generation_config: GenerationConfig,
synced_gpus: bool = False,
streamer: Optional["BaseStreamer"] = None,
**model_kwargs,
) -> GenerateNonBeamOutput | torch.LongTensor:
r"""
Generates sequences of token ids for models with a language modeling head using **multinomial sampling** and
can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.

Parameters:
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
The sequence used as a prompt for the generation.
logits_processor (`LogitsProcessorList`):
An instance of [`LogitsProcessorList`]. List of instances of class derived from [`LogitsProcessor`]
used to modify the prediction scores of the language modeling head applied at each generation step.
stopping_criteria (`StoppingCriteriaList`):
An instance of [`StoppingCriteriaList`]. List of instances of class derived from [`StoppingCriteria`]
used to tell if the generation loop should stop.
generation_config ([`~generation.GenerationConfig`]):
The generation configuration to be used as parametrization of the decoding method.
synced_gpus (`bool`):
Whether to continue running the while loop until max_length (needed to avoid deadlocking with
`FullyShardedDataParallel` and DeepSpeed ZeRO Stage 3).
streamer (`BaseStreamer`, *optional*):
Streamer object that will be used to stream the generated sequences. Generated tokens are passed
through `streamer.put(token_ids)` and the streamer is responsible for any further processing.
model_kwargs:
Additional model specific kwargs will be forwarded to the `forward` function of the model. If model is
an encoder-decoder model the kwargs should include `encoder_outputs`.

Return:
[`~generation.GenerateDecoderOnlyOutput`], [`~generation.GenerateEncoderDecoderOutput`] or `torch.LongTensor`:
A `torch.LongTensor` containing the generated tokens (default behaviour) or a
[`~generation.GenerateDecoderOnlyOutput`] if `model.config.is_encoder_decoder=False` and
`return_dict_in_generate=True` or a [`~generation.GenerateEncoderDecoderOutput`] if
`model.config.is_encoder_decoder=True`.
"""
# init values
pad_token_id = generation_config._pad_token_tensor
output_attentions = generation_config.output_attentions
output_hidden_states = generation_config.output_hidden_states
output_scores = generation_config.output_scores
output_logits = generation_config.output_logits
return_dict_in_generate = generation_config.return_dict_in_generate
has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria)
do_sample = generation_config.do_sample

# init attention / hidden states / scores tuples
scores = () if (return_dict_in_generate and output_scores) else None
raw_logits = () if (return_dict_in_generate and output_logits) else None
decoder_attentions = () if (return_dict_in_generate and output_attentions) else None
cross_attentions = () if (return_dict_in_generate and output_attentions) else None
decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None

# if model is an encoder-decoder, retrieve encoder attention weights and hidden states
if return_dict_in_generate and self.config.is_encoder_decoder:
encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None
encoder_hidden_states = model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None

# keep track of which sequences are already finished
batch_size = input_ids.shape[0]
this_peer_finished = False
unfinished_sequences = torch.ones(batch_size, dtype=torch.long, device=input_ids.device)

model_forward = (
self.get_compiled_call(generation_config.compile_config)
if self._valid_auto_compile_criteria(model_kwargs, generation_config)
else self.__call__
)

prefill_consumed = False
outputs = self._prefill(
input_ids,
generation_config,
model_kwargs,
is_first_iteration=not generation_config.is_assistant,
)

while self._has_unfinished_sequences(this_peer_finished, synced_gpus, device=input_ids.device):
if prefill_consumed:
next_sequence_length = 1 if model_kwargs["use_cache"] else None
model_inputs = self.prepare_inputs_for_generation(
input_ids, next_sequence_length=next_sequence_length, **model_kwargs
)
with self._optimize_model_for_decode():
outputs = model_forward(**model_inputs, return_dict=True)
prefill_consumed = True

Copilot AI Apr 15, 2026


This hook is documented as adding latency collection, but unlike the existing v4.x hooks it never records per-iteration timing (e.g., updating hook_greedy.tm_infer_list / hook_greedy.tm_list). As a result, llm_bench will not report token/inference latency for transformers v5.3+ paths. Please add the same timing instrumentation used in hook_sample_v4_57.py around the model forward pass and the full loop iteration.

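The timing pattern the review asks to restore can be sketched as follows. This is a minimal stand-alone illustration, not the actual hook code: `model_forward` is a placeholder, and the list names mirror the `hook_greedy.tm_list` / `hook_greedy.tm_infer_list` attributes mentioned in the comment.

```python
import time

# Stand-ins for the per-iteration latency lists the v4.x hooks update
# (names taken from the review comment; hypothetical here).
tm_list = []        # wall time of each full decode iteration
tm_infer_list = []  # wall time of each model forward call


def model_forward(**model_inputs):
    # Placeholder for the real compiled/plain forward call.
    return {"logits": None}


def decode_step(model_inputs):
    # Pattern from the v4.x hooks: time the forward pass and the whole
    # iteration separately, appending to the two lists on every step.
    iter_start = time.perf_counter()
    infer_start = time.perf_counter()
    outputs = model_forward(**model_inputs)
    tm_infer_list.append(time.perf_counter() - infer_start)
    # ... token sampling and model_kwargs update would happen here ...
    tm_list.append(time.perf_counter() - iter_start)
    return outputs


for _ in range(3):
    decode_step({})
```

Wrapping both spans separately is what lets llm_bench report forward-pass latency and whole-iteration latency as distinct metrics.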
Comment thread tools/llm_bench/requirements.txt Outdated
# optimum is in dependency list of optimum-intel
optimum-intel[nncf]>=1.25.0,<=1.27.0
# optimum-intel[nncf]>=1.25.0,<=1.27.0
optimum-intel @ https://github.com/huggingface/optimum-intel.git@transformers-v5
Copilot AI Apr 15, 2026

Using an unpinned VCS dependency (optimum-intel @ ...@transformers-v5) makes installs non-reproducible and is a supply-chain risk because the branch can change over time. Prefer pinning to an immutable commit SHA or a released tag/version.

Suggested change:
- optimum-intel @ https://github.com/huggingface/optimum-intel.git@transformers-v5
+ optimum-intel==1.27.0

Copilot AI review requested due to automatic review settings April 15, 2026 12:17

Copilot AI left a comment


Pull request overview

Copilot reviewed 29 out of 31 changed files in this pull request and generated 1 comment.

Comment on lines +127 to +156
model_forward = (
    self.get_compiled_call(generation_config.compile_config)
    if self._valid_auto_compile_criteria(model_kwargs, generation_config)
    else self.__call__
)

prefill_consumed = False
outputs = self._prefill(
    input_ids,
    generation_config,
    model_kwargs,
    is_first_iteration=not generation_config.is_assistant,
)

while self._has_unfinished_sequences(this_peer_finished, synced_gpus, device=input_ids.device):
    if prefill_consumed:
        next_sequence_length = 1 if model_kwargs["use_cache"] else None
        model_inputs = self.prepare_inputs_for_generation(
            input_ids, next_sequence_length=next_sequence_length, **model_kwargs
        )
        with self._optimize_model_for_decode():
            outputs = model_forward(**model_inputs, return_dict=True)
    prefill_consumed = True
    model_kwargs = self._update_model_kwargs_for_generation(
        outputs,
        model_kwargs,
        is_encoder_decoder=self.config.is_encoder_decoder,
    )
    if synced_gpus and this_peer_finished:
        continue
Copilot AI Apr 15, 2026

The v5.3+ sampling hook is supposed to collect per-step latency (as done in v4.x hooks), but this implementation never appends to hook_greedy.tm_infer_list / hook_greedy.tm_list and therefore breaks latency reporting for do_sample in llm_bench. Add the same timing instrumentation around the decode forward pass and per-iteration loop used in the v4.x hook implementations.

Copilot AI review requested due to automatic review settings April 20, 2026 16:43
@github-actions github-actions Bot added the category: whisper Whisper pipeline label Apr 20, 2026

Copilot AI left a comment


Pull request overview

Copilot reviewed 37 out of 39 changed files in this pull request and generated 3 comments.

Comment thread .github/workflows/linux.yml Outdated
Comment on lines +1071 to +1075
- name: Install dependencies
  uses: ./src/.github/actions/install_wheel
  with:
    packages: openvino;openvino_tokenizers[transformers]
    requirements_files: "${{ env.SRC_DIR }}/tests/python_tests/requirements.txt"
    local_wheel_dir: ${{ env.INSTALL_DIR }}/wheels
  run: |
    python -m pip install -r ${{ env.SRC_DIR }}/tests/python_tests/requirements.txt
    python -m pip install openvino openvino_tokenizers[transformers] --extra-index-url=https://storage.openvinotoolkit.org/simple/wheels/nightly
Copilot AI Apr 20, 2026

This step defines both uses: and run: in the same step, which is invalid GitHub Actions YAML and will fail workflow parsing. Convert this step to either a composite action step (uses + with) or a shell step (run) but not both.

  timeout: 60
- name: 'VLM (qwen3-vl)'
  cmd: |
    python -m pip install transformers==4.57.0
Copilot AI Apr 20, 2026

The qwen3-vl job installs transformers==4.57.0 twice (first alone, then again with the optimum-intel git dependency). Drop the redundant install to reduce time and avoid potential resolver churn.

Suggested change:
- python -m pip install transformers==4.57.0

Comment on lines +482 to 485
@pytest.mark.transformers_lower_v5
@pytest.mark.parametrize("model_descr", get_whisper_models_list(tiny_only=True))
@pytest.mark.parametrize("sample_from_dataset", [{"language": "en", "sample_id": 1}], indirect=True)
@pytest.mark.xfail(condition=(sys.platform == "darwin"), reason="Ticket - 173169")
Copilot AI Apr 20, 2026

@pytest.mark.transformers_lower_v5 is newly introduced but is not registered in tests/python_tests/pytest.ini markers list. Add it there (or reuse an existing marker) to avoid unknown-marker warnings/errors in CI (especially if --strict-markers is enabled).

Copilot AI review requested due to automatic review settings April 21, 2026 18:00
@github-actions github-actions Bot added category: visual language Visual language pipeline category: tokenizers Tokenizer class or submodule update category: LLM samples GenAI LLM samples no-match-files category: GH Pages Docs Github Pages documentation category: VLM samples GenAI VLM samples category: Video generation samples labels Apr 21, 2026

Copilot AI left a comment


Pull request overview

Copilot reviewed 60 out of 62 changed files in this pull request and generated 8 comments.

Comment on lines +185 to 190
if hf_generation_config.num_return_sequences is None:
    num_return_sequences = 1

for idx, hf_encoded_out in enumerate(hf_encoded_outputs.sequences):
-   prompt_idx = idx // hf_generation_config.num_return_sequences
+   prompt_idx = idx // num_return_sequences
    prompt_len = 0 if generation_configs.echo else input_ids[prompt_idx].numel()
Copilot AI Apr 21, 2026

num_return_sequences is only set when hf_generation_config.num_return_sequences is None, but it’s used unconditionally in the loop below. When num_return_sequences is not None, this will raise UnboundLocalError. Initialize it to hf_generation_config.num_return_sequences or 1 before the loop.

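A minimal repro of the `UnboundLocalError` and one possible fix: bind `num_return_sequences` unconditionally before the loop. `Cfg` below is a hypothetical stand-in for `transformers.GenerationConfig`, modeling only the field relevant to the bug.

```python
class Cfg:
    # Hypothetical stand-in for transformers.GenerationConfig.
    num_return_sequences = None  # may also be set to an int


def prompt_index(idx, cfg):
    # Fix: always bind num_return_sequences, falling back to 1 when unset,
    # instead of assigning it only inside an `if ... is None` branch.
    num_return_sequences = cfg.num_return_sequences or 1
    return idx // num_return_sequences


cfg = Cfg()
print(prompt_index(5, cfg))   # unset -> divisor defaults to 1
cfg.num_return_sequences = 2
print(prompt_index(5, cfg))
```

With the original pattern, any config that does set `num_return_sequences` would skip the assignment and hit the unbound local at first use.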
Comment on lines +382 to +390
if trans_version >= version.parse("5.3.0"):
    import llm_bench_utils.llm_hook_sample.hook_sample_v5_3 as hook_sample_v5_3

-   type(model)._sample = hook_sample_v57.new_sample
+   type(model)._sample = hook_sample_v5_3.new_sample
if trans_version >= version.parse("5.0"):
    import llm_bench_utils.llm_hook_sample.hook_sample_v5 as hook_sample_v5

    type(model)._sample = hook_sample_v5.new_sample
elif trans_version >= version.parse("4.57.0"):
Copilot AI Apr 21, 2026

The version checks are not mutually exclusive: for transformers >= 5.3, the first if sets _sample to v5_3, but the following if trans_version >= 5.0 then overwrites it with v5. This makes the v5.3 hook unreachable. Change the second if to elif (or otherwise ensure a single branch runs).

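The fixed dispatch can be sketched as a single `if`/`elif` chain. The hook names mirror the modules in this PR; the tuple-based `_parse` is a simplified stand-in for `packaging.version.parse`, used only to keep the example dependency-free.

```python
def _parse(v: str):
    # Simplified numeric stand-in for packaging.version.parse
    # (assumption: plain dotted release versions only).
    return tuple(int(p) for p in v.split("."))


def pick_sample_hook(transformers_version: str):
    # Mutually exclusive branches: `elif` keeps the most specific hook
    # instead of letting a later `if` overwrite it.
    v = _parse(transformers_version)
    if v >= _parse("5.3.0"):
        return "hook_sample_v5_3"
    elif v >= _parse("5.0"):
        return "hook_sample_v5"
    elif v >= _parse("4.57.0"):
        return "hook_sample_v4_57"
    return None


print(pick_sample_hook("5.4.0"))  # hook_sample_v5_3, no longer overwritten
```

With two independent `if` statements, `pick_sample_hook("5.4.0")` would fall through the first branch's assignment and end up on the v5 hook.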
Comment thread tests/python_tests/test_vlm_pipeline.py Outdated
if is_transformers_version("<", "5.0"):
    processor.audio_tokenizer = None
else:
    processor.pop("audio_tokenizer", None)
Copilot AI Apr 21, 2026

processor is a transformers ProcessorMixin instance, not a dict, so processor.pop("audio_tokenizer", None) will raise AttributeError. If you need to remove this field for transformers>=5, use delattr(processor, "audio_tokenizer") (guarded with hasattr) or set it to None consistently.

Suggested change:
- processor.pop("audio_tokenizer", None)
+ if hasattr(processor, "audio_tokenizer"):
+     delattr(processor, "audio_tokenizer")

Comment on lines +218 to +220
if Version(transformers_version) < "5.0.0":
    VISUAL_VIDEO_TEXT_MODELS += (("optimum-intel-internal-testing/tiny-random-llava-next-video", "visual-video-text"),)

Copilot AI Apr 21, 2026

Version(transformers_version) < "5.0.0" compares a Version object to a string, which raises TypeError in packaging. Compare against Version("5.0.0") (or use packaging.version.parse) so the conditional works at runtime.

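A small demonstration of the failure mode and the correct form (assumes the `packaging` library is installed, as it is wherever transformers runs):

```python
from packaging.version import Version

transformers_version = "4.57.0"  # illustrative value

# Version only compares against other Version objects; comparing against a
# raw string returns NotImplemented and surfaces as TypeError.
try:
    Version(transformers_version) < "5.0.0"
    raised = False
except TypeError:
    raised = True
print("string comparison raised:", raised)

# Correct form: wrap both sides of the comparison.
print(Version(transformers_version) < Version("5.0.0"))
```

Note that the broken comparison only fails when the conditional actually executes, which is why it can slip through a review but break at runtime.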
Comment on lines +140 to +146
def get_hf_cli_command() -> str:
    # huggingface_hub < 1.0 supports huggingface-cli, huggingface_hub >= 1.0 use hf
    version = importlib.metadata.version("huggingface_hub")
    major = int(version.split(".", 1)[0])
    if major >= 1:
        return "hf"
    return "huggingface-cli"
Copilot AI Apr 21, 2026

importlib.metadata isn’t imported here (import importlib alone doesn’t load the metadata submodule), so importlib.metadata.version(...) will raise AttributeError. Use the already-imported metadata (metadata.version(...)) or import importlib.metadata / from importlib import metadata.

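A self-contained sketch of the fix using the explicit submodule import; the `PackageNotFoundError` fallback is an addition so the example runs even where `huggingface_hub` is absent.

```python
from importlib import metadata  # explicit submodule import; `import importlib` alone is not reliable


def get_hf_cli_command(dist: str = "huggingface_hub") -> str:
    # huggingface_hub < 1.0 ships `huggingface-cli`; >= 1.0 ships `hf`.
    try:
        ver = metadata.version(dist)
    except metadata.PackageNotFoundError:
        # Fallback added for this standalone example only.
        return "huggingface-cli"
    major = int(ver.split(".", 1)[0])
    return "hf" if major >= 1 else "huggingface-cli"


print(get_hf_cli_command("surely-not-an-installed-distribution"))  # huggingface-cli
```

Either `from importlib import metadata` or `import importlib.metadata` works; what fails is attribute access on a bare `import importlib`, which only happens to succeed when something else has already imported the submodule.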
Comment on lines +37 to +43
def get_hf_cli_command() -> str:
    # huggingface_hub < 1.0 supports huggingface-cli, huggingface_hub >= 1.0 use hf
    version = importlib.metadata.version("huggingface_hub")
    major = int(version.split(".", 1)[0])
    if major >= 1:
        return "hf"
    return "huggingface-cli"
Copilot AI Apr 21, 2026

importlib.metadata is not imported (only import importlib), so importlib.metadata.version(...) will fail at runtime. Import importlib.metadata / from importlib import metadata, or use metadata.version(...).

Comment thread tests/python_tests/utils/constants.py Outdated
Comment on lines +27 to +30
if (
    is_transformers_version(">=", "5.0")
    and generation_config is not None
    and generation_config.get("num_beam_groups", 1) > 1
Copilot AI Apr 21, 2026

extra_generate_kwargs() treats generation_config as a dict (generation_config.get(...)), but current call sites pass a transformers.GenerationConfig instance. This will raise AttributeError. Either accept HFGenerationConfig and use getattr(generation_config, "num_beam_groups", 1), or pass generation_config.to_dict() from the caller.

Suggested change:
- if (
-     is_transformers_version(">=", "5.0")
-     and generation_config is not None
-     and generation_config.get("num_beam_groups", 1) > 1
+ num_beam_groups = 1
+ if generation_config is not None:
+     if isinstance(generation_config, dict):
+         num_beam_groups = generation_config.get("num_beam_groups", 1)
+     else:
+         num_beam_groups = getattr(generation_config, "num_beam_groups", 1)
+ if (
+     is_transformers_version(">=", "5.0")
+     and generation_config is not None
+     and num_beam_groups > 1
Comment on lines 134 to 140
generate_outputs = opt_model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    generation_config=hf_generation_config,
    tokenizer=hf_tokenizer,
-   **extra_generate_kwargs(),
+   **extra_generate_kwargs(hf_generation_config),
)
Copilot AI Apr 21, 2026

extra_generate_kwargs(hf_generation_config) passes a transformers.GenerationConfig object, but extra_generate_kwargs currently expects a dict-like object (uses .get). This will raise at runtime; either pass hf_generation_config.to_dict() here or change extra_generate_kwargs to handle GenerationConfig.

Copilot AI review requested due to automatic review settings April 21, 2026 22:52

Copilot AI left a comment


Pull request overview

Copilot reviewed 61 out of 63 changed files in this pull request and generated 7 comments.

Comment thread .github/workflows/linux.yml Outdated
Comment on lines +700 to +705
- name: Install GenAI Wheels
  if: ${{ matrix.test.run_condition }}
  uses: ./src/.github/actions/install_wheel
  with:
    packages: "openvino;openvino_tokenizers[transformers];openvino_genai;whowhatbench"
    requirements_files: "${{ env.SRC_DIR }}/tests/python_tests/requirements.txt"
    local_wheel_dir: ${{ env.INSTALL_DIR }}/wheels

  run: |
    python -m pip install ${{ env.INSTALL_DIR }}/wheels/whowhatbench-*.whl --extra-index-url https://download.pytorch.org/whl/cpu
    python -m pip install -r ${{ env.SRC_DIR }}/tests/python_tests/requirements.txt
    python -m pip install openvino openvino_tokenizers[transformers] openvino_genai[testing] --extra-index-url=https://storage.openvinotoolkit.org/simple/wheels/nightly
Copilot AI Apr 21, 2026

PR description still contains placeholders (CVS-###, Fixes #(issue)) and the checklist is entirely unchecked. Please update the description to match the PR template (real ticket/issue links, completed checklist, and a brief summary of test coverage for the transformers v5 hooks).

Comment on lines +683 to 684
assert False
llm_pipe = openvino_genai.VLMPipeline(model_path, device.upper(), **ov_config)
Copilot AI Apr 21, 2026

assert False will unconditionally abort VLM pipeline creation, so llm_bench cannot benchmark GenAI VLMs. Please remove this debug assertion (or gate it behind an explicit debug flag) so VLMPipeline(...) can be constructed.

Comment on lines +382 to +386
if trans_version >= version.parse("5.3.0"):
    import llm_bench_utils.llm_hook_sample.hook_sample_v5_3 as hook_sample_v5_3

-   type(model)._sample = hook_sample_v57.new_sample
+   type(model)._sample = hook_sample_v5_3.new_sample
if trans_version >= version.parse("5.0"):
Copilot AI Apr 21, 2026

The version branching uses two independent if statements for >=5.3.0 and >=5.0, so on transformers 5.3+ the 5.3 hook is immediately overwritten by the 5.0 hook. Change the second check to elif (or reorder the conditions) so the most specific hook stays in effect.

Comment on lines +390 to +393
elif trans_version >= version.parse("4.57.0"):
    import tools.llm_bench.llm_bench_utils.llm_hook_sample.hook_sample_v4_57 as hook_sample_v4_57

    type(model)._sample = hook_sample_v4_57.new_sample
Copilot AI Apr 21, 2026

These imports use tools.llm_bench..., but tools/llm_bench is not a Python package (no tools/llm_bench/__init__.py), so this branch will raise ModuleNotFoundError at runtime. Use the same import style as the v5 branches (e.g., llm_bench_utils....) or make tools.llm_bench importable by adding __init__.py files.

Comment on lines +16 to +20
    import tools.llm_bench.llm_bench_utils.llm_hook_beam_search.hook_beam_search_v5_3 as hook_beam_search_v5_3

-   new_beam_search = hook_beam_search_v57.new_beam_search_v57
+   new_beam_search = hook_beam_search_v5_3.new_beam_search
elif version.parse(transformers.__version__) >= version.parse("5.0"):
    import tools.llm_bench.llm_bench_utils.llm_hook_beam_search.hook_beam_search_v5 as hook_beam_search_v5
Copilot AI Apr 21, 2026

All these imports reference tools.llm_bench..., but tools/llm_bench is not a Python package (missing tools/llm_bench/__init__.py), so importing this module will fail. Please switch these imports back to the importable path (likely llm_bench_utils.llm_hook_beam_search...) or add the necessary __init__.py files to make tools.llm_bench a real package.

Comment on lines +721 to +723
cmd: |

    tests/python_tests/samples
Copilot AI Apr 21, 2026

matrix.test.cmd is defined as a block scalar with blank lines, which will embed newlines into the value. Since this is interpolated into python -m pytest ... ${{ matrix.test.cmd }}, the shell will treat it as multiple commands and likely fail. Use a single-line string like cmd: 'tests/python_tests/samples' (consistent with the other sample entries).

Suggested change:
- cmd: |
-     tests/python_tests/samples
+ cmd: 'tests/python_tests/samples'

  timeout: 60
- name: 'VLM (qwen3-vl)'
  cmd: |
    python -m pip install transformers==4.57.0
Copilot AI Apr 21, 2026

This job installs transformers==4.57.0 twice (once alone, then again together with optimum-intel). The first install is redundant and increases CI time; keep only the combined install line.

Suggested change:
- python -m pip install transformers==4.57.0

Copilot AI review requested due to automatic review settings April 22, 2026 12:58

Copilot AI left a comment


Pull request overview

Copilot reviewed 61 out of 63 changed files in this pull request and generated 5 comments.

Comment on lines +133 to +161
# Assisted generation completes the prefill stage in candidate generator so that
# we don't have several `prefill` calls in one generation loop. Skip `_prefill` for assistants
if not generation_config.is_assistant:
    outputs = self._prefill(input_ids, generation_config, model_kwargs)
    prefill_consumed = False
else:
    model_kwargs = self._get_initial_cache_position(input_ids.shape[1], input_ids.device, model_kwargs)
    prefill_consumed = True

while self._has_unfinished_sequences(this_peer_finished, synced_gpus, device=input_ids.device):
    if prefill_consumed:
        model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
        with self._optimize_model_for_decode():
            outputs = model_forward(**model_inputs, return_dict=True)
    prefill_consumed = True
    model_kwargs = self._update_model_kwargs_for_generation(
        outputs,
        model_kwargs,
        is_encoder_decoder=self.config.is_encoder_decoder,
    )
    if synced_gpus and this_peer_finished:
        continue

    # Copy is needed to avoid keeping a hanging ref to outputs.logits which may be very large for first iteration
    # (the clone itself is always small)
    next_token_logits = outputs.logits[:, -1, :].to(copy=True, dtype=torch.float32, device=input_ids.device)

    # pre-process distribution
    next_token_scores = logits_processor(input_ids, next_token_logits)
Copilot AI Apr 22, 2026

This transformers v5 _sample hook no longer records per-token latency/infer timing into hook_greedy.tm_list / hook_greedy.tm_infer_list (unlike the v4.x hook implementations). As a result, llm_bench metrics that depend on these lists will be empty/incorrect for sampling on transformers v5. Please add the same timing collection used in hook_sample_v4_57.py (or adjust the benchmark to not expect these lists for v5).

Comment on lines +133 to +162
prefill_consumed = False
outputs = self._prefill(
    input_ids,
    generation_config,
    model_kwargs,
    is_first_iteration=not generation_config.is_assistant,
)

while self._has_unfinished_sequences(this_peer_finished, synced_gpus, device=input_ids.device):
    if prefill_consumed:
        next_sequence_length = 1 if model_kwargs["use_cache"] else None
        model_inputs = self.prepare_inputs_for_generation(
            input_ids, next_sequence_length=next_sequence_length, **model_kwargs
        )
        with self._optimize_model_for_decode():
            outputs = model_forward(**model_inputs, return_dict=True)
    prefill_consumed = True
    model_kwargs = self._update_model_kwargs_for_generation(
        outputs,
        model_kwargs,
        is_encoder_decoder=self.config.is_encoder_decoder,
    )
    if synced_gpus and this_peer_finished:
        continue

    # Copy is needed to avoid keeping a hanging ref to outputs.logits which may be very large for first iteration
    # (the clone itself is always small)
    next_token_logits = outputs.logits[:, -1, :].to(copy=True, dtype=torch.float32, device=input_ids.device)

    # pre-process distribution

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as hook_sample_v5.py: this transformers v5.3+ _sample hook does not append step timings to hook_greedy.tm_list / hook_greedy.tm_infer_list, so llm_bench token latency metrics for sampling will be missing/incorrect on transformers v5.3+.

Comment on lines +714 to +716
cmd: |

    tests/python_tests/samples
Copilot AI Apr 22, 2026

The multiline cmd: | value includes a leading blank line, so matrix.test.cmd will contain a newline. When interpolated into the python -m pytest ... ${{ matrix.test.cmd }} command, this can split the shell command and fail the job. Use a single-line string like the other entries (e.g. cmd: 'tests/python_tests/samples').

Suggested change:
- cmd: |
-     tests/python_tests/samples
+ cmd: 'tests/python_tests/samples'

Comment on lines +13 to +16
# optimum-intel[nncf]>=1.25.0,<=1.27.0
# git+https://github.com/huggingface/optimum-intel.git@423b4237df4d51efe077675eb1c531ffe4c82c83#egg=optimum-intel
git+https://github.com/huggingface/optimum-intel.git@refs/pull/1692/head#egg=optimum-intel
packaging>=20.0,<=26.0
Copilot AI Apr 22, 2026

Using a moving git ref (refs/pull/1692/head) makes builds non-reproducible and can break CI at any time if the PR branch changes or is deleted. Prefer pinning to an immutable commit SHA (or a released version) for optimum-intel.

sentence-transformers>=2.2.2,<=5.3.0
openvino-genai
optimum-intel[nncf]>=1.19.0,<=1.27.0
optimum-intel @ git+https://github.com/huggingface/optimum-intel.git@refs/pull/1692/head
Copilot AI Apr 22, 2026

This requirements entry pins optimum-intel to a mutable PR ref (refs/pull/1692/head). For stable/reproducible installs (especially in CI), pin to a specific commit SHA or a released version instead.
