Add per-stage performance metrics for WhisperPipeline and sampling duration for LLMPipeline (#3669)
Conversation
Pull request overview
This PR extends GenAI performance instrumentation: it adds sampling-stage duration tracking to the common PerfMetrics, adds encoder/decoder inference durations to WhisperPerfMetrics, and updates llm_bench and the Python bindings/tests to surface and validate the new metrics.
Changes:
- Added `RawPerfMetrics::m_sampling_durations` + `PerfMetrics::get_sampling_duration()` and collected sampling timings across the static LLM, SDPA, continuous batching, and Whisper pipelines.
- Added `WhisperRawPerfMetrics::{encode_inference_durations, decode_inference_durations}` + corresponding `WhisperPerfMetrics` getters/statistics.
- Updated `llm_bench` to print per-stage Whisper latencies and extended the Python tests to validate the new metrics.
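The file table below indicates the same pattern is applied in each backend: each `sampler.sample()` call is wrapped with timestamps and the elapsed time is appended to a raw-durations list. A minimal Python sketch of that pattern (names here are illustrative, not the actual GenAI C++ code):

```python
import time

class RawMetrics:
    """Stand-in for the raw perf counters; one entry per sample() call."""
    def __init__(self):
        self.sampling_durations = []

def timed_sample(sampler, logits, raw):
    # Wrap the sampling call with a monotonic-clock timestamp pair and
    # record the elapsed time in microseconds, parallel to per-token entries.
    start = time.perf_counter()
    token = sampler(logits)  # stands in for sampler.sample(...)
    raw.sampling_durations.append((time.perf_counter() - start) * 1e6)
    return token

raw = RawMetrics()
greedy = lambda logits: max(range(len(logits)), key=logits.__getitem__)
for step in range(3):
    timed_sample(greedy, [0.1, 0.7, 0.2], raw)

print(len(raw.sampling_durations))  # → 3
```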
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tools/llm_bench/task/speech_to_text_generation.py | Extracts Whisper per-stage metrics (tokenization/features/encode/decode/sampling/detokenization) for reporting. |
| tools/llm_bench/llm_bench_utils/metrics_print.py | Adds whisper_genai reporting path and a new per-stage Whisper latency printer. |
| tests/python_tests/test_whisper_pipeline.py | Extends Whisper perf metrics test to assert encode/decode/sampling metrics are present and consistent with raw counters. |
| tests/python_tests/test_llm_pipeline.py | Extends LLM perf metrics test to validate sampling duration statistics vs raw counters. |
| src/python/py_whisper_pipeline.cpp | Exposes new Whisper raw metrics fields + new WhisperPerfMetrics getters to Python. |
| src/python/py_perf_metrics.cpp | Exposes sampling_durations and get_sampling_duration() to Python. |
| src/python/openvino_genai/py_openvino_genai.pyi | Updates Python stubs for new perf metrics APIs/properties. |
| src/cpp/src/whisper/whisper_utils.hpp | Adds helpers to record extra per-infer durations and to filter additional per-token metrics. |
| src/cpp/src/whisper/whisper_utils.cpp | Implements new helpers (but currently contains a duplicate function definition causing a compile error). |
| src/cpp/src/whisper/pipeline_static.cpp | Collects encode/decode inference durations and sampling durations in Whisper static pipeline. |
| src/cpp/src/whisper/perf_metrics.cpp | Computes mean/std for new Whisper encode/decode inference duration metrics and merges them in operator+. |
| src/cpp/src/perf_metrics.cpp | Computes sampling duration statistics and concatenates sampling durations in PerfMetrics::operator+. |
| src/cpp/src/lm_encoding.cpp | Tracks sampling duration around sampler.sample() in SDPA backend. |
| src/cpp/src/llm/pipeline_static.cpp | Tracks sampling duration around m_sampler.sample() in static LLM pipeline. |
| src/cpp/src/continuous_batching/pipeline_impl.cpp | Records sampling duration per step and stores it into raw perf counters. |
| src/cpp/include/openvino/genai/whisper_pipeline.hpp | Adds new Whisper raw perf counters and WhisperPerfMetrics getters/fields. |
| src/cpp/include/openvino/genai/perf_metrics.hpp | Adds sampling durations to raw metrics and exposes get_sampling_duration(). |
| src/cpp/include/openvino/genai/continuous_batching_pipeline.hpp | Extends pipeline metrics with per-step sampling duration for continuous batching. |
cc @sbalandi
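Per the table rows for `src/cpp/src/perf_metrics.cpp` and `whisper/perf_metrics.cpp`, the new getters summarize the raw per-step durations as mean/std statistics. A hedged sketch of that aggregation (assuming raw samples in microseconds and reported values in milliseconds, as with the other `get_*_duration()` getters):

```python
import statistics

def summarize(durations_us):
    # Convert raw microsecond samples to milliseconds, then return the
    # (mean, std) pair that a MeanStdPair-style getter would report.
    ms = [d / 1000.0 for d in durations_us]
    mean = statistics.fmean(ms)
    std = statistics.pstdev(ms) if len(ms) > 1 else 0.0
    return mean, std

mean, std = summarize([1200.0, 800.0, 1000.0])
print(mean)  # → 1.0 (ms)
```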
:param grammar_compile_times: Time to compile the grammar in milliseconds.
:type grammar_compile_times: list[float]

:param sampling_durations: Time spent in the sampler per sampling step in microseconds. One entry per sampler.sample() call, parallel to token_infer_durations and m_batch_sizes.
:type sampling_durations: list[float]
)";
common_bindings::utils::get_ms() (in src/bindings_utils.hpp) returns duration.count() for MicroSeconds, i.e., raw values are exposed in microseconds. Since this docstring block is being updated, please align the units for all raw duration lists in this docstring (many currently say “milliseconds”) or change the binding helper to actually convert to ms to avoid misleading Python docs.
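The factor-of-1000 mismatch described here can be illustrated with `datetime.timedelta` standing in for `std::chrono` durations (helper names below are illustrative, not the actual binding code):

```python
from datetime import timedelta

def get_us_count(td: timedelta) -> float:
    # Analogous to returning MicroSeconds::count() directly:
    # the value stays in raw microseconds.
    return td / timedelta(microseconds=1)

def get_ms(td: timedelta) -> float:
    # The conversion the docstring implies: actual milliseconds.
    return td / timedelta(milliseconds=1)

d = timedelta(milliseconds=2.5)
print(get_us_count(d), get_ms(d))  # → 2500.0 2.5, off by a factor of 1000
```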
It's a pre-existing issue in the file. I can address it, but for consistency it should be done for the entire file, in a separate PR.
@sbalandi Do I need to address this? I would do this for the entire file in a separate PR.
@as-suvorov @eshiryae could you please take a look at the Whisper side?
@sbalandi Could you please re-run the failing checks? The failures are not related to my changes.
:param grammar_compile_times: Time to compile the grammar in milliseconds.
:type grammar_compile_times: list[float]

:param sampling_durations: Time spent in the sampler per sampling step in milliseconds. One entry per sampler.sample() call, parallel to token_infer_durations and m_batch_sizes.
RawPerfMetrics.sampling_durations is documented here as milliseconds, but the pybind helper common_bindings::utils::get_ms returns MicroSeconds::count() without dividing by 1000, so the exposed list is in microseconds. Please adjust the stub docstring (or the binding conversion) so the units match the actual values users see in Python.
Suggested change:
- :param sampling_durations: Time spent in the sampler per sampling step in milliseconds. One entry per sampler.sample() call, parallel to token_infer_durations and m_batch_sizes.
+ :param sampling_durations: Time spent in the sampler per sampling step in microseconds. One entry per sampler.sample() call, parallel to token_infer_durations and m_batch_sizes.
Description
Enables llm_bench to report separate latencies for each WhisperPipeline stage and adds sampling duration tracking across all LLM pipeline backends.
Changes
Built locally and verified.
Closes: #3320
Checklist: