
Add per-stage performance metrics for WhisperPipeline and sampling duration for LLMPipeline#3669

Open
goyaladitya05 wants to merge 29 commits into openvinotoolkit:master from goyaladitya05:feature/whisper-sampling-perf-metrics

Conversation

@goyaladitya05 (Contributor) commented Apr 8, 2026

Description

Enables llm_bench to report separate latencies for each WhisperPipeline stage and adds sampling duration tracking across all LLM pipeline backends.

Changes

  • PerfMetrics: added m_sampling_durations and get_sampling_duration(), collected for WhisperPipeline, static LLM, continuous batching, and SDPA pipelines
  • WhisperPerfMetrics: added encode_inference_durations / decode_inference_durations and get_encode_inference_duration / get_decode_inference_duration
  • llm_bench: reports per-stage Whisper latencies for GenAI pipelines:
    1. First token: tokenization, feature extraction, encode, decode
    2. Second token: decode
    3. Other tokens: decode avg
    4. Sampling: single avg across all tokens
    5. Detokenization
  • llm_bench hooks: added tm_sample_list to hook_greedy_search.py and all llm_hook_sample/hook_sample_v*.py files to time only the argmax/multinomial step.
  • test_perf_metrics in test_whisper_pipeline.py and test_llm_pipeline.py asserts new metrics are collected and consistent with their counters
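The measurement boundary the hooks use can be sketched as follows. This is an illustrative example, not the actual hook code; only `tm_sample_list` is a name taken from this PR, and the greedy argmax stands in for the real sampling step:

```python
import time

tm_sample_list = []  # name taken from this PR's llm_bench hook change

def sample_next_token(logits):
    # Time only the token-selection (argmax) step, not the model forward
    # pass, mirroring the measurement boundary described above.
    start = time.perf_counter()
    next_token = max(range(len(logits)), key=logits.__getitem__)  # greedy argmax
    tm_sample_list.append(time.perf_counter() - start)
    return next_token

token = sample_next_token([0.1, 2.5, -1.0, 0.3])
print(token)  # 1
```

Keeping the timer tight around token selection is what lets the benchmark report sampling latency separately from inference latency.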

Built locally and verified.

Closes: #3320

Checklist:

  • This PR follows GenAI Contributing guidelines.
  • Tests have been updated or added to cover the new code.
  • This PR fully addresses the ticket.
  • I have made corresponding changes to the documentation. N/A: could not find existing documentation for this feature; updated the docstrings wherever applicable.

Copilot AI review requested due to automatic review settings April 8, 2026 11:50
@github-actions bot added labels Apr 8, 2026: category: llm_bench (tool/llm_bench folder), category: visual language, category: continuous batching, category: LLM (stateful, static), category: whisper, category: Python API, category: CPP API (GenAI C++ public headers), no-match-files, category: GGUF (GGUF file reader)
Copilot AI (Contributor) left a comment

Pull request overview

This PR extends GenAI performance instrumentation by adding sampling-stage duration tracking to common PerfMetrics, and adds encoder/decoder inference durations to WhisperPerfMetrics, then updates llm_bench and Python bindings/tests to surface and validate these metrics.

Changes:

  • Added RawPerfMetrics::m_sampling_durations + PerfMetrics::get_sampling_duration() and collected sampling timings across static LLM, SDPA, continuous batching, and Whisper pipelines.
  • Added WhisperRawPerfMetrics::{encode_inference_durations, decode_inference_durations} + corresponding WhisperPerfMetrics getters/statistics.
  • Updated llm_bench to print per-stage Whisper latencies and extended Python tests to validate the new metrics.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

Summary per file:

  • tools/llm_bench/task/speech_to_text_generation.py — extracts Whisper per-stage metrics (tokenization/features/encode/decode/sampling/detokenization) for reporting.
  • tools/llm_bench/llm_bench_utils/metrics_print.py — adds a whisper_genai reporting path and a new per-stage Whisper latency printer.
  • tests/python_tests/test_whisper_pipeline.py — extends the Whisper perf metrics test to assert encode/decode/sampling metrics are present and consistent with raw counters.
  • tests/python_tests/test_llm_pipeline.py — extends the LLM perf metrics test to validate sampling duration statistics against raw counters.
  • src/python/py_whisper_pipeline.cpp — exposes the new Whisper raw metrics fields and WhisperPerfMetrics getters to Python.
  • src/python/py_perf_metrics.cpp — exposes sampling_durations and get_sampling_duration() to Python.
  • src/python/openvino_genai/py_openvino_genai.pyi — updates Python stubs for the new perf metrics APIs/properties.
  • src/cpp/src/whisper/whisper_utils.hpp — adds helpers to record extra per-infer durations and to filter additional per-token metrics.
  • src/cpp/src/whisper/whisper_utils.cpp — implements the new helpers (but currently contains a duplicate function definition causing a compile error).
  • src/cpp/src/whisper/pipeline_static.cpp — collects encode/decode inference durations and sampling durations in the Whisper static pipeline.
  • src/cpp/src/whisper/perf_metrics.cpp — computes mean/std for the new Whisper encode/decode inference duration metrics and merges them in operator+.
  • src/cpp/src/perf_metrics.cpp — computes sampling duration statistics and concatenates sampling durations in PerfMetrics::operator+.
  • src/cpp/src/lm_encoding.cpp — tracks sampling duration around sampler.sample() in the SDPA backend.
  • src/cpp/src/llm/pipeline_static.cpp — tracks sampling duration around m_sampler.sample() in the static LLM pipeline.
  • src/cpp/src/continuous_batching/pipeline_impl.cpp — records sampling duration per step and stores it in raw perf counters.
  • src/cpp/include/openvino/genai/whisper_pipeline.hpp — adds new Whisper raw perf counters and WhisperPerfMetrics getters/fields.
  • src/cpp/include/openvino/genai/perf_metrics.hpp — adds sampling durations to raw metrics and exposes get_sampling_duration().
  • src/cpp/include/openvino/genai/continuous_batching_pipeline.hpp — extends pipeline metrics with per-step sampling duration for continuous batching.
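The aggregation described for perf_metrics.cpp (mean/std over a raw per-call duration list) can be sketched in Python. This is illustrative only, not the actual C++ implementation; the function name and the choice of population standard deviation are assumptions:

```python
import statistics

def summarize(durations):
    """Mean and standard deviation over a raw per-call duration list,
    sketching the aggregation perf_metrics.cpp performs for the new
    counters (illustrative, not the actual C++ code)."""
    if not durations:
        return (0.0, 0.0)
    mean = statistics.fmean(durations)
    std = statistics.pstdev(durations)  # population std; 0.0 for one sample
    return (mean, std)

print(summarize([1.0, 3.0]))  # (2.0, 1.0)
```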

Comment thread src/cpp/src/whisper/whisper_utils.cpp Outdated
Comment thread src/cpp/include/openvino/genai/perf_metrics.hpp
Copilot AI review requested due to automatic review settings April 8, 2026 12:03
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

Comment thread src/cpp/src/whisper/pipeline_static.cpp
Comment thread src/cpp/include/openvino/genai/whisper_pipeline.hpp
Copilot AI review requested due to automatic review settings April 8, 2026 19:15
@goyaladitya05
Contributor Author

cc @sbalandi

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 1 comment.

Comment thread src/cpp/src/llm/pipeline_static.cpp Outdated
Comment thread tools/llm_bench/task/speech_to_text_generation.py
Copilot AI review requested due to automatic review settings April 11, 2026 07:28
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.

Comment thread tools/llm_bench/llm_bench_utils/hook_forward_whisper.py Outdated
Comment thread tools/llm_bench/llm_bench_utils/hook_forward_whisper.py Outdated
Comment thread tools/llm_bench/llm_bench_utils/hook_forward_whisper.py Outdated
Comment on lines 60 to 65
:param grammar_compile_times: Time to compile the grammar in milliseconds.
:type grammar_compile_times: list[float]

:param sampling_durations: Time spent in the sampler per sampling step in microseconds. One entry per sampler.sample() call, parallel to token_infer_durations and m_batch_sizes.
:type sampling_durations: list[float]
)";
Copilot AI commented Apr 11, 2026

common_bindings::utils::get_ms() (in src/bindings_utils.hpp) returns duration.count() for MicroSeconds, i.e., raw values are exposed in microseconds. Since this docstring block is being updated, please align the units for all raw duration lists in this docstring (many currently say “milliseconds”) or change the binding helper to actually convert to ms to avoid misleading Python docs.

@goyaladitya05 (Contributor, Author) commented Apr 11, 2026

This is a pre-existing issue in the file. I can address it, but for consistency it should be done for the entire file, in a separate PR.

@goyaladitya05 (Contributor, Author) commented Apr 14, 2026

@sbalandi Do I need to address this? I would do it for the entire file in a separate PR.

@goyaladitya05 goyaladitya05 requested a review from sbalandi April 11, 2026 11:10
@sbalandi sbalandi requested a review from eshiryae April 14, 2026 09:09
@sbalandi (Contributor)

@as-suvorov @eshiryae could you please take a look at the Whisper side?

@goyaladitya05 (Contributor, Author)

@sbalandi Could you please re-run the failing checks? They are not related to my changes.

Comment thread tools/llm_bench/llm_bench_utils/hook_forward_whisper.py Outdated
Comment thread tools/llm_bench/task/speech_to_text_generation.py Outdated
Copilot AI review requested due to automatic review settings April 19, 2026 05:53
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 2 comments.

Comment thread tools/llm_bench/llm_bench_utils/hook_forward_whisper.py
Comment thread tools/llm_bench/task/speech_to_text_generation.py
Copilot AI review requested due to automatic review settings April 19, 2026 06:05
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 2 comments.

Comment thread tools/llm_bench/llm_bench_utils/metrics_print.py
:param grammar_compile_times: Time to compile the grammar in milliseconds.
:type grammar_compile_times: list[float]

:param sampling_durations: Time spent in the sampler per sampling step in milliseconds. One entry per sampler.sample() call, parallel to token_infer_durations and m_batch_sizes.
Copilot AI commented Apr 19, 2026

RawPerfMetrics.sampling_durations is documented here as milliseconds, but the pybind helper common_bindings::utils::get_ms returns MicroSeconds::count() without dividing by 1000, so the exposed list is in microseconds. Please adjust the stub docstring (or the binding conversion) so the units match the actual values users see in Python.

Suggested change
:param sampling_durations: Time spent in the sampler per sampling step in milliseconds. One entry per sampler.sample() call, parallel to token_infer_durations and m_batch_sizes.
:param sampling_durations: Time spent in the sampler per sampling step in microseconds. One entry per sampler.sample() call, parallel to token_infer_durations and m_batch_sizes.
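The unit concern can be sanity-checked with a trivial conversion (plain Python, not the actual binding code; the helper name is illustrative): if the raw values are exposed as microsecond counts, callers must divide by 1000 to get the milliseconds the docstring promises.

```python
def us_to_ms(durations_us):
    """Convert raw per-step durations from microseconds to milliseconds.
    Illustrative helper, not part of the GenAI API: shows the factor-of-1000
    gap between the documented unit (ms) and raw microsecond counts."""
    return [d / 1000.0 for d in durations_us]

print(us_to_ms([1500.0, 250.0]))  # [1.5, 0.25]
```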


@goyaladitya05 goyaladitya05 requested a review from sbalandi April 19, 2026 06:18


Development

Successfully merging this pull request may close these issues.

Expand WhisperPipeline performance metrics and add support in llm_bench
