Add per-stage performance metrics for WhisperPipeline and sampling duration for LLMPipeline (#3669)
Conversation
Pull request overview
This PR extends GenAI performance instrumentation: it adds sampling-stage duration tracking to the common PerfMetrics, adds encoder/decoder inference durations to WhisperPerfMetrics, and updates llm_bench and the Python bindings/tests to surface and validate the new metrics.
Changes:
- Added `RawPerfMetrics::m_sampling_durations` + `PerfMetrics::get_sampling_duration()` and collected sampling timings across the static LLM, SDPA, continuous batching, and Whisper pipelines.
- Added `WhisperRawPerfMetrics::{encode_inference_durations, decode_inference_durations}` + corresponding `WhisperPerfMetrics` getters/statistics.
- Updated `llm_bench` to print per-stage Whisper latencies and extended the Python tests to validate the new metrics.
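The file table below indicates the same pattern is applied in each backend: each `sampler.sample()` call is wrapped with timestamps and the elapsed time is appended to a raw-durations list. A minimal Python sketch of that pattern (names here are illustrative, not the actual GenAI C++ code):

```python
import time

class RawMetrics:
    """Stand-in for the raw perf counters; one entry per sample() call."""
    def __init__(self):
        self.sampling_durations = []

def timed_sample(sampler, logits, raw):
    # Wrap the sampling call with a monotonic-clock timestamp pair and
    # record the elapsed time in microseconds, parallel to per-token entries.
    start = time.perf_counter()
    token = sampler(logits)  # stands in for sampler.sample(...)
    raw.sampling_durations.append((time.perf_counter() - start) * 1e6)
    return token

raw = RawMetrics()
greedy = lambda logits: max(range(len(logits)), key=logits.__getitem__)
for step in range(3):
    timed_sample(greedy, [0.1, 0.7, 0.2], raw)

print(len(raw.sampling_durations))  # → 3
```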
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tools/llm_bench/task/speech_to_text_generation.py | Extracts Whisper per-stage metrics (tokenization/features/encode/decode/sampling/detokenization) for reporting. |
| tools/llm_bench/llm_bench_utils/metrics_print.py | Adds whisper_genai reporting path and a new per-stage Whisper latency printer. |
| tests/python_tests/test_whisper_pipeline.py | Extends Whisper perf metrics test to assert encode/decode/sampling metrics are present and consistent with raw counters. |
| tests/python_tests/test_llm_pipeline.py | Extends LLM perf metrics test to validate sampling duration statistics vs raw counters. |
| src/python/py_whisper_pipeline.cpp | Exposes new Whisper raw metrics fields + new WhisperPerfMetrics getters to Python. |
| src/python/py_perf_metrics.cpp | Exposes sampling_durations and get_sampling_duration() to Python. |
| src/python/openvino_genai/py_openvino_genai.pyi | Updates Python stubs for new perf metrics APIs/properties. |
| src/cpp/src/whisper/whisper_utils.hpp | Adds helpers to record extra per-infer durations and to filter additional per-token metrics. |
| src/cpp/src/whisper/whisper_utils.cpp | Implements new helpers (but currently contains a duplicate function definition causing a compile error). |
| src/cpp/src/whisper/pipeline_static.cpp | Collects encode/decode inference durations and sampling durations in Whisper static pipeline. |
| src/cpp/src/whisper/perf_metrics.cpp | Computes mean/std for new Whisper encode/decode inference duration metrics and merges them in operator+. |
| src/cpp/src/perf_metrics.cpp | Computes sampling duration statistics and concatenates sampling durations in PerfMetrics::operator+. |
| src/cpp/src/lm_encoding.cpp | Tracks sampling duration around sampler.sample() in SDPA backend. |
| src/cpp/src/llm/pipeline_static.cpp | Tracks sampling duration around m_sampler.sample() in static LLM pipeline. |
| src/cpp/src/continuous_batching/pipeline_impl.cpp | Records sampling duration per step and stores it into raw perf counters. |
| src/cpp/include/openvino/genai/whisper_pipeline.hpp | Adds new Whisper raw perf counters and WhisperPerfMetrics getters/fields. |
| src/cpp/include/openvino/genai/perf_metrics.hpp | Adds sampling durations to raw metrics and exposes get_sampling_duration(). |
| src/cpp/include/openvino/genai/continuous_batching_pipeline.hpp | Extends pipeline metrics with per-step sampling duration for continuous batching. |
cc @sbalandi
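Per the table rows for `src/cpp/src/perf_metrics.cpp` and `whisper/perf_metrics.cpp`, the new getters summarize the raw per-step durations as mean/std statistics. A hedged sketch of that aggregation (assuming raw samples in microseconds and reported values in milliseconds, as with the other `get_*_duration()` getters):

```python
import statistics

def summarize(durations_us):
    # Convert raw microsecond samples to milliseconds, then return the
    # (mean, std) pair that a MeanStdPair-style getter would report.
    ms = [d / 1000.0 for d in durations_us]
    mean = statistics.fmean(ms)
    std = statistics.pstdev(ms) if len(ms) > 1 else 0.0
    return mean, std

mean, std = summarize([1200.0, 800.0, 1000.0])
print(mean)  # → 1.0 (ms)
```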
:param grammar_compile_times: Time to compile the grammar in milliseconds.
:type grammar_compile_times: list[float]

:param sampling_durations: Time spent in the sampler per sampling step in microseconds. One entry per sampler.sample() call, parallel to token_infer_durations and m_batch_sizes.
:type sampling_durations: list[float]
)";
common_bindings::utils::get_ms() (in src/bindings_utils.hpp) returns duration.count() for MicroSeconds, i.e., raw values are exposed in microseconds. Since this docstring block is being updated, please align the units for all raw duration lists in this docstring (many currently say “milliseconds”) or change the binding helper to actually convert to ms to avoid misleading Python docs.
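The factor-of-1000 mismatch described here can be illustrated with `datetime.timedelta` standing in for `std::chrono` durations (helper names below are illustrative, not the actual binding code):

```python
from datetime import timedelta

def get_us_count(td: timedelta) -> float:
    # Analogous to returning MicroSeconds::count() directly:
    # the value stays in raw microseconds.
    return td / timedelta(microseconds=1)

def get_ms(td: timedelta) -> float:
    # The conversion the docstring implies: actual milliseconds.
    return td / timedelta(milliseconds=1)

d = timedelta(milliseconds=2.5)
print(get_us_count(d), get_ms(d))  # → 2500.0 2.5, off by a factor of 1000
```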
It's a pre-existing issue in the file. I can address it, but for consistency it should be done for the entire file, in a separate PR.
@sbalandi Do I need to address this? I would do this for the entire file in a separate PR.
@as-suvorov @eshiryae could you please take a look at the Whisper side?
@sbalandi Could you please re-run the failing checks? The failures are not related to my changes.
:param grammar_compile_times: Time to compile the grammar in milliseconds.
:type grammar_compile_times: list[float]

:param sampling_durations: Time spent in the sampler per sampling step in milliseconds. One entry per sampler.sample() call, parallel to token_infer_durations and m_batch_sizes.
RawPerfMetrics.sampling_durations is documented here as milliseconds, but the pybind helper common_bindings::utils::get_ms returns MicroSeconds::count() without dividing by 1000, so the exposed list is in microseconds. Please adjust the stub docstring (or the binding conversion) so the units match the actual values users see in Python.
Suggested change:
- :param sampling_durations: Time spent in the sampler per sampling step in milliseconds. One entry per sampler.sample() call, parallel to token_infer_durations and m_batch_sizes.
+ :param sampling_durations: Time spent in the sampler per sampling step in microseconds. One entry per sampler.sample() call, parallel to token_infer_durations and m_batch_sizes.
Description
Enables llm_bench to report separate latencies for each WhisperPipeline stage and adds sampling duration tracking across all LLM pipeline backends.
Changes
Built locally and verified.
Closes: #3320
Checklist: