Properly handle per request seed for multinomial sampling #3721
mzegla wants to merge 12 commits into openvinotoolkit:master from
Conversation
Pull request overview
This PR fixes RNG seeding for multinomial sampling by moving from a single sampler-level RNG seed to per-request RNG engines (important for continuous batching), so GenerationConfig.rng_seed is honored per request and reproducibility works as expected.
Changes:
- Refactor the C++ sampler to keep per-request sampling state (including `std::mt19937`) in a request context map and pass the RNG explicitly through the multinomial sampling/validation paths.
- Stop seeding the sampler globally during continuous batching pipeline initialization.
- Add Python tests covering same-seed reproducibility and different-seed variability; add RNG seed CLI support to the multinomial text-generation samples.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tests/python_tests/test_llm_pipeline.py | Adds RNG seed reproducibility/difference tests across pipeline types. |
| tests/python_tests/test_continuous_batching.py | Adds continuous batching add_request/step tests for RNG seed behavior. |
| src/cpp/src/sampling/sampler.hpp | Introduces per-request sampler context (RNG/logit processor/stop strings) and updates method signatures. |
| src/cpp/src/sampling/sampler.cpp | Uses per-request contexts during sampling; routes RNG through multinomial sampling and validation. |
| src/cpp/src/continuous_batching/pipeline_impl.cpp | Removes global sampler seeding at CB pipeline init. |
| samples/python/text_generation/multinomial_causal_lm.py | Adds optional --rng_seed argument and validation. |
| samples/cpp/text_generation/multinomial_causal_lm.cpp | Adds optional RNG seed CLI arg parsing and sets config.rng_seed. |
```diff
 int main(int argc, char* argv[]) try {
-    if (3 != argc) {
-        throw std::runtime_error(std::string{"Usage: "} + argv[0] + " <MODEL_DIR> '<PROMPT>'");
+    if (argc < 3 || argc > 4) {
```
We do not expose other config properties as arguments. Let's revert the seed argument and keep the samples simple.
The expectation is that with the multinomial algorithm, responses should differ slightly each time. I guess @mzegla is trying to showcase the difference that way, but the correct behavior would be to have a random default seed.
If we have a fixed static seed, the point of multinomial sampling is lost.
I propose dropping this parameter from the sample and setting a random seed by default.
Makes sense to me. @as-suvorov , @apaniukov - what do you think?
@dtrawins Do you mean setting a random seed in GenerationConfig?
@as-suvorov yes, the seed should be random by default, so it would be best to set a random value in the constructor. That would achieve the correct behavior of the multinomial algorithm while still giving the option of deterministic responses via a static seed value in the user request.
Potentially that might impact unit tests, but this is needed. An open question is whether that should be done in the same PR or a separate one. I think it can be this one, as the title states "Properly handle per request seed for multinomial sampling" ;-)
llama.cpp and vllm use non-deterministic generation by default; getting a deterministic sequence requires setting the seed explicitly.
How shall we proceed?
It's fine by me. @Wovchena @apaniukov Do you have any objections?
Is alignment with transformers and llama.cpp the only reason? A counterargument would be not to suddenly change a GenAI default that people may already be used to.
@likholat, can you describe the behavior of random generators for image generation?
From my perspective it's more of a misconfiguration that is now being fixed.
I would say the general expectation when using random sampling is to get different outputs for exactly the same prompt by default (#3584), and deterministic behavior only on demand, when a specific seed is set.
Pushed a few commits with the default random seed change. Let me know what the final decision is.
If there are objections, I will revert it and keep only the fix for correct seed usage.
```python
# Speculative decoding and prompt lookup require extra GenerationConfig params;
# exclude them from generic RNG-seed tests.
_NON_ASSISTANT_PIPELINE_TYPES = tuple(
    pt
    for pt in ALL_PIPELINE_TYPES
    if pt not in (PipelineType.SPECULATIVE_DECODING, PipelineType.PROMPT_LOOKUP_DECODING)
)
```
The PR description doesn’t appear to follow the repository PR template (missing the checklist items like tests/docs/ticket coverage). Please update the PR description to include the template sections and mark the checklist items appropriately so reviewers can verify completion (see .github/pull_request_template.md).
```cpp
RequestSamplerContext(size_t seed, LogitProcessor&& lp)
    : rng_engine(seed), logit_processor(std::move(lp)) {}
```
RequestSamplerContext seeds std::mt19937 with a size_t. std::mt19937 only uses a 32-bit seed, so high bits of rng_seed are silently truncated (e.g., seeds that differ by 2^32 become identical), which can surprise users and reduce entropy for RANDOM_SEED-derived values. Consider seeding via std::seed_seq (splitting the size_t into 32-bit parts) or switching to std::mt19937_64 if you want rng_seed to be fully respected.
```python
# Cross-comparison is skipped: without a fixed rng_seed each run produces
# non-deterministic output, so C++, Python, and JS results will differ.
```
This test now runs the C++/Python/JS multinomial samples but no longer asserts anything about their outputs, which significantly reduces the value of the sample test. Since deterministic behavior is still supported via GenerationConfig.rng_seed, consider setting a fixed rng_seed in the samples (or otherwise controlling it in the test) and restoring cross-comparison (or at least asserting basic expectations like non-empty output) so the test continues to catch regressions.
```cpp
// std::random_device::entropy() == 0 means it is a PRNG, not a true RNG.
static const bool is_rd_prng = std::random_device{}.entropy() == 0;
if (is_rd_prng) {
    return static_cast<size_t>(std::chrono::steady_clock::now().time_since_epoch().count());
}
```
resolve_rng_seed(): when std::random_device::entropy()==0, the fallback seeds from steady_clock ticks, which can collide for multiple requests created in the same tick (e.g., batched prompts / continuous batching), undermining the promise of a distinct non-deterministic seed per request. Consider mixing in an atomic counter / request_id (or still consuming std::random_device() even if entropy()==0) and/or switching to system_clock as stated in the comment so seeds are less likely to repeat across runs.
Accept a per-request seed for multinomial sampling instead of initializing a default seed at pipeline initialization (continuous batching).
Addressing: #3584