Fix int32 overflow in CUDA Cast and UnaryElementWise kernels for tensors with >2^31 elements by Copilot · Pull Request #28386 · microsoft/onnxruntime

Copilot · 2026-05-06T20:29:37Z

Fix unary_elementwise_impl.cuh: Change CUDA_LONG to int64_t for N parameter and loop index in _UnaryElementWise kernel, and fix blocksPerGrid calculation
Fix cast_op.cu: Change CUDA_LONG to int64_t for N parameter and loop index in CastKernelStd, CastKernelSat, and CudaCastPairwiseKernel kernels, and remove static_cast<int> truncation
Use size_t for pair_count in CudaCastPairwise to avoid double conversion (review feedback)
Rename test to CastKernelCorrectness_ModerateSize and add CastKernel_Int64IndexArithmetic_NoOverflow host-side test (review feedback)
Merge from main to resolve conflicts with Float8E8M0 tests
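The blocksPerGrid item in the list above can be illustrated with a small host-side sketch. This is not the code from the PR: the constants `kThreadsPerBlock` and `kElementsPerThread` and both function names are hypothetical, chosen only to show a ceiling division that stays in 64-bit arithmetic and is range-checked before being narrowed to a launch dimension.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical launch constants, chosen only for illustration.
constexpr int kThreadsPerBlock = 256;
constexpr int kElementsPerThread = 4;

// Ceiling division carried out entirely in 64-bit arithmetic, so neither the
// element count nor the rounding term can wrap around a 32-bit intermediate.
std::int64_t BlocksPerGrid64(std::int64_t num_elements) {
  const std::int64_t per_block =
      static_cast<std::int64_t>(kThreadsPerBlock) * kElementsPerThread;
  return (num_elements + per_block - 1) / per_block;
}

// Returns true when the block count can be narrowed to a 32-bit launch
// dimension without losing bits (assumed limit: 2^31 - 1 blocks in x).
bool FitsInLaunchDim(std::int64_t blocks) {
  return blocks >= 0 && blocks <= std::numeric_limits<std::int32_t>::max();
}
```

Under these assumed constants, a tensor of 2^32 elements needs 4,194,304 blocks, which still fits in a 32-bit grid dimension even though the element count itself does not fit in int32_t.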

…ors with >2^31 elements

Switch per-thread element index from CUDA_LONG (int32_t) to int64_t in:
- _UnaryElementWise kernel (cu_inc/unary_elementwise_impl.cuh)
- CastKernelStd kernel (tensor/cast_op.cu)
- CastKernelSat kernel (tensor/cast_op.cu)
- CudaCastPairwiseKernel (tensor/cast_op.cu)

Also fix the launch functions to pass element count as int64_t instead of truncating via static_cast<int>, and fix blocksPerGrid calculation to avoid int32 overflow in the intermediate multiplication.

Add regression test for large tensor cast.

Agent-Logs-Url: https://github.com/microsoft/onnxruntime/sessions/0b1e04ca-17bd-4f26-aaec-728240d54577
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>

tianleiwu

Review Summary

Correct and well-scoped fix for a real int32 overflow bug in CUDA Cast and UnaryElementWise kernels. The changes consistently replace CUDA_LONG (int32_t) with int64_t across kernel parameters and index calculations, matching the same fix pattern applied to Gather in PR #28108.

Positives:

The static_cast<int64_t>(NumElementsPerThread) correctly anchors the multiplication chain in 64-bit arithmetic before multiplying with blockIdx.x, preventing intermediate overflow.
The unary_elementwise_impl.cuh header change propagates the fix to all unary elementwise ops (Abs, Neg, Sqrt, Log, Exp, Erf, etc.) in a single edit.
All three cast kernel variants (CastKernelStd, CastKernelSat, CudaCastPairwiseKernel) are consistently updated — no kernel was missed.
Removal of static_cast<int>(num_of_elements) truncation in the launch functions is the most important part, since that's where size_t → int32_t silently lost high bits.
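The two failure modes named in the positives above (32-bit index arithmetic and the `static_cast<int>` truncation of the element count) can be reproduced on the host. The sketch below is an illustration under stated assumptions, not the kernel code from the PR: the constants and the function names `StartIndex32`, `StartIndex64`, and `TruncateCount` are hypothetical. The 32-bit case is modelled with unsigned wraparound, because signed overflow is undefined behavior in C++. The conversions back to a signed type are modular as of C++20 and wrap in practice on mainstream compilers before that.

```cpp
#include <cstdint>

// Hypothetical launch constants, chosen only for illustration.
constexpr int kNumThreadsPerBlock = 256;
constexpr int kNumElementsPerThread = 4;

// Models the old behavior: the whole product is evaluated in 32 bits.
// Unsigned arithmetic is used so the wraparound is well defined.
std::int32_t StartIndex32(std::uint32_t block_idx, std::uint32_t thread_idx) {
  const std::uint32_t wrapped =
      static_cast<std::uint32_t>(kNumElementsPerThread) * kNumThreadsPerBlock *
          block_idx + thread_idx;
  return static_cast<std::int32_t>(wrapped);
}

// Models the fixed behavior: casting the first operand to int64_t promotes
// every later multiplication and addition in the chain to 64 bits.
std::int64_t StartIndex64(std::uint32_t block_idx, std::uint32_t thread_idx) {
  return static_cast<std::int64_t>(kNumElementsPerThread) * kNumThreadsPerBlock *
             block_idx + thread_idx;
}

// Models the launch-side truncation: a 64-bit element count narrowed to int
// keeps only the low 32 bits.
int TruncateCount(std::uint64_t num_elements) {
  return static_cast<int>(num_elements);
}
```

With these assumed constants each block covers 1024 elements, so block 2^21 starts at element 2^31. The 32-bit model yields a negative index there, while the 64-bit version yields 2147483648. Likewise, a count of 2^32 + 5 truncates to 5, so a launch sized from it would cover only a handful of elements.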

Broader concern (out of scope): The CALCULATE_ELEMENTWISE_INDEX_OR_EXIT macro still uses CUDA_LONG and is used by dozens of other CUDA kernels (expand, tile, scatter_nd, resize, upsample, etc.) — they have the same int32 overflow vulnerability. Consider filing a follow-up issue to track the systemic fix.

justinchuby · 2026-05-18T19:18:18Z

@copilot please address comments and rebase from main

…host-side overflow verification

Agent-Logs-Url: https://github.com/microsoft/onnxruntime/sessions/d3f93dc0-5888-4bc0-a893-55294d96f191
Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>

Copilot · 2026-05-18T19:26:52Z

> @copilot please address comments and rebase from main

Addressed both review comments in f28a737. The rebase from main has a merge conflict in cast_op_test.cc (due to the Float8E8M0 tests added in #28435) that requires a force-push which I'm unable to perform with the available tooling. The conflict is only in the test file — the functional CUDA kernel changes apply cleanly to main.

justinchuby · 2026-05-18T19:27:33Z

@copilot can you create a merge commit?

…-kernel-crash

# Conflicts:
# onnxruntime/test/providers/cpu/tensor/cast_op_test.cc

Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>

Copilot · 2026-05-18T19:31:51Z

> @copilot can you create a merge commit?

Done in 081aa2f — merged origin/main into the branch, resolving the conflict in cast_op_test.cc (our tests now appear before the Float8E8M0 tests added in #28435).

github-actions

You can commit the suggested changes from lintrunner.

Initial plan

3ace5f3

Copilot AI assigned Copilot and tianleiwu May 6, 2026

Copilot started work on behalf of tianleiwu May 6, 2026 20:29

Copilot AI linked an issue May 6, 2026 that may be closed by this pull request

CUDA Cast kernel crashes with illegal memory access on tensors with >2^31 elements (int32 overflow) — same family as #28107 #28385

Closed

Copilot AI and others added 2 commits May 6, 2026 20:35

Fix placeholder issue number in test comment

92ac4fa

Agent-Logs-Url: https://github.com/microsoft/onnxruntime/sessions/0b1e04ca-17bd-4f26-aaec-728240d54577 Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Fix CUDA Cast kernel illegal memory access on large tensors~~ Fix int32 overflow in CUDA Cast and UnaryElementWise kernels for tensors with >2^31 elements May 6, 2026

Copilot finished work on behalf of tianleiwu May 6, 2026 20:37

Copilot AI requested a review from tianleiwu May 6, 2026 20:37

tianleiwu reviewed May 7, 2026

Comment thread onnxruntime/test/providers/cpu/tensor/cast_op_test.cc Outdated

Comment thread onnxruntime/core/providers/cuda/tensor/cast_op.cu Outdated

Copilot started work on behalf of justinchuby May 18, 2026 19:18

Copilot finished work on behalf of justinchuby May 18, 2026 19:27

Copilot AI requested a review from justinchuby May 18, 2026 19:27

Copilot started work on behalf of justinchuby May 18, 2026 19:27

Merge remote-tracking branch 'origin/main' into copilot/fix-cuda-cast…

081aa2f

…-kernel-crash

# Conflicts:
# onnxruntime/test/providers/cpu/tensor/cast_op_test.cc

Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>

Copilot finished work on behalf of justinchuby May 18, 2026 19:33

github-actions Bot reviewed May 18, 2026

Comment thread onnxruntime/test/providers/cpu/tensor/cast_op_test.cc Outdated

check blocksPerGrid

ad13c0e

tianleiwu marked this pull request as ready for review May 18, 2026 22:25

justinchuby approved these changes May 18, 2026

tianleiwu merged commit cd05ef4 into main May 19, 2026
88 checks passed

tianleiwu deleted the copilot/fix-cuda-cast-kernel-crash branch May 19, 2026 17:56

BrewTestBot mentioned this pull request Jun 19, 2026

onnxruntime 1.27.0 Homebrew/homebrew-core#288892

Merged

dependabot Bot mentioned this pull request Jun 20, 2026

Bump Microsoft.ML.OnnxRuntime from 1.25.0 to 1.27.0 dvai-global/dvai-bridge#47

Open

tianleiwu merged 6 commits into main from copilot/fix-cuda-cast-kernel-crash (#28386)
