
Fix int32 overflow in CUDA Cast and UnaryElementWise kernels for tensors with >2^31 elements #28386

Merged
tianleiwu merged 6 commits into main from copilot/fix-cuda-cast-kernel-crash
May 19, 2026

Copilot AI commented May 6, 2026

  • Fix unary_elementwise_impl.cuh: Change CUDA_LONG to int64_t for N parameter and loop index in _UnaryElementWise kernel, and fix blocksPerGrid calculation
  • Fix cast_op.cu: Change CUDA_LONG to int64_t for N parameter and loop index in CastKernelStd, CastKernelSat, and CudaCastPairwiseKernel kernels, and remove static_cast<int> truncation
  • Use size_t for pair_count in CudaCastPairwise to avoid double conversion (review feedback)
  • Rename test to CastKernelCorrectness_ModerateSize and add CastKernel_Int64IndexArithmetic_NoOverflow host-side test (review feedback)
  • Merge from main to resolve conflicts with Float8E8M0 tests
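
The first two items come down to the width of the index each thread computes. The following is a minimal host-side sketch of that arithmetic, not the kernel code from this PR: the launch constants (`kThreadsPerBlock`, `kElementsPerThread`) and the helper names are assumptions chosen for illustration. It emulates a 32-bit index by letting the product wrap to 32 bits and compares it with the same computation carried out in 64 bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical launch constants for illustration; not taken from the PR.
constexpr int kThreadsPerBlock = 256;
constexpr int kElementsPerThread = 4;

// What a 32-bit index would hold: the product wraps modulo 2^32 and is then
// reinterpreted as signed (two's complement on mainstream compilers,
// guaranteed from C++20).
inline std::int32_t StartIndex32(std::uint32_t block_idx, std::uint32_t thread_idx) {
  const std::uint32_t wrapped =
      static_cast<std::uint32_t>(kElementsPerThread) * kThreadsPerBlock * block_idx + thread_idx;
  return static_cast<std::int32_t>(wrapped);
}

// Widening the first operand keeps the whole multiplication chain in 64 bits.
inline std::int64_t StartIndex64(std::uint32_t block_idx, std::uint32_t thread_idx) {
  return static_cast<std::int64_t>(kElementsPerThread) * kThreadsPerBlock * block_idx + thread_idx;
}
```

With these assumed constants, block 2,097,152 starts at element 2^31: `StartIndex64` returns 2147483648, while `StartIndex32` wraps to a negative value, which is the kind of index that can end in an illegal memory access.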

Copilot AI and others added 2 commits May 6, 2026 20:35
…ors with >2^31 elements

Switch per-thread element index from CUDA_LONG (int32_t) to int64_t in:
- _UnaryElementWise kernel (cu_inc/unary_elementwise_impl.cuh)
- CastKernelStd kernel (tensor/cast_op.cu)
- CastKernelSat kernel (tensor/cast_op.cu)
- CudaCastPairwiseKernel (tensor/cast_op.cu)

Also fix the launch functions to pass element count as int64_t instead of
truncating via static_cast<int>, and fix blocksPerGrid calculation to
avoid int32 overflow in the intermediate multiplication.
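
As a rough illustration of the blocksPerGrid point, the sketch below computes a block count with every intermediate value held in 64 bits. The exact expression used in the PR is not shown here, so the function names, parameter names, and the quotient-plus-remainder form are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Number of blocks needed to cover `count` elements, computed entirely in
// 64-bit arithmetic. Names are hypothetical, not taken from the PR.
inline std::int64_t BlocksPerGrid(std::int64_t count, int threads_per_block,
                                  int elements_per_thread) {
  // Widen before multiplying so the per-block element count is 64-bit.
  const std::int64_t per_block =
      static_cast<std::int64_t>(threads_per_block) * elements_per_thread;
  // Quotient plus remainder check avoids forming count + per_block - 1.
  return count / per_block + (count % per_block != 0 ? 1 : 0);
}

// CUDA limits gridDim.x to 2^31 - 1 on compute capability 3.0 and later,
// so the result should be checked before it is narrowed for the launch.
inline bool FitsGridX(std::int64_t blocks) {
  return blocks >= 0 && blocks <= std::numeric_limits<std::int32_t>::max();
}
```

For example, 2^31 + 1 elements at 256 threads and 4 elements per thread need 2,097,153 blocks, which still fits in the x-dimension of a grid.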

Add regression test for large tensor cast.

Agent-Logs-Url: https://github.com/microsoft/onnxruntime/sessions/0b1e04ca-17bd-4f26-aaec-728240d54577

Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix CUDA Cast kernel illegal memory access on large tensors" to "Fix int32 overflow in CUDA Cast and UnaryElementWise kernels for tensors with >2^31 elements" May 6, 2026
Copilot AI requested a review from tianleiwu May 6, 2026 20:37

@tianleiwu tianleiwu left a comment

Review Summary

Correct and well-scoped fix for a real int32 overflow bug in CUDA Cast and UnaryElementWise kernels. The changes consistently replace CUDA_LONG (int32_t) with int64_t across kernel parameters and index calculations, matching the same fix pattern applied to Gather in PR #28108.

Positives:

  • The static_cast<int64_t>(NumElementsPerThread) correctly anchors the multiplication chain in 64-bit arithmetic before multiplying with blockIdx.x, preventing intermediate overflow.
  • The unary_elementwise_impl.cuh header change propagates the fix to all unary elementwise ops (Abs, Neg, Sqrt, Log, Exp, Erf, etc.) in a single edit.
  • All three cast kernel variants (CastKernelStd, CastKernelSat, CudaCastPairwiseKernel) are consistently updated — no kernel was missed.
  • Removal of static_cast<int>(num_of_elements) truncation in the launch functions is the most important part, since that's where size_t → int32_t silently lost high bits.
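
To make the last point concrete, here is a small host-side sketch of what the narrowing does to a large element count. The helper names are hypothetical, and `std::uint64_t` stands in for `size_t` on a 64-bit platform; this is an illustration of the conversion, not the launch code from the PR.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the old launch-site narrowing: only the low 32 bits survive.
// (Wraps on mainstream compilers; guaranteed modular from C++20.)
inline std::int32_t TruncatedCount(std::uint64_t num_of_elements) {
  return static_cast<std::int32_t>(num_of_elements);
}

// Stand-in for the fixed path: the count is carried as a 64-bit value.
inline std::int64_t WideCount(std::uint64_t num_of_elements) {
  return static_cast<std::int64_t>(num_of_elements);
}
```

A count of 2^32 + 5 narrows to 5, so the kernel would be told there are only five elements, and a count of 2^31 narrows to a negative number. The 64-bit path preserves both values.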

Broader concern (out of scope): The CALCULATE_ELEMENTWISE_INDEX_OR_EXIT macro still uses CUDA_LONG and is used by dozens of other CUDA kernels (expand, tile, scatter_nd, resize, upsample, etc.) — they have the same int32 overflow vulnerability. Consider filing a follow-up issue to track the systemic fix.

Comment thread onnxruntime/test/providers/cpu/tensor/cast_op_test.cc Outdated
Comment thread onnxruntime/core/providers/cuda/tensor/cast_op.cu Outdated
@justinchuby

@copilot please address comments and rebase from main

…host-side overflow verification

Agent-Logs-Url: https://github.com/microsoft/onnxruntime/sessions/d3f93dc0-5888-4bc0-a893-55294d96f191

Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>

Copilot AI commented May 18, 2026

@copilot please address comments and rebase from main

Addressed both review comments in f28a737. The rebase from main has a merge conflict in cast_op_test.cc (due to the Float8E8M0 tests added in #28435) that requires a force-push which I'm unable to perform with the available tooling. The conflict is only in the test file — the functional CUDA kernel changes apply cleanly to main.

@justinchuby

@copilot can you create a merge commit?

…-kernel-crash

# Conflicts:
#	onnxruntime/test/providers/cpu/tensor/cast_op_test.cc

Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>

Copilot AI commented May 18, 2026

@copilot can you create a merge commit?

Done in 081aa2f — merged origin/main into the branch, resolving the conflict in cast_op_test.cc (our tests now appear before the Float8E8M0 tests added in #28435).

@github-actions github-actions Bot left a comment

You can commit the suggested changes from lintrunner.

Comment thread onnxruntime/test/providers/cpu/tensor/cast_op_test.cc Outdated
@tianleiwu tianleiwu marked this pull request as ready for review May 18, 2026 22:25
@tianleiwu tianleiwu merged commit cd05ef4 into main May 19, 2026
88 checks passed
@tianleiwu tianleiwu deleted the copilot/fix-cuda-cast-kernel-crash branch May 19, 2026 17:56

Development

Successfully merging this pull request may close these issues.

CUDA Cast kernel crashes with illegal memory access on tensors with >2^31 elements (int32 overflow) — same family as #28107

3 participants