Fix int32 overflow in CUDA Cast and UnaryElementWise kernels for tensors with >2^31 elements#28386
Conversation
…ors with >2^31 elements Switch per-thread element index from CUDA_LONG (int32_t) to int64_t in: - _UnaryElementWise kernel (cu_inc/unary_elementwise_impl.cuh) - CastKernelStd kernel (tensor/cast_op.cu) - CastKernelSat kernel (tensor/cast_op.cu) - CudaCastPairwiseKernel (tensor/cast_op.cu) Also fix the launch functions to pass element count as int64_t instead of truncating via static_cast<int>, and fix blocksPerGrid calculation to avoid int32 overflow in the intermediate multiplication. Add regression test for large tensor cast. Agent-Logs-Url: https://github.com/microsoft/onnxruntime/sessions/0b1e04ca-17bd-4f26-aaec-728240d54577 Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Agent-Logs-Url: https://github.com/microsoft/onnxruntime/sessions/0b1e04ca-17bd-4f26-aaec-728240d54577 Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
tianleiwu
left a comment
There was a problem hiding this comment.
Review Summary
Correct and well-scoped fix for a real int32 overflow bug in CUDA Cast and UnaryElementWise kernels. The changes consistently replace CUDA_LONG (int32_t) with int64_t across kernel parameters and index calculations, matching the same fix pattern applied to Gather in PR #28108.
Positives:
- The
static_cast<int64_t>(NumElementsPerThread)correctly anchors the multiplication chain in 64-bit arithmetic before multiplying withblockIdx.x, preventing intermediate overflow. - The
unary_elementwise_impl.cuhheader change propagates the fix to all unary elementwise ops (Abs, Neg, Sqrt, Log, Exp, Erf, etc.) in a single edit. - All three cast kernel variants (
CastKernelStd,CastKernelSat,CudaCastPairwiseKernel) are consistently updated — no kernel was missed. - Removal of
static_cast<int>(num_of_elements)truncation in the launch functions is the most important part, since that's wheresize_t → int32_tsilently lost high bits.
Broader concern (out of scope): The CALCULATE_ELEMENTWISE_INDEX_OR_EXIT macro still uses CUDA_LONG and is used by dozens of other CUDA kernels (expand, tile, scatter_nd, resize, upsample, etc.) — they have the same int32 overflow vulnerability. Consider filing a follow-up issue to track the systemic fix.
|
@copilot please address comments and rebase from main |
…host-side overflow verification Agent-Logs-Url: https://github.com/microsoft/onnxruntime/sessions/d3f93dc0-5888-4bc0-a893-55294d96f191 Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
Addressed both review comments in f28a737. The rebase from main has a merge conflict in |
|
@copilot can you create a merge commit? |
…-kernel-crash # Conflicts: # onnxruntime/test/providers/cpu/tensor/cast_op_test.cc Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
unary_elementwise_impl.cuh: ChangeCUDA_LONGtoint64_tforNparameter and loop index in_UnaryElementWisekernel, and fixblocksPerGridcalculationcast_op.cu: ChangeCUDA_LONGtoint64_tforNparameter and loop index inCastKernelStd,CastKernelSat, andCudaCastPairwiseKernelkernels, and removestatic_cast<int>truncationsize_tforpair_countin CudaCastPairwise to avoid double conversion (review feedback)CastKernelCorrectness_ModerateSizeand addCastKernel_Int64IndexArithmetic_NoOverflowhost-side test (review feedback)