Skip to content

[CUDA] Fix division by zero in histogram construction with discrete data#7123

Open
Copilot wants to merge 5 commits intomasterfrom
copilot/fix-sigfpe-discrete-data
Open

[CUDA] Fix division by zero in histogram construction with discrete data#7123
Copilot wants to merge 5 commits intomasterfrom
copilot/fix-sigfpe-discrete-data

Conversation

Copy link
Copy Markdown

Copilot AI commented Jan 9, 2026

CUDA builds crash with SIGFPE when training on discrete data where n_unique_values * n_features exceeds histogram bin thresholds (e.g., 5 values × 600 features).

Root Cause

Division by zero in CalcConstructHistogramKernelDim():

*block_dim_y = NUM_THREADS_PER_BLOCK / cuda_row_data_->max_num_column_per_partition();

The feature partitioning logic in cuda_row_data.cpp can produce max_num_column_per_partition_ = 0 when all partitions end up with zero columns—an edge case in how bins are distributed across partitions.

Changes

  • src/io/cuda/cuda_row_data.cpp: Guard max_num_column_per_partition_ to be at least 1 in both DivideCUDAFeatureGroups() and GetSparseDataPartitioned()
  • tests/python_package_test/test_engine.py: Add regression test reproducing the exact failure scenario (50k rows × 600 features × 5 discrete values)

Example

import lightgbm as lgb
import numpy as np

# Previously crashed with SIGFPE
X = np.random.randint(0, 5, (50000, 600)).astype(np.float32)
y = np.random.uniform(0, 1, 50000).astype(np.float32)

model = lgb.LGBMRegressor(device='cuda', n_estimators=10)
model.fit(X, y)  # Now succeeds
Original prompt

This section details on the original issue you should resolve

<issue_title>[CUDA] SIGFPE (Floating point exception) with discrete data when n_unique_values * n_features exceeds threshold</issue_title>
<issue_description>## Description

LightGBM with device='cuda' crashes with SIGFPE (Floating point exception) when training on discrete data where the product of unique values and number of features exceeds a certain threshold.

Reproducible Example

import lightgbm as lgb
import numpy as np

# FAIL: 5 discrete values × 600 features
X = np.random.randint(0, 5, (50000, 600)).astype(np.float32)
y = np.random.uniform(0, 1, 50000).astype(np.float32)

model = lgb.LGBMRegressor(device='cuda', n_estimators=10, verbose=-1)
model.fit(X, y)  # SIGFPE: Floating point exception (core dumped)

Test Results

Tested with 50,000 rows:

Unique Values 500 cols 600 cols 700 cols
2 Pass Pass Pass
3 Pass Pass SIGFPE
4 Pass Pass SIGFPE
5 Pass SIGFPE SIGFPE
6 Pass SIGFPE SIGFPE
7 Pass SIGFPE SIGFPE
8 Pass SIGFPE SIGFPE

Observed Pattern

The crash threshold depends on both the number of unique values and the number of features:

Unique Values Approximate Safe Column Limit
2 700+
3-4 600-700
5+ 500

This suggests a relationship between n_unique * n_features and available CUDA histogram bins.

Workaround

Adding tiny noise converts discrete values to continuous and avoids the crash:

X = X.astype(np.float32)
X += np.random.uniform(-1e-6, 1e-6, X.shape).astype(np.float32)
# Training now succeeds

Real-World Impact

This bug affects Numerai tournament data:

  • 2.7M rows × 2376 features
  • int8 dtype with values {0, 1, 2, 3, 4} (5 discrete values)
  • Always triggers SIGFPE with CUDA

Environment

  • LightGBM version: 4.6.0.99 (source-built with GCC 10)
  • CUDA version: 12.6
  • GPU: NVIDIA RTX 5000 Ada Generation (Compute Capability 8.9)
  • Driver: 572.16
  • OS: Windows 11 + WSL2 (Ubuntu) + Docker (nvidia-docker)
  • Python: 3.10

Build Command

cmake -B build -S . \
    -DUSE_CUDA=1 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=gcc-10 \
    -DCMAKE_CXX_COMPILER=g++-10
cmake --build build -j$(nproc)
cd python-package && pip install .

Notes

  • CPU training works fine with the same data
  • The issue appears to be in CUDA histogram binning logic
  • Row count does not significantly affect the threshold</issue_description>

Comments on the Issue (you are @copilot in this section)

@jameslamb Thank you for the EXCELLENT report! We will look into this when we can.

Hopefully @shiyu1994 can take a look.</comment_new>
<comment_new>@shiyu1994
Thanks for reporting this. This could due to some part of the code uses a integer type which is not long enough. I will investigate this.!</comment_new>


✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

Copilot AI and others added 4 commits January 9, 2026 17:19
Co-authored-by: shiyu1994 <14541765+shiyu1994@users.noreply.github.com>
Co-authored-by: shiyu1994 <14541765+shiyu1994@users.noreply.github.com>
Co-authored-by: shiyu1994 <14541765+shiyu1994@users.noreply.github.com>
Co-authored-by: shiyu1994 <14541765+shiyu1994@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix SIGFPE crash in LightGBM with CUDA on discrete data [CUDA] Fix division by zero in histogram construction with discrete data Jan 9, 2026
Copilot AI requested a review from shiyu1994 January 9, 2026 17:24
@shiyu1994 shiyu1994 marked this pull request as ready for review January 9, 2026 17:25
Copy link
Copy Markdown
Member

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @shiyu1994 for looking into this so soon!

But the tests are segfaulting and the test code has some issues. Pelase see my comments.

y = np.random.uniform(0, 1, 50000).astype(np.float32)

# This should not crash with SIGFPE
model = lgb.LGBMRegressor(device='cuda', n_estimators=10, verbose=-1)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not be using a scikit-learn estimator (LGBMRegressor) here in test_engine.py. This should be changes to use lgb.train().

# Test case from the issue: 5 discrete values × 600 features
# This used to cause SIGFPE due to division by zero in max_num_column_per_partition_
np.random.seed(42)
X = np.random.randint(0, 5, (50000, 600)).astype(np.float32)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we REALLY need 600 features to observe this failure? That's a huge and costly dataset for testing. It might make it difficult to run the tests on some types of systems.

Can we reproduce this with a smaller dataset?

@jameslamb jameslamb mentioned this pull request Jan 17, 2026
40 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CUDA] SIGFPE (Floating point exception) with discrete data when n_unique_values * n_features exceeds threshold

3 participants