[CUDA] Fix division by zero in histogram construction with discrete data by Copilot · Pull Request #7123 · lightgbm-org/LightGBM

Copilot · 2026-01-09T17:14:28Z

CUDA builds crash with SIGFPE when training on discrete data where n_unique_values * n_features exceeds histogram bin thresholds (e.g., 5 values × 600 features).

Root Cause

Division by zero in CalcConstructHistogramKernelDim():

*block_dim_y = NUM_THREADS_PER_BLOCK / cuda_row_data_->max_num_column_per_partition();

The feature partitioning logic in cuda_row_data.cpp can produce max_num_column_per_partition_ = 0 when all partitions end up with zero columns—an edge case in how bins are distributed across partitions.

Changes

src/io/cuda/cuda_row_data.cpp: Guard max_num_column_per_partition_ to be at least 1 in both DivideCUDAFeatureGroups() and GetSparseDataPartitioned()
tests/python_package_test/test_engine.py: Add regression test reproducing the exact failure scenario (50k rows × 600 features × 5 discrete values)

Example

import lightgbm as lgb
import numpy as np

# Previously crashed with SIGFPE
X = np.random.randint(0, 5, (50000, 600)).astype(np.float32)
y = np.random.uniform(0, 1, 50000).astype(np.float32)

model = lgb.LGBMRegressor(device='cuda', n_estimators=10)
model.fit(X, y)  # Now succeeds

Original prompt

This section details on the original issue you should resolve

<issue_title>[CUDA] SIGFPE (Floating point exception) with discrete data when n_unique_values * n_features exceeds threshold</issue_title>
<issue_description>## Description

LightGBM with device='cuda' crashes with SIGFPE (Floating point exception) when training on discrete data where the product of unique values and number of features exceeds a certain threshold.

Reproducible Example
import lightgbm as lgb
import numpy as np

# FAIL: 5 discrete values × 600 features
X = np.random.randint(0, 5, (50000, 600)).astype(np.float32)
y = np.random.uniform(0, 1, 50000).astype(np.float32)

model = lgb.LGBMRegressor(device='cuda', n_estimators=10, verbose=-1)
model.fit(X, y)  # SIGFPE: Floating point exception (core dumped)
Test Results

Tested with 50,000 rows:

Unique Values 500 cols 600 cols 700 cols

2 Pass Pass Pass

3 Pass Pass SIGFPE

4 Pass Pass SIGFPE

5 Pass SIGFPE SIGFPE

6 Pass SIGFPE SIGFPE

7 Pass SIGFPE SIGFPE

8 Pass SIGFPE SIGFPE

Observed Pattern

The crash threshold depends on both the number of unique values and the number of features:

Unique Values Approximate Safe Column Limit

2 700+

3-4 600-700

5+ 500

This suggests a relationship between n_unique * n_features and available CUDA histogram bins.

Workaround

Adding tiny noise converts discrete values to continuous and avoids the crash:
X = X.astype(np.float32)
X += np.random.uniform(-1e-6, 1e-6, X.shape).astype(np.float32)
# Training now succeeds
Real-World Impact

This bug affects Numerai tournament data:

2.7M rows × 2376 features

int8 dtype with values {0, 1, 2, 3, 4} (5 discrete values)

Always triggers SIGFPE with CUDA

Environment

LightGBM version: 4.6.0.99 (source-built with GCC 10)

CUDA version: 12.6

GPU: NVIDIA RTX 5000 Ada Generation (Compute Capability 8.9)

Driver: 572.16

OS: Windows 11 + WSL2 (Ubuntu) + Docker (nvidia-docker)

Python: 3.10

Build Command
cmake -B build -S . \
    -DUSE_CUDA=1 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=gcc-10 \
    -DCMAKE_CXX_COMPILER=g++-10
cmake --build build -j$(nproc)
cd python-package && pip install .
Notes

CPU training works fine with the same data

The issue appears to be in CUDA histogram binning logic

Row count does not significantly affect the threshold</issue_description>

Comments on the Issue (you are @copilot in this section)
@jameslamb Thank you for the EXCELLENT report! We will look into this when we can.
Hopefully @shiyu1994 can take a look.</comment_new>
<comment_new>@shiyu1994
Thanks for reporting this. This could due to some part of the code uses a integer type which is not long enough. I will investigate this.!</comment_new>

Fixes [CUDA] SIGFPE (Floating point exception) with discrete data when n_unique_values * n_features exceeds threshold #7122

✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

Co-authored-by: shiyu1994 <14541765+shiyu1994@users.noreply.github.com>

jameslamb

Thanks @shiyu1994 for looking into this so soon!

But the tests are segfaulting and the test code has some issues. Pelase see my comments.

jameslamb · 2026-01-09T19:13:37Z

tests/python_package_test/test_engine.py

+    y = np.random.uniform(0, 1, 50000).astype(np.float32)
+
+    # This should not crash with SIGFPE
+    model = lgb.LGBMRegressor(device='cuda', n_estimators=10, verbose=-1)


We should not be using a scikit-learn estimator (LGBMRegressor) here in test_engine.py. This should be changes to use lgb.train().

jameslamb · 2026-01-09T19:14:42Z

tests/python_package_test/test_engine.py

+    # Test case from the issue: 5 discrete values × 600 features
+    # This used to cause SIGFPE due to division by zero in max_num_column_per_partition_
+    np.random.seed(42)
+    X = np.random.randint(0, 5, (50000, 600)).astype(np.float32)


Do we REALLY need 600 features to observe this failure? That's a huge and costly dataset for testing. It might make it difficult to run the tests on some types of systems.

Can we reproduce this with a smaller dataset?

Initial plan

ed1fcc4

Copilot AI assigned Copilot and shiyu1994 Jan 9, 2026

Copilot started work on behalf of shiyu1994 January 9, 2026 17:14 View session

Copilot AI and others added 4 commits January 9, 2026 17:19

Fix SIGFPE: ensure max_num_column_per_partition is never zero

6cb2e94

Co-authored-by: shiyu1994 <14541765+shiyu1994@users.noreply.github.com>

Add test for CUDA discrete data with many features

e32fd46

Co-authored-by: shiyu1994 <14541765+shiyu1994@users.noreply.github.com>

Final update: CUDA SIGFPE fix complete with tests and security checks

bbebeff

Co-authored-by: shiyu1994 <14541765+shiyu1994@users.noreply.github.com>

Clean up: remove CodeQL build artifacts and update .gitignore

6fa10da

Co-authored-by: shiyu1994 <14541765+shiyu1994@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Fix SIGFPE crash in LightGBM with CUDA on discrete data~~ [CUDA] Fix division by zero in histogram construction with discrete data Jan 9, 2026

Copilot finished work on behalf of shiyu1994 January 9, 2026 17:24

Copilot AI requested a review from shiyu1994 January 9, 2026 17:24

shiyu1994 marked this pull request as ready for review January 9, 2026 17:25

shiyu1994 requested review from StrikerRUS, borchero, guolinke, jameslamb and jmoralez as code owners January 9, 2026 17:25

jameslamb requested changes Jan 9, 2026

View reviewed changes

jameslamb mentioned this pull request Jan 17, 2026

release v4.7.0 #7129

Open

40 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CUDA] Fix division by zero in histogram construction with discrete data#7123

[CUDA] Fix division by zero in histogram construction with discrete data#7123
Copilot wants to merge 5 commits intomasterfrom
copilot/fix-sigfpe-discrete-data

Copilot AI commented Jan 9, 2026 •

edited

Loading

Uh oh!

jameslamb left a comment

Uh oh!

jameslamb Jan 9, 2026

Uh oh!

jameslamb Jan 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Unique Values	500 cols	600 cols	700 cols
2	Pass	Pass	Pass
3	Pass	Pass	SIGFPE
4	Pass	Pass	SIGFPE
5	Pass	SIGFPE	SIGFPE
6	Pass	SIGFPE	SIGFPE
7	Pass	SIGFPE	SIGFPE
8	Pass	SIGFPE	SIGFPE

Conversation

Copilot AI commented Jan 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Root Cause

Changes

Example

Reproducible Example

Test Results

Observed Pattern

Workaround

Real-World Impact

Environment

Build Command

Notes

Comments on the Issue (you are @copilot in this section)

Uh oh!

jameslamb left a comment

Choose a reason for hiding this comment

Uh oh!

jameslamb Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

jameslamb Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Copilot AI commented Jan 9, 2026 •

edited

Loading