Fix categorical data support with cuDF 2602. by trivialfis · Pull Request #12140 · dmlc/xgboost

trivialfis · 2026-04-03T10:44:11Z

Copilot

Pull request overview

Fixes a cuDF 26.02 compatibility break that caused XGBoost to crash when ingesting cuDF categorical columns on GPU by adapting how category columns are converted to pylibcudf / Arrow device arrays.

Changes:

Update cudf_cat_inf to call to_pylibcudf() without the removed mode argument on newer cuDF versions.
Add a fallback path for older cuDF versions that still require/accept mode="read".
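
The two changes above amount to a version-gated call. The following is a minimal sketch of that pattern, not the code from this PR: `OldColumn`, `NewColumn`, `parse_version`, and `convert` are hypothetical stand-ins, and the `(26, 2)` cutoff is an assumption taken from the 26.02 break described in the overview.

```python
def parse_version(text):
    """Return the leading (major, minor) pair of a version string like '26.02.00'."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


class OldColumn:
    """Hypothetical stand-in for a column whose method still takes `mode`."""

    def to_pylibcudf(self, mode):
        return f"converted(mode={mode})"


class NewColumn:
    """Hypothetical stand-in for a column whose method no longer takes `mode`."""

    def to_pylibcudf(self):
        return "converted()"


def convert(column, lib_version, cutoff=(26, 2)):
    """Call to_pylibcudf() with or without `mode`, depending on the version.

    The cutoff is an assumption for illustration only.
    """
    if parse_version(lib_version) >= cutoff:
        return column.to_pylibcudf()
    return column.to_pylibcudf(mode="read")


new_result = convert(NewColumn(), "26.02.00")
old_result = convert(OldColumn(), "25.12.00")
```

Gating on the version rather than catching a `TypeError` keeps an unrelated `TypeError` raised inside the conversion from being silently swallowed by the fallback.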

python-package/xgboost/_data_utils.py

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

python-package/xgboost/_data_utils.py

mroeschke · 2026-04-06T21:27:40Z

python-package/xgboost/_data_utils.py

    # pylint: disable=protected-access
-    arrow_col = cats._column.to_pylibcudf(mode="read")
+    if cudf_read_only():
+        arrow_col = cats._column.to_pylibcudf()


Just curious about the use case for this call.

IIUC, at this point the function wants to return the CUDA array interface for categories that are strings, but it appears you are going through the Arrow C (device) interface to get there?

trivialfis · Apr 6, 2026 (edited)

Thank you for looking into this. Yes, that's the intention of the function call. This approach was suggested to me in an offline conversation at the time, but if there's a better way to achieve this, please share.

For context, XGBoost does the re-coding in CUDA/C++ and needs to extract the data from the cuDF Python objects.

mroeschke · Apr 7, 2026

Sure thing. The results of the pylibcudf.Column.data() and null_mask() functions expose a __cuda_array_interface__ attribute that you can use when composing jnames below, e.g.

    In [1]: import pylibcudf as plc, pyarrow as pa

    In [2]: plc_col = plc.Column.from_arrow(pa.array(["a", "b", None]))

    # the data of the "values"
    In [3]: plc_col.data().__cuda_array_interface__
    Out[3]:
    {'shape': (2,),
     'strides': None,
     'typestr': '|u1',
     'data': (126954835543040, False),
     'version': 3}

    # the null mask of the "values"
    In [5]: plc_col.null_mask().__cuda_array_interface__
    Out[5]:
    {'shape': (64,),
     'strides': None,
     'typestr': '|u1',
     'data': (126954835542016, False),
     'version': 3}

    # the "offsets"
    In [6]: plc_col.children()[0].data().__cuda_array_interface__
    Out[6]:
    {'shape': (16,),
     'strides': None,
     'typestr': '|u1',
     'data': (126954835542528, False),
     'version': 3}

You would probably need to perform the string dtype check on the cuDF Python object first since the typestr here doesn't indicate "string" e.g.

    if not (
        cats._column.dtype == np.dtype("object")
        or isinstance(cats._column.dtype, pd.StringDtype)
    ):
        raise TypeError(
            "Unexpected type for category index. It's neither numeric nor string."
        )
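
The three buffers in the session above follow the Arrow variable-length string layout: a byte buffer of UTF-8 values, a validity bitmap, and an offsets buffer. Each is exposed as raw bytes, which is why every `typestr` is `'|u1'`. The host-side sketch below decodes that layout from plain Python `bytes`, with no GPU or cuDF involved. The helper name `decode_arrow_strings` is hypothetical, and the sketch assumes 32-bit little-endian offsets, which is consistent with the 16-byte offsets buffer shown for three rows.

```python
import struct


def decode_arrow_strings(values, offsets, validity, n_rows):
    """Decode an Arrow-style string column held in three host byte buffers.

    values:   concatenated UTF-8 bytes of all non-null strings
    offsets:  (n_rows + 1) little-endian int32 offsets into `values` (assumed width)
    validity: bitmap, least-significant bit first; a 1 bit means "not null"
    """
    offs = struct.unpack(f"<{n_rows + 1}i", offsets[: 4 * (n_rows + 1)])
    out = []
    for i in range(n_rows):
        is_valid = (validity[i // 8] >> (i % 8)) & 1
        if is_valid:
            out.append(values[offs[i]:offs[i + 1]].decode("utf-8"))
        else:
            out.append(None)
    return out


# Host-side equivalents of the buffers for ["a", "b", None].
values = b"ab"                             # 2 bytes, matching shape (2,)
offsets = struct.pack("<4i", 0, 1, 2, 2)   # 16 bytes, matching shape (16,)
validity = bytes([0b00000011]) + bytes(63) # padded to 64 bytes, matching shape (64,)

decoded = decode_arrow_strings(values, offsets, validity, 3)
```

Because the buffers carry no type information of their own, the consumer needs to know from the column's dtype that they should be read as a string column, which is the reason for the dtype check on the cuDF object.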

mroeschke · Apr 7, 2026

Tangential question: I also see that DfCatAccessor here indicates that you could be passing around a cudf/pd.Series.cat object. Is that intentional?

Fix categorical data support with cuDF 2602.

f434de1

trivialfis requested a review from Copilot April 3, 2026 10:46

Copilot started reviewing on behalf of trivialfis April 3, 2026 10:46

Copilot AI reviewed Apr 3, 2026

python-package/xgboost/_data_utils.py Outdated

trivialfis added 2 commits April 3, 2026 19:14

Use version instead.

60c2b8c

lint.

c208f9a

trivialfis requested a review from Copilot April 3, 2026 11:29

Copilot started reviewing on behalf of trivialfis April 3, 2026 11:30

Copilot AI reviewed Apr 3, 2026

python-package/xgboost/_data_utils.py

trivialfis added 2 commits April 3, 2026 19:47

lint.

1c40fdb

lint.

abe7222

trivialfis requested a review from hcho3 April 3, 2026 11:54

jameslamb mentioned this pull request Apr 6, 2026

RAPIDS 26.02 support #12143

Open

mroeschke reviewed Apr 6, 2026

Fix categorical data support with cuDF 2602. #12140
trivialfis wants to merge 5 commits into dmlc:master from trivialfis:cudf-2602

3 participants
