Fix categorical data support with cuDF 2602. #12140

Open
trivialfis wants to merge 5 commits into dmlc:master from trivialfis:cudf-2602

Conversation

@trivialfis
Member

Close #12138

Contributor

Copilot AI left a comment

Pull request overview

Fixes a cuDF 26.02 compatibility break that caused XGBoost to crash when ingesting cuDF categorical columns on GPU by adapting how category columns are converted to pylibcudf / Arrow device arrays.

Changes:

  • Update cudf_cat_inf to call to_pylibcudf() without the removed mode argument on newer cuDF versions.
  • Add a fallback path for older cuDF versions that still require/accept mode="read".
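A minimal sketch of one way to pick between the two call forms, assuming the only difference between versions is whether `to_pylibcudf` has a `mode` parameter. `NewColumn`, `OldColumn`, and `to_plc` are hypothetical stand-ins so the example runs without cuDF; this is not the check the PR itself uses.

```python
import inspect


class NewColumn:
    """Hypothetical stand-in for a column on newer cuDF: no ``mode`` parameter."""

    def to_pylibcudf(self):
        return "plc-column"


class OldColumn:
    """Hypothetical stand-in for a column on older cuDF: ``mode`` is accepted."""

    def to_pylibcudf(self, mode):
        return f"plc-column(mode={mode})"


def to_plc(column):
    """Call ``to_pylibcudf`` with ``mode="read"`` only if the method has that parameter."""
    params = inspect.signature(column.to_pylibcudf).parameters
    if "mode" in params:
        return column.to_pylibcudf(mode="read")
    return column.to_pylibcudf()


new_result = to_plc(NewColumn())
old_result = to_plc(OldColumn())
```

Inspecting the signature avoids a `try`/`except TypeError`, which could also swallow a `TypeError` raised inside the conversion itself. It assumes the method exposes an introspectable signature.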

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.


@trivialfis requested a review from hcho3 on April 3, 2026, 11:54
# pylint: disable=protected-access
arrow_col = cats._column.to_pylibcudf(mode="read")
if cudf_read_only():
    arrow_col = cats._column.to_pylibcudf()

Just curious about the use case for this call.

IIUC, at this point this function wants to return the CUDA array interface for categories that are strings, but it appears you are going through the Arrow C (device) interface to get there?

Member Author

@trivialfis Apr 6, 2026

Thank you for looking into this. Yes, that's the intention of the function call. At the time this was suggested to me in an offline conversation, but if there's a better way to achieve this please share.

For context, XGBoost does the re-coding in CUDA/C++ and needs to extract the data from the cuDF Python objects.

Sure thing. The results from the pylibcudf.Column.data()/null_mask() functions expose a __cuda_array_interface__ attribute that you can use when composing jnames below. e.g.

In [1]: import pylibcudf as plc, pyarrow as pa

In [2]: plc_col = plc.Column.from_arrow(pa.array(["a", "b", None]))

# the data of the "values"
In [3]: plc_col.data().__cuda_array_interface__
Out[3]: 
{'shape': (2,),
 'strides': None,
 'typestr': '|u1',
 'data': (126954835543040, False),
 'version': 3}

# the null mask of the "values"
In [5]: plc_col.null_mask().__cuda_array_interface__
Out[5]: 
{'shape': (64,),
 'strides': None,
 'typestr': '|u1',
 'data': (126954835542016, False),
 'version': 3}

# the "offsets"
In [6]: plc_col.children()[0].data().__cuda_array_interface__
Out[6]: 
{'shape': (16,),
 'strides': None,
 'typestr': '|u1',
 'data': (126954835542528, False),
 'version': 3}
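To make the role of the three buffers concrete, here is a host-side sketch that decodes the same `["a", "b", None]` column from raw bytes. It assumes little-endian int32 offsets and a validity mask with the least-significant bit first, which is consistent with the shapes shown (16 offset bytes for 3 rows, 2 data bytes, a 64-byte padded mask). `decode_strings` is a hypothetical helper for illustration, not a pylibcudf or XGBoost function, and the byte values are stand-ins for the device memory.

```python
import struct


def decode_strings(data: bytes, offsets: bytes, null_mask: bytes, n_rows: int):
    """Decode an Arrow-style string column from raw host bytes."""
    # n_rows + 1 little-endian int32 offsets delimit each row's bytes.
    offs = struct.unpack_from(f"<{n_rows + 1}i", offsets)
    out = []
    for i in range(n_rows):
        # Bit i of the mask (least-significant bit first) is 1 for a valid row.
        valid = (null_mask[i // 8] >> (i % 8)) & 1
        out.append(data[offs[i]:offs[i + 1]].decode("utf-8") if valid else None)
    return out


# Host-side stand-ins for the device buffers of ["a", "b", None].
data = b"ab"                               # 2 bytes, matching shape (2,)
offsets = struct.pack("<4i", 0, 1, 2, 2)   # 16 bytes, matching shape (16,)
null_mask = bytes([0b011]) + bytes(63)     # 64 bytes, matching shape (64,)

decoded = decode_strings(data, offsets, null_mask, 3)
```

Because every buffer is reported with typestr `|u1`, the consumer has to know the logical type and offset width separately.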

You would probably need to perform the string dtype check on the cuDF Python object first since the typestr here doesn't indicate "string" e.g.

if not (
    cats._column.dtype == np.dtype("object")
    or isinstance(cats._column.dtype, pd.StringDtype)
):
    raise TypeError(
        "Unexpected type for category index. It's neither numeric nor string."
    )

Tangential question: I also see that DfCatAccessor here indicates that you could be passing around a cudf/pd.Series.cat object. Is that intentional?

Development

Successfully merging this pull request may close these issues.

xgboost 3.2.0 crashes with cudf 26.02 when there are categorical features

3 participants