Fix categorical data support with cuDF 2602. #12140
trivialfis wants to merge 5 commits into dmlc:master
Pull request overview
Fixes a cuDF 26.02 compatibility break that caused XGBoost to crash when ingesting cuDF categorical columns on GPU by adapting how category columns are converted to pylibcudf / Arrow device arrays.
Changes:
- Update `cudf_cat_inf` to call `to_pylibcudf()` without the removed `mode` argument on newer cuDF versions.
- Add a fallback path for older cuDF versions that still require/accept `mode="read"`.
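The two changes above amount to a version-dependent call. A minimal sketch of one way to express that, assuming the only difference between versions is whether `to_pylibcudf` accepts a `mode` parameter: the helper `to_pylibcudf_compat` and the two stand-in column classes are hypothetical and exist only so the example runs without cuDF installed. The PR itself gates on `cudf_read_only()`, whose implementation is not shown here, and signature inspection may not work for compiled methods.

```python
import inspect


def to_pylibcudf_compat(column):
    """Call column.to_pylibcudf(), passing mode="read" only if it is accepted.

    Hypothetical helper: it inspects the method signature instead of
    comparing cuDF version numbers.
    """
    params = inspect.signature(column.to_pylibcudf).parameters
    if "mode" in params:
        # Older API: the mode argument is still accepted.
        return column.to_pylibcudf(mode="read")
    # Newer API: the mode argument has been removed.
    return column.to_pylibcudf()


class OldStyleColumn:
    """Stand-in for a column whose to_pylibcudf still takes mode."""

    def to_pylibcudf(self, mode):
        return ("old", mode)


class NewStyleColumn:
    """Stand-in for a column whose to_pylibcudf takes no arguments."""

    def to_pylibcudf(self):
        return ("new", None)


old_result = to_pylibcudf_compat(OldStyleColumn())
new_result = to_pylibcudf_compat(NewStyleColumn())
```

Inspecting the signature avoids hard-coding a version threshold, at the cost of relying on the method being introspectable; an explicit version check like the one in the PR is the more predictable choice when the cut-over release is known.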
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
# pylint: disable=protected-access
arrow_col = cats._column.to_pylibcudf(mode="read")
if cudf_read_only():
    arrow_col = cats._column.to_pylibcudf()
Just curious about the use case for this call.
IIUC, at this point this function wants to return the CUDA array interface for categories that are strings, but it appears you are going through the Arrow C device interface to get there?
Thank you for looking into this. Yes, that's the intention of the function call. At the time this was suggested to me in an offline conversation, but if there's a better way to achieve this please share.
For context, XGBoost does the re-coding in CUDA/C++ and needs to extract the data from the cuDF Python object.
Sure thing. The results from the pylibcudf.Column.data()/null_mask() functions expose a __cuda_array_interface__ attribute that you can use when composing jnames below, e.g.
In [1]: import pylibcudf as plc, pyarrow as pa
In [2]: plc_col = plc.Column.from_arrow(pa.array(["a", "b", None]))
# the data of the "values"
In [3]: plc_col.data().__cuda_array_interface__
Out[3]:
{'shape': (2,),
'strides': None,
'typestr': '|u1',
'data': (126954835543040, False),
'version': 3}
# the null mask of the "values"
In [5]: plc_col.null_mask().__cuda_array_interface__
Out[5]:
{'shape': (64,),
'strides': None,
'typestr': '|u1',
'data': (126954835542016, False),
'version': 3}
# the "offsets"
In [6]: plc_col.children()[0].data().__cuda_array_interface__
Out[6]:
{'shape': (16,),
'strides': None,
'typestr': '|u1',
'data': (126954835542528, False),
'version': 3}
You would probably need to perform the string dtype check on the cuDF Python object first, since the typestr here doesn't indicate "string", e.g.
if not (
    cats._column.dtype == np.dtype("object")
    or isinstance(cats._column.dtype, pd.StringDtype)
):
    raise TypeError(
        "Unexpected type for category index. It's neither numeric nor string."
    )
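To make the three buffers in the session above concrete, here is a host-side sketch of how a chars buffer, an offsets buffer, and a validity mask together describe a string column. It assumes the Arrow string layout with 32-bit little-endian offsets and a least-significant-bit-first validity bitmap; the byte values are constructed by hand for `["a", "b", None]` rather than copied from device memory, and `decode_strings` is a hypothetical helper.

```python
import struct


def decode_strings(chars: bytes, offsets_raw: bytes, mask: bytes):
    """Decode an Arrow-style string column from its three raw buffers.

    Assumes int32 little-endian offsets and an LSB-first validity bitmap.
    """
    n_offsets = len(offsets_raw) // 4
    offsets = struct.unpack(f"<{n_offsets}i", offsets_raw)
    out = []
    for i in range(n_offsets - 1):
        # Bit i of the bitmap says whether row i is valid (1) or null (0).
        valid = (mask[i // 8] >> (i % 8)) & 1
        if valid:
            out.append(chars[offsets[i]:offsets[i + 1]].decode("utf-8"))
        else:
            out.append(None)
    return out


# Hand-built buffers for ["a", "b", None].
chars = b"ab"                                # 2 bytes, matching shape (2,)
offsets_raw = struct.pack("<4i", 0, 1, 2, 2)  # 16 bytes, matching shape (16,)
mask = bytes([0b011]) + bytes(63)            # 64 bytes, matching shape (64,)

decoded = decode_strings(chars, offsets_raw, mask)
```

Under these assumptions the buffer sizes line up with the `|u1` shapes reported in the session: three rows need four 4-byte offsets, and the mask is padded well beyond the single byte that carries the three validity bits.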
Tangential question: I also see that DfCatAccessor here indicates you could be passing around a cudf/pd.Series.cat object. Is that intentional?
Close #12138