23 changes: 22 additions & 1 deletion python-package/xgboost/_data_utils.py
@@ -581,6 +581,24 @@ def wait_event(event_hdl: int) -> None:
raise ValueError(msg)


def parse_cal_ver(ver: str) -> tuple[int, int]:
    """Parse calendar version."""
    vers = ver.strip().split(".")
    return int(vers[0]), int(vers[1])


@fcache
def cudf_has_mode() -> bool:
    """When cuDF >= 26.02, the `to_pylibcudf` method is read-only."""
    import cudf

    try:
        vers = parse_cal_ver(cudf.__version__)
        return vers[0] > 26 or (vers[0] == 26) and vers[1] >= 2
    except Exception:  # pylint: disable=broad-exception-caught
        return True


def cudf_cat_inf( # pylint: disable=too-many-locals
    cats: DfCatAccessor, codes: "pd.Series"
) -> Tuple[Union[CudaArrayInf, CudaStringArray], ArrayInf, Tuple]:
@@ -595,7 +613,10 @@ def cudf_cat_inf( # pylint: disable=too-many-locals
return cats_ainf, codes_ainf, (cats, codes)

    # pylint: disable=protected-access
    arrow_col = cats._column.to_pylibcudf(mode="read")
    if cudf_has_mode():
        arrow_col = cats._column.to_pylibcudf()

Just curious about the use case for this call.

IIUC, at this point the function wants to return the CUDA array interface for categories that are strings, but it appears you are going through the Arrow C device interface to get there?

@trivialfis (Member, Author), Apr 6, 2026

Thank you for looking into this. Yes, that's the intention of the function call. This approach was suggested to me in an offline conversation at the time, but if there's a better way to achieve this, please share.

For context, XGBoost does re-coding in CUDA/C++ and needs to extract the data from cuDF Python.
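
For readers unfamiliar with the term, re-coding here means translating integer category codes from one category ordering to another. The following is only a minimal host-side Python sketch of that idea, not XGBoost's CUDA/C++ implementation; the function name `recode` and the use of `-1` for missing or unseen categories are assumptions made for illustration.

```python
def recode(codes, ref_cats, new_cats):
    """Map codes that index new_cats onto the positions of the same values in ref_cats."""
    # Position of every reference category, e.g. {"a": 0, "b": 1, "c": 2}.
    index = {cat: i for i, cat in enumerate(ref_cats)}
    # Assumption for this sketch: -1 marks a missing value or an unseen category.
    return [index.get(new_cats[c], -1) if c >= 0 else -1 for c in codes]


ref_cats = ["a", "b", "c"]  # category order of the reference data
new_cats = ["c", "a"]       # category order of the incoming data
codes = [0, 1, -1, 0]       # incoming codes, indexing new_cats

recoded = recode(codes, ref_cats, new_cats)
print(recoded)  # [2, 0, -1, 2]
```

Doing this on the device requires access to the raw category values and codes, which is why the buffers have to be extracted from the cuDF object.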

Sure thing. The results of the pylibcudf.Column.data() and null_mask() functions expose a __cuda_array_interface__ attribute that you can use when composing jnames below, e.g.

In [1]: import pylibcudf as plc, pyarrow as pa

In [2]: plc_col = plc.Column.from_arrow(pa.array(["a", "b", None]))

# the data of the "values"
In [3]: plc_col.data().__cuda_array_interface__
Out[3]: 
{'shape': (2,),
 'strides': None,
 'typestr': '|u1',
 'data': (126954835543040, False),
 'version': 3}

# the null mask of the "values"
In [5]: plc_col.null_mask().__cuda_array_interface__
Out[5]: 
{'shape': (64,),
 'strides': None,
 'typestr': '|u1',
 'data': (126954835542016, False),
 'version': 3}

# the "offsets"
In [6]: plc_col.children()[0].data().__cuda_array_interface__
Out[6]: 
{'shape': (16,),
 'strides': None,
 'typestr': '|u1',
 'data': (126954835542528, False),
 'version': 3}

You would probably need to perform the string dtype check on the cuDF Python object first, since the typestr here doesn't indicate "string", e.g.

if not (cats._column.dtype == np.dtype("object") or isinstance(cats._column.dtype, pd.StringDtype)):
    raise TypeError(
        "Unexpected type for category index. It's neither numeric nor string."
    )
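
To make the buffer shapes in the session above concrete, here is a host-side sketch of how the three byte buffers (typestr `|u1`) describe `["a", "b", None]` in the Arrow string layout. It uses plain `bytes` as stand-ins for the device buffers and assumes 32-bit little-endian offsets and a null mask padded to 64 bytes, which matches the shapes printed above but should be verified against the actual column; `decode_string_column` is a hypothetical helper, not a pylibcudf API.

```python
import struct


def decode_string_column(data: bytes, offsets: bytes, null_mask: bytes, size: int) -> list:
    """Decode an Arrow-layout string column from raw host bytes."""
    # size + 1 little-endian int32 offsets delimit the strings (assumed width).
    offs = struct.unpack(f"<{size + 1}i", offsets[: 4 * (size + 1)])
    out = []
    for i in range(size):
        # Bit i of the validity bitmap (least-significant bit first) is 1 for a valid row.
        valid = (null_mask[i // 8] >> (i % 8)) & 1
        out.append(data[offs[i]:offs[i + 1]].decode("utf-8") if valid else None)
    return out


# Host-side stand-ins for the three buffers of ["a", "b", None].
data = b"ab"                                 # 2 bytes, like shape (2,)
offsets = struct.pack("<4i", 0, 1, 2, 2)     # 16 bytes, like shape (16,)
null_mask = bytes([0b00000011]) + bytes(63)  # 64 bytes, like shape (64,)

decoded = decode_string_column(data, offsets, null_mask, size=3)
print(decoded)  # ['a', 'b', None]
```

The point of the sketch is that none of the three buffers carries the logical type: they are all plain bytes, so the consumer has to know the column is a string column and how wide the offsets are.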

Tangential question: I also see that DfCatAccessor here indicates that you could be passing around a cudf/pd.Series.cat object. Is that intentional?

    else:
        arrow_col = cats._column.to_pylibcudf(mode="read")
    # Tuple[types.CapsuleType, types.CapsuleType]
    schema, array = arrow_col.__arrow_c_device_array__()
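
As a side note on the version gate in this diff, the year/month test can also be expressed as a tuple comparison, which avoids relying on `and`/`or` precedence. The sketch below is one possible variant, not the code in this PR: `cal_ver_tuple` and `at_least` are hypothetical names, and the regular expression is an assumption about how calendar versions such as `26.02.00` are formatted.

```python
import re


def cal_ver_tuple(ver: str) -> tuple:
    """Extract (year, month) from a calendar version such as '26.02.00a123'."""
    match = re.match(r"\s*(\d+)\.(\d+)", ver)
    if match is None:
        raise ValueError(f"Unrecognized version string: {ver!r}")
    return int(match.group(1)), int(match.group(2))


def at_least(ver: str, minimum: tuple) -> bool:
    """Return True when `ver` is at or above the (year, month) minimum."""
    # Tuples compare element by element: year first, then month.
    return cal_ver_tuple(ver) >= minimum


print(at_least("26.02.00", (26, 2)))    # True
print(at_least("25.12.01", (26, 2)))    # False
print(at_least("26.04.00a5", (26, 2)))  # True
```

The comparison `(year, month) >= (26, 2)` is equivalent to `year > 26 or (year == 26 and month >= 2)`.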
