Fix categorical data support with cuDF 2602. #12140
base: master
Changes from 3 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -581,6 +581,24 @@ def wait_event(event_hdl: int) -> None: | |
| raise ValueError(msg) | ||
|
|
||
|
|
||
| def parse_cal_ver(ver: str) -> tuple[int, int]: | ||
| """Parse calendar version.""" | ||
| vers = ver.strip().split(".") | ||
| return int(vers[0]), int(vers[1]) | ||
|
|
||
|
|
||
| @fcache | ||
| def cudf_has_mode() -> bool: | ||
| """When cuDF >= 26.02, the `to_pylibcudf` method is read-only.""" | ||
| import cudf | ||
|
|
||
| try: | ||
| vers = parse_cal_ver(cudf.__version__) | ||
| return vers[0] > 26 or (vers[0] == 26) and vers[1] >= 2 | ||
| except Exception: # pylint: disable=broad-exception-caught | ||
| return True | ||
|
|
||
|
|
||
| def cudf_cat_inf( # pylint: disable=too-many-locals | ||
| cats: DfCatAccessor, codes: "pd.Series" | ||
| ) -> Tuple[Union[CudaArrayInf, CudaStringArray], ArrayInf, Tuple]: | ||
|
|
@@ -595,7 +613,10 @@ def cudf_cat_inf( # pylint: disable=too-many-locals | |
| return cats_ainf, codes_ainf, (cats, codes) | ||
|
|
||
| # pylint: disable=protected-access | ||
| arrow_col = cats._column.to_pylibcudf(mode="read") | ||
| if cudf_has_mode(): | ||
| arrow_col = cats._column.to_pylibcudf() | ||
|
Just curious about the use case for this call. IIUC, at this point this function wants to return the CUDA array interface for categories that are strings, but it appears you are going through the Arrow C (device) interface to get there?
Member
Author
Thank you for looking into this. Yes, that's the intention of the function call. At the time, this was suggested to me in an offline conversation, but if there's a better way to achieve this, please share. For context, XGBoost does re-coding in CUDA/C++ and needs to extract the data from cuDF Python.

Sure thing. The results from the
In [1]: import pylibcudf as plc, pyarrow as pa
In [2]: plc_col = plc.Column.from_arrow(pa.array(["a", "b", None]))
# the data of the "values"
In [3]: plc_col.data().__cuda_array_interface__
Out[3]:
{'shape': (2,),
'strides': None,
'typestr': '|u1',
'data': (126954835543040, False),
'version': 3}
# the null mask of the "values"
In [5]: plc_col.null_mask().__cuda_array_interface__
Out[5]:
{'shape': (64,),
'strides': None,
'typestr': '|u1',
'data': (126954835542016, False),
'version': 3}
# the "offsets"
In [6]: plc_col.children()[0].data().__cuda_array_interface__
Out[6]:
{'shape': (16,),
'strides': None,
'typestr': '|u1',
'data': (126954835542528, False),
'version': 3}
You would probably need to perform the string dtype check on the cuDF Python object first since the
if not (cats._column.dtype == np.dtype("object") or isinstance(cats._column.dtype, pd.StringDtype)):
    raise TypeError(
        "Unexpected type for category index. It's neither numeric nor string."
    )

Tangential question, I also see
|
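For readers unfamiliar with the layout behind the buffer shapes in the session above, here is a minimal host-side sketch of an Arrow-style string column for `["a", "b", None]`. It assumes 32-bit little-endian offsets and an LSB-first validity bitmask, which is consistent with the 2-byte data buffer and 16-byte offsets buffer shown in the session. The helper name `encode_string_column` is hypothetical and is not part of cuDF, pylibcudf, or XGBoost. The 64-byte size of the null mask in the session comes from allocation padding, which this sketch does not model.

```python
import struct
from typing import List, Optional, Tuple


def encode_string_column(values: List[Optional[str]]) -> Tuple[bytes, bytes, bytes]:
    """Hypothetical helper: build (data, offsets, validity) buffers on the host."""
    data = bytearray()
    offsets = [0]
    validity = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value is not None:
            data += value.encode("utf-8")
            validity[i // 8] |= 1 << (i % 8)  # LSB-first bit order
        # A null entry repeats the previous offset (a zero-length slot).
        offsets.append(len(data))
    # Assumption: int32 offsets, little-endian.
    offsets_buf = struct.pack(f"<{len(offsets)}i", *offsets)
    return bytes(data), offsets_buf, bytes(validity)


data, offsets, validity = encode_string_column(["a", "b", None])
print(len(data), len(offsets), validity[0])
```

Three strings need four offsets, so the offsets buffer is 4 x 4 = 16 bytes, and only the two non-null characters land in the data buffer.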
||
| else: | ||
| arrow_col = cats._column.to_pylibcudf(mode="read") | ||
| # Tuple[types.CapsuleType, types.CapsuleType] | ||
| schema, array = arrow_col.__arrow_c_device_array__() | ||
|
|
||
|
|
||
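The version gate in the diff above can be exercised without cuDF installed. The following is a minimal sketch, not the PR's implementation: `parse_cal_ver` mirrors the function in the diff, while `no_mode_argument` is a hypothetical stand-in for `cudf_has_mode` that takes the version string as a parameter instead of importing `cudf`, uses a tuple comparison, and narrows the exception types. The `fcache` decorator from the diff is not reproduced here.

```python
from typing import Tuple


def parse_cal_ver(ver: str) -> Tuple[int, int]:
    """Parse a calendar version such as "26.02.00" into (26, 2)."""
    parts = ver.strip().split(".")
    return int(parts[0]), int(parts[1])


def no_mode_argument(ver: str) -> bool:
    """Hypothetical stand-in for the gate in the diff.

    Returns True when the version is at least 26.02, and also when the
    version string cannot be parsed (the same fallback as in the diff).
    """
    try:
        return parse_cal_ver(ver) >= (26, 2)
    except (ValueError, IndexError):
        return True


for candidate in ("25.12.00", "26.02.00", "27.01", "not-a-version"):
    print(candidate, no_mode_argument(candidate))
```

Comparing tuples avoids spelling out the major and minor cases separately, and it gives the same result as the boolean expression in the diff for well-formed versions.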