[python-package] Handle indicator (boolean) features in trees_to_dataframe#12089
Conversation
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
tests/python/test_parse_tree.py
Outdated
| X = rng.randint(0, 2, size=(n_samples, n_features)).astype(np.float32) | ||
| y = (X[:, 0] ^ X[:, 1]).astype(np.float32) |
There was a problem hiding this comment.
Fixed in 413b63c — XOR is now computed on the integer array before float32 cast.
python-package/xgboost/core.py
Outdated
| # {nid}:[{fname}] yes={yes},no={no} | ||
| # No split threshold or missing direction. | ||
| parse = [fid[0]] |
There was a problem hiding this comment.
Fixed in 87a3050 — added validation that the bracket expression contains no < or :{ and the remainder has yes=/no=, otherwise raises ValueError.
tests/python/test_parse_tree.py
Outdated
|
|
||
| # Create a feature map with indicator type 'i' | ||
| fmap_path = str(tmp_path / "fmap.txt") | ||
| with open(fmap_path, "w") as f: |
There was a problem hiding this comment.
Fixed in 87a3050 — now uses encoding="utf-8".
|
This one looks like it's stacked on top of the docks PR? |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| assert len(df) > 0 | ||
|
|
||
| # Indicator nodes should have NaN splits and NaN missing | ||
| non_leaf = df[df.Feature != "Leaf"] |
There was a problem hiding this comment.
Good catch — added assert len(non_leaf) > 0 in 2d1a71e.
python-package/xgboost/core.py
Outdated
| # Indicator nodes have no missing direction. | ||
| if len(stats) > 5 and stats[4] == "missing": | ||
| missings.append(str_i + "-" + stats[5]) | ||
| gains.append(float(stats[7])) | ||
| covers.append(float(stats[9])) | ||
| else: | ||
| missings.append(float("NAN")) | ||
| gains.append(float(stats[5])) | ||
| covers.append(float(stats[7])) |
There was a problem hiding this comment.
Correct — the C++ text dump sets no = DefaultChild(nid) for indicator splits, so the missing direction is the "no" child. Fixed in 2d1a71e: Missing now matches No instead of NaN.
trees_to_dataframe crashes with ValueError when the feature map
contains indicator type ('i') features, because the text dump format
for indicator nodes differs from quantitative and categorical:
Quantitative: 0:[fname<0.5] yes=1,no=2,missing=1,gain=...,cover=...
Categorical: 0:[fname:{0,1}] yes=1,no=2,missing=1,gain=...,cover=...
Indicator: 0:[fname] yes=1,no=2,gain=...,cover=...
The indicator format has no split threshold, no curly braces, and no
missing direction. The parser now recognizes this format and sets
split and missing to NaN for indicator nodes.
Remove blank line between import groups (ruff check) and cast float32 slices to int32 before applying bitwise XOR to fix TypeError on numpy versions that enforce safe casting.
…explicit encoding - Compute XOR on integer array before float32 cast to avoid TypeError - Add validation in indicator else-branch so unrecognized formats still raise ValueError - Use explicit encoding="utf-8" when writing the feature map file
…non_leaf non-empty For indicator splits, the C++ text dump encodes the default (missing) child as the "no" direction. Set Missing to match No instead of NaN. Also assert non_leaf is non-empty so the test guarantees coverage of indicator split parsing.
568f4cc to
2d1a71e
Compare
|
@RAMitchell Good catch — the docstring ( |
Add section documenting 3 merged PRs (dmlc#12087, dmlc#12089, dmlc#12094), 1 open PR (dmlc#12086), and 3 closed/superseded PRs.
Summary
trees_to_dataframe()crashes withValueError: Failed to parse model text dumpwhen the feature map contains indicator type (
'i') features.The C++ text dump produces three formats for split nodes:
0:[f<0.5] yes=1,no=2,missing=1,gain=...,cover=...0:[f:{0,1}] yes=1,no=2,missing=1,gain=...,cover=...0:[f] yes=1,no=2,gain=...,cover=...The indicator format has no
<or:{, no split threshold, and nomissing=field. The parser now recognizes this third format and correctly sets
Splitand
Missingto NaN for indicator nodes.Test plan
test_tree_to_df_indicatorintest_parse_tree.pyCloses #10437