[python-package] Handle indicator (boolean) features in trees_to_dataframe#12089

Merged
RAMitchell merged 5 commits into dmlc:master from thebandofficial:fix/trees-to-dataframe-indicator
Mar 17, 2026

@PavelGuzenfeld

Summary

trees_to_dataframe() crashes with ValueError: Failed to parse model text dump
when the feature map contains indicator type ('i') features.

The C++ text dump produces three formats for split nodes:

| Type | Format | Python parser |
| --- | --- | --- |
| Quantitative | `0:[f<0.5] yes=1,no=2,missing=1,gain=...,cover=...` | Handled |
| Categorical | `0:[f:{0,1}] yes=1,no=2,missing=1,gain=...,cover=...` | Handled |
| Indicator | `0:[f] yes=1,no=2,gain=...,cover=...` | Not handled |

The indicator format has no < or :{, no split threshold, and no missing=
field. The parser now recognizes this third format and correctly sets Split
and Missing to NaN for indicator nodes.
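
The three formats can be told apart from the bracket expression alone. The following is a minimal sketch of that classification, not the parser from this PR: `classify_split` and its regular expression are hypothetical names introduced for illustration, and the sample lines in the usage note use made-up gain and cover values.

```python
import math
import re

# Hypothetical pattern for "<nid>:[<expr>] <stats>", matching the formats listed above.
_NODE = re.compile(r"^\s*(\d+):\[(.+)\] (.+)$")


def classify_split(line):
    """Classify one split line of a text dump (illustrative only)."""
    match = _NODE.match(line)
    if match is None:
        raise ValueError(f"Failed to parse model text dump: {line!r}")
    nid, expr, stats = match.groups()

    if ":{" in expr:
        # Categorical: "fname:{c0,c1,...}"
        fname, cats = expr.split(":{", 1)
        split = cats.rstrip("}").split(",")
        return {"node": int(nid), "kind": "categorical", "feature": fname, "split": split}

    if "<" in expr:
        # Quantitative: "fname<threshold"
        fname, threshold = expr.rsplit("<", 1)
        return {"node": int(nid), "kind": "quantitative", "feature": fname,
                "split": float(threshold)}

    # Indicator: bare feature name, no threshold. Still require yes=/no=
    # so that unrecognized formats raise instead of being silently accepted.
    if "yes=" not in stats or "no=" not in stats:
        raise ValueError(f"Failed to parse model text dump: {line!r}")
    return {"node": int(nid), "kind": "indicator", "feature": expr, "split": math.nan}
```

For example, `classify_split("0:[f0] yes=1,no=2,gain=4.0,cover=10")` is classified as an indicator split with a NaN threshold, while a leaf line such as `1:leaf=0.1` raises `ValueError` because it has no bracket expression.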

Test plan

  • Added test_tree_to_df_indicator in test_parse_tree.py
  • Trains on binary features with an indicator feature map
  • Verifies the DataFrame is produced without error
  • Verifies indicator nodes have NaN splits and NaN missing

Closes #10437

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request.

@RAMitchell requested a review from Copilot on March 16, 2026 08:59

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment on lines +62 to +63
X = rng.randint(0, 2, size=(n_samples, n_features)).astype(np.float32)
y = (X[:, 0] ^ X[:, 1]).astype(np.float32)
Contributor Author

Fixed in 413b63c — XOR is now computed on the integer array before float32 cast.
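
A minimal sketch of the ordering issue behind this fix, assuming NumPy is installed: bitwise XOR is defined for integer arrays but not for float arrays, so the label is computed before the cast to `float32`. The seed and the sample and feature counts are placeholders, not the values used in the test.

```python
import numpy as np

rng = np.random.RandomState(0)  # placeholder seed
n_samples, n_features = 100, 3  # placeholder sizes

# Draw binary features as integers and compute XOR while they are still integers.
X_int = rng.randint(0, 2, size=(n_samples, n_features))
y = (X_int[:, 0] ^ X_int[:, 1]).astype(np.float32)

# Cast the features to float32 only after the label has been computed.
X = X_int.astype(np.float32)

# Applying ^ to the float32 columns raises TypeError, which is the failure
# the reordering avoids.
try:
    X[:, 0] ^ X[:, 1]
except TypeError:
    xor_on_floats_failed = True
else:
    xor_on_floats_failed = False
```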

Comment on lines +3162 to +3164
# {nid}:[{fname}] yes={yes},no={no}
# No split threshold or missing direction.
parse = [fid[0]]
Contributor Author

Fixed in 87a3050 — added validation that the bracket expression contains no < or :{ and the remainder has yes=/no=, otherwise raises ValueError.


# Create a feature map with indicator type 'i'
fmap_path = str(tmp_path / "fmap.txt")
with open(fmap_path, "w") as f:
Contributor Author

Fixed in 87a3050 — now uses encoding="utf-8".
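
A minimal sketch of writing such a feature map with an explicit encoding. It assumes the usual one-line-per-feature layout of feature id, feature name, and type, with `i` marking an indicator feature; the tab separator, the helper name `write_indicator_fmap`, and the feature names are assumptions for illustration, not taken from the test.

```python
import os
import tempfile


def write_indicator_fmap(path, feature_names):
    """Write one "<id>\t<name>\ti" line per feature (hypothetical helper)."""
    # An explicit encoding keeps the file independent of the platform default.
    with open(path, "w", encoding="utf-8") as f:
        for fid, name in enumerate(feature_names):
            f.write(f"{fid}\t{name}\ti\n")


with tempfile.TemporaryDirectory() as tmp_dir:
    fmap_path = os.path.join(tmp_dir, "fmap.txt")
    write_indicator_fmap(fmap_path, ["f0", "f1", "f2"])
    with open(fmap_path, encoding="utf-8") as f:
        fmap_text = f.read()
```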

@RAMitchell
Member

This one looks like it's stacked on top of the docs PR?

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

assert len(df) > 0

# Indicator nodes should have NaN splits and NaN missing
non_leaf = df[df.Feature != "Leaf"]
Contributor Author

Good catch — added assert len(non_leaf) > 0 in 2d1a71e.

Comment on lines +3187 to +3195
# Indicator nodes have no missing direction.
if len(stats) > 5 and stats[4] == "missing":
    missings.append(str_i + "-" + stats[5])
    gains.append(float(stats[7]))
    covers.append(float(stats[9]))
else:
    missings.append(float("NAN"))
    gains.append(float(stats[5]))
    covers.append(float(stats[7]))
Contributor Author

Correct — the C++ text dump sets no = DefaultChild(nid) for indicator splits, so the missing direction is the "no" child. Fixed in 2d1a71e: Missing now matches No instead of NaN.
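
The fallback described here can be sketched as follows. This is an illustration under assumptions, not the merged code: it looks fields up by key instead of by position as the snippet under review does, and `parse_split_stats` and `tree_id` are hypothetical names. The `"<tree>-<node>"` ID form follows the `str_i + "-" + ...` pattern in that snippet.

```python
def parse_split_stats(tree_id, stats_text):
    """Parse "yes=..,no=..[,missing=..],gain=..,cover=.." (illustrative only)."""
    fields = dict(item.split("=", 1) for item in stats_text.split(","))
    prefix = f"{tree_id}-"
    # Indicator splits carry no missing= field; the default child is the "no"
    # child, so Missing falls back to No instead of NaN.
    missing = fields.get("missing", fields["no"])
    return {
        "Yes": prefix + fields["yes"],
        "No": prefix + fields["no"],
        "Missing": prefix + missing,
        "Gain": float(fields["gain"]),
        "Cover": float(fields["cover"]),
    }
```

With this lookup, an indicator line such as `yes=1,no=2,gain=3.5,cover=10` in tree 0 yields `Missing == "0-2"`, while a line that does carry `missing=1` keeps `"0-1"`.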

PavelGuzenfeld and others added 5 commits March 17, 2026 11:41
trees_to_dataframe crashes with ValueError when the feature map
contains indicator type ('i') features, because the text dump format
for indicator nodes differs from quantitative and categorical:

  Quantitative: 0:[fname<0.5] yes=1,no=2,missing=1,gain=...,cover=...
  Categorical:  0:[fname:{0,1}] yes=1,no=2,missing=1,gain=...,cover=...
  Indicator:    0:[fname] yes=1,no=2,gain=...,cover=...

The indicator format has no split threshold, no curly braces, and no
missing direction. The parser now recognizes this format and sets
split and missing to NaN for indicator nodes.

Remove blank line between import groups (ruff check) and cast
float32 slices to int32 before applying bitwise XOR to fix
TypeError on numpy versions that enforce safe casting.

…explicit encoding

- Compute XOR on integer array before float32 cast to avoid TypeError
- Add validation in indicator else-branch so unrecognized formats still raise ValueError
- Use explicit encoding="utf-8" when writing the feature map file

…non_leaf non-empty

For indicator splits, the C++ text dump encodes the default (missing)
child as the "no" direction. Set Missing to match No instead of NaN.
Also assert non_leaf is non-empty so the test guarantees coverage of
indicator split parsing.
@PavelGuzenfeld force-pushed the fix/trees-to-dataframe-indicator branch from 568f4cc to 2d1a71e on March 17, 2026 09:41
@PavelGuzenfeld
Contributor Author

@RAMitchell Good catch — the docstring (-OO) commit was accidentally stacked on this branch. Rebased to remove it; this PR now only contains the indicator-feature fix.

@RAMitchell merged commit 9c7bd35 into dmlc:master on Mar 17, 2026
83 checks passed
@PavelGuzenfeld deleted the fix/trees-to-dataframe-indicator branch on March 18, 2026 21:03
PavelGuzenfeld added a commit to PavelGuzenfeld/xgboost that referenced this pull request Mar 19, 2026
Add section documenting 3 merged PRs (dmlc#12087, dmlc#12089, dmlc#12094),
1 open PR (dmlc#12086), and 3 closed/superseded PRs.
Development

Successfully merging this pull request may close these issues.

booster.trees_to_dataframe crashes when there are boolean feature_types 'i'

3 participants