[python-package] Handle indicator (boolean) features in trees_to_dataframe by PavelGuzenfeld · Pull Request #12089 · dmlc/xgboost

PavelGuzenfeld · 2026-03-15T19:27:45Z

Summary

trees_to_dataframe() crashes with ValueError: Failed to parse model text dump
when the feature map contains indicator type ('i') features.

The C++ text dump produces three formats for split nodes:

Type	Format	Python parser
Quantitative	`0:[f<0.5] yes=1,no=2,missing=1,gain=...,cover=...`	Handled
Categorical	`0:[f:{0,1}] yes=1,no=2,missing=1,gain=...,cover=...`	Handled
Indicator	`0:[f] yes=1,no=2,gain=...,cover=...`	Not handled

The indicator format has no < or :{, no split threshold, and no missing=
field. The parser now recognizes this third format and correctly sets Split
and Missing to NaN for indicator nodes.
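As a rough illustration of the three formats listed above, the bracket expression alone is enough to tell the node types apart. This is a minimal sketch, not the parser in `core.py`. The name `classify_split` is made up for the example, and the regular expression is an assumption about the line layout based only on the three sample lines in this description.

```python
import re

# Matches "<nid>:[<bracket expression>] <stats>". Leaf lines have no brackets
# and therefore do not match. (Assumed layout, taken from the samples above.)
_SPLIT_RE = re.compile(r"^\s*(\d+):\[(.+?)\]\s*(.*)$")


def classify_split(line: str) -> str:
    """Return 'quantitative', 'categorical', or 'indicator' for a split line."""
    match = _SPLIT_RE.match(line)
    if match is None:
        raise ValueError(f"Not a split node: {line!r}")
    expr = match.group(2)
    if ":{" in expr:
        # e.g. "f:{0,1}"
        return "categorical"
    if "<" in expr:
        # e.g. "f<0.5"
        return "quantitative"
    # Only the feature name is left, e.g. "f"
    return "indicator"
```

The `gain` and `cover` values are irrelevant to the classification, so any placeholder numbers work when trying it out.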

Test plan

Added test_tree_to_df_indicator in test_parse_tree.py
Trains on binary features with an indicator feature map
Verifies the DataFrame is produced without error
Verifies indicator nodes have NaN splits and NaN missing

Closes #10437

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

PavelGuzenfeld · 2026-03-17T09:42:15Z

tests/python/test_parse_tree.py

+        X = rng.randint(0, 2, size=(n_samples, n_features)).astype(np.float32)
+        y = (X[:, 0] ^ X[:, 1]).astype(np.float32)


Fixed in 413b63c — XOR is now computed on the integer array before float32 cast.
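For readers following along, the fix described here can be sketched as below. This is an illustrative reconstruction under assumptions, not the code from 413b63c: the sizes, the seed, and the `X_int` name are made up.

```python
import numpy as np

rng = np.random.RandomState(0)  # seed chosen only for the example
n_samples, n_features = 100, 4  # illustrative sizes

# Draw the binary features as integers so that bitwise XOR is defined.
X_int = rng.randint(0, 2, size=(n_samples, n_features))
y = (X_int[:, 0] ^ X_int[:, 1]).astype(np.float32)

# Cast the features to float32 only after the label has been computed.
X = X_int.astype(np.float32)

# Applying XOR directly to the float32 columns is rejected by NumPy.
try:
    X[:, 0] ^ X[:, 1]
    xor_on_float_failed = False
except TypeError:
    xor_on_float_failed = True
```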

PavelGuzenfeld · 2026-03-17T09:42:18Z

python-package/xgboost/core.py

+                        #   {nid}:[{fname}] yes={yes},no={no}
+                        # No split threshold or missing direction.
+                        parse = [fid[0]]


Fixed in 87a3050 — added validation that the bracket expression contains no < or :{ and the remainder has yes=/no=, otherwise raises ValueError.
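The validation described in this reply can be sketched roughly as follows. This is a standalone sketch under assumptions, not the code from 87a3050: the function name `parse_indicator_split` and the regular expressions are hypothetical, and only the error message text is taken from the PR summary.

```python
import re


def parse_indicator_split(line: str) -> tuple[str, str, str]:
    """Return (feature name, yes child, no child) for an indicator split."""
    match = re.match(r"^\s*\d+:\[([^\]]+)\]\s*(.*)$", line)
    if match is None:
        raise ValueError("Failed to parse model text dump")
    expr, remainder = match.groups()
    # An indicator bracket holds only the feature name, so a threshold
    # ("<") or a category set (":{") means this is some other format.
    if "<" in expr or ":{" in expr:
        raise ValueError("Failed to parse model text dump")
    # The remainder must start with the yes/no children.
    children = re.match(r"yes=(\d+),no=(\d+)", remainder)
    if children is None:
        raise ValueError("Failed to parse model text dump")
    return expr, children.group(1), children.group(2)
```

With this shape, an unrecognized line still fails loudly instead of being silently treated as an indicator node.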

PavelGuzenfeld · 2026-03-17T09:42:20Z

tests/python/test_parse_tree.py

+
+        # Create a feature map with indicator type 'i'
+        fmap_path = str(tmp_path / "fmap.txt")
+        with open(fmap_path, "w") as f:


Fixed in 87a3050 — now uses encoding="utf-8".
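A minimal sketch of writing an indicator feature map with an explicit encoding is shown below. The feature names are hypothetical, and the tab-separated `<id> <name> <type>` layout is the feature map format as I understand it from the XGBoost documentation, so treat the separator as an assumption.

```python
import tempfile
from pathlib import Path

feature_names = ["f0", "f1", "f2"]  # hypothetical feature names

with tempfile.TemporaryDirectory() as tmp_dir:
    fmap_path = Path(tmp_dir) / "fmap.txt"
    # An explicit encoding keeps the file independent of the platform default.
    with open(fmap_path, "w", encoding="utf-8") as f:
        for idx, name in enumerate(feature_names):
            # "i" marks the feature as an indicator (boolean) feature.
            f.write(f"{idx}\t{name}\ti\n")
    fmap_text = fmap_path.read_text(encoding="utf-8")
```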

RAMitchell · 2026-03-17T08:48:33Z

This one looks like it's stacked on top of the docs PR?

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

PavelGuzenfeld · 2026-03-17T09:42:04Z

tests/python/test_parse_tree.py

+        assert len(df) > 0
+
+        # Indicator nodes should have NaN splits and NaN missing
+        non_leaf = df[df.Feature != "Leaf"]


Good catch — added assert len(non_leaf) > 0 in 2d1a71e.
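The reason the extra assertion matters can be shown with plain Python. This is a toy sketch with made-up rows standing in for the DataFrame, not the test itself.

```python
import math

# Hypothetical rows: a tree that happens to contain only leaves.
rows = [
    {"Feature": "Leaf", "Split": float("nan")},
    {"Feature": "Leaf", "Split": float("nan")},
]
non_leaf = [row for row in rows if row["Feature"] != "Leaf"]

# all() over an empty sequence is True, so a check on the split values
# alone would pass without ever looking at an indicator node.
vacuous = all(math.isnan(row["Split"]) for row in non_leaf)

# Requiring a non-empty selection is what makes the check meaningful.
has_split_nodes = len(non_leaf) > 0
```

Here `vacuous` is `True` while `has_split_nodes` is `False`, which is the case the added assertion guards against.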

PavelGuzenfeld · 2026-03-17T09:42:06Z

python-package/xgboost/core.py

+                    # Indicator nodes have no missing direction.
+                    if len(stats) > 5 and stats[4] == "missing":
+                        missings.append(str_i + "-" + stats[5])
+                        gains.append(float(stats[7]))
+                        covers.append(float(stats[9]))
+                    else:
+                        missings.append(float("NAN"))
+                        gains.append(float(stats[5]))
+                        covers.append(float(stats[7]))


Correct — the C++ text dump sets no = DefaultChild(nid) for indicator splits, so the missing direction is the "no" child. Fixed in 2d1a71e: Missing now matches No instead of NaN.
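The resulting behavior can be sketched as follows. This is an illustrative reconstruction that reuses the indexing from the hunk above, not the code merged in 2d1a71e. The function name `parse_stats`, the `re.split("=|,", ...)` tokenization, and the sample numbers are assumptions.

```python
import re


def parse_stats(tree_idx: int, stats_text: str) -> dict:
    """Extract Missing, Gain, and Cover from the stats part of a split line."""
    stats = re.split("=|,", stats_text)
    str_i = str(tree_idx)
    if len(stats) > 5 and stats[4] == "missing":
        # yes=..,no=..,missing=..,gain=..,cover=..
        missing = str_i + "-" + stats[5]
        gain, cover = float(stats[7]), float(stats[9])
    else:
        # Indicator split: yes=..,no=..,gain=..,cover=..
        # The "no" child doubles as the default (missing) direction.
        missing = str_i + "-" + stats[3]
        gain, cover = float(stats[5]), float(stats[7])
    return {"Missing": missing, "Gain": gain, "Cover": cover}
```

For example, `parse_stats(0, "yes=1,no=2,gain=1.5,cover=10")` yields a `Missing` value of `"0-2"`, matching the "no" child.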

trees_to_dataframe crashes with ValueError when the feature map contains indicator type ('i') features, because the text dump format for indicator nodes differs from quantitative and categorical:

Quantitative: 0:[fname<0.5] yes=1,no=2,missing=1,gain=...,cover=...
Categorical: 0:[fname:{0,1}] yes=1,no=2,missing=1,gain=...,cover=...
Indicator: 0:[fname] yes=1,no=2,gain=...,cover=...

The indicator format has no split threshold, no curly braces, and no missing direction. The parser now recognizes this format and sets split and missing to NaN for indicator nodes.

…explicit encoding

- Compute XOR on integer array before float32 cast to avoid TypeError
- Add validation in indicator else-branch so unrecognized formats still raise ValueError
- Use explicit encoding="utf-8" when writing the feature map file

…non_leaf non-empty

For indicator splits, the C++ text dump encodes the default (missing) child as the "no" direction. Set Missing to match No instead of NaN. Also assert non_leaf is non-empty so the test guarantees coverage of indicator split parsing.

PavelGuzenfeld · 2026-03-17T09:41:56Z

@RAMitchell Good catch — the docstring (-OO) commit was accidentally stacked on this branch. Rebased to remove it; this PR now only contains the indicator-feature fix.

RAMitchell requested a review from Copilot March 16, 2026 08:34

Copilot started reviewing on behalf of RAMitchell March 16, 2026 08:34

Copilot AI reviewed Mar 16, 2026

RAMitchell requested a review from Copilot March 16, 2026 08:59

Copilot AI reviewed Mar 16, 2026

Copilot started reviewing on behalf of RAMitchell March 16, 2026 09:20

RAMitchell requested a review from Copilot March 17, 2026 08:47

Copilot started reviewing on behalf of RAMitchell March 17, 2026 08:48

Copilot AI reviewed Mar 17, 2026

PavelGuzenfeld and others added 5 commits March 17, 2026 11:41

Fix ruff lint and bitwise XOR on float32 in indicator test

31abb2f

Remove blank line between import groups (ruff check) and cast float32 slices to int32 before applying bitwise XOR to fix TypeError on numpy versions that enforce safe casting.

Fix ruff format: join single-line expression

d9bc441

PavelGuzenfeld force-pushed the fix/trees-to-dataframe-indicator branch from 568f4cc to 2d1a71e March 17, 2026 09:41

RAMitchell approved these changes Mar 17, 2026

RAMitchell merged commit 9c7bd35 into dmlc:master Mar 17, 2026
83 checks passed

PavelGuzenfeld deleted the fix/trees-to-dataframe-indicator branch March 18, 2026 21:03

PavelGuzenfeld added a commit to PavelGuzenfeld/xgboost that referenced this pull request Mar 19, 2026

Update tracker with existing contributions

eb07c85

Add section documenting 3 merged PRs (dmlc#12087, dmlc#12089, dmlc#12094), 1 open PR (dmlc#12086), and 3 closed/superseded PRs.

RAMitchell merged 5 commits into dmlc:master from thebandofficial:fix/trees-to-dataframe-indicator
