[python-package] Added tests on Booster.shuffle_model()#7168
daguirre11 wants to merge 1 commit into lightgbm-org:master
Conversation
jameslamb
left a comment
Thanks for starting this! This test should be made significantly stronger to give us high confidence that the behavior of shuffle_models() is correct.
I left some guidance in comments. But if you're feeling like it's too much for you to investigate right now, please let me know and we can close this so someone else can contribute it.
```python
train_set = lgb.Dataset(X_train, label=y_train)
booster = lgb.Booster(
    params={"objective": "binary", "verbose": -1},
    train_set=train_set,
)
for _ in range(2):
    booster.update()
```
Suggested change:

```python
booster = lgb.train(
    params={
        "objective": "binary",
        "num_iterations": 10,
        "num_leaves": 7,
        "verbose": -1,
    },
    train_set=lgb.Dataset(X, label=y),
)
```
Let's use lgb.train() for this instead of a for loop with booster.update(), and let's make the model smaller so the test runs faster.
```python
model_str_before = booster.model_to_string()
booster.shuffle_models(start_iteration=0, end_iteration=-1)
model_str_after = booster.model_to_string()
assert model_str_before != model_str_after
```
This is not a very strong test. For example, it'd pass if shuffle_models() corrupted the model in some serious and incorrect way. This should be made much stricter.
To do that, you'll have to look a bit deeper into what the function is doing. Start with the docstring:
LightGBM/python-package/lightgbm/basic.py
Lines 4509 to 4515 in e3d5270
The test should train 10 trees (for example) and:

- omit the first 2 trees, and confirm that their placement is not changed
- omit the final tree, and confirm that its placement isn't changed
- confirm that the set of trees is identical and only the ordering is different
- confirm that booster.predict() (with start_iteration left at its default) produces identical results before and after (ordering should not affect the predictions if you predict with all trees)
- check that the expected behavior happens if start_iteration is negative or end_iteration is larger than the number of trees in the model
```python
X_train, _, y_train, _ = train_test_split(
    *load_breast_cancer(return_X_y=True),
    test_size=0.1,
    random_state=42,
)
```
Suggested change:

```python
X, y = load_breast_cancer(return_X_y=True)
```
The test isn't using the held-out validation data, let's skip the unnecessary train-test splitting.
@jameslamb sorry for the late response, I will address all your comments on the PR today. Thank you for reviewing!
Contributes to #7031

BEFORE
AFTER

2 line difference in coverage for /lightgbm/basic.py

From my understanding, shuffle_models literally just reorders the trees by calling the C API LGBM_BoosterShuffleModels. The predictions will be the same but the actual model itself will be different, hence why model_to_string() is different before and after (Tree=0 and Tree=1 switch positions). Please let me know if I am understanding this correctly.