[python-package] fix misleading redundant parameter warnings in Booster.refit()#7124
[python-package] fix misleading redundant parameter warnings in Booster.refit()#7124arjunprakash027 wants to merge 20 commits intolightgbm-org:masterfrom
Conversation
jameslamb
left a comment
There was a problem hiding this comment.
Thanks for all the investigation you've done so far! Leaving a blocking review, as I'd like the opportunity to look into this and suggest a different fix.
This patch with a double-nested if block inside a for loop and uses of inspect and locals() looks quite complex, and I'm worried it'd be difficult to modify correctly in the future.
|
Thanks @jameslamb! |
jameslamb
left a comment
There was a problem hiding this comment.
Sorry for the long time to review... I'm ready now to work with you on this. Please see my suggestions and let me know if you have any questions.
python-package/lightgbm/basic.py
Outdated
| args_names = inspect.signature(Dataset._lazy_init).parameters.keys() | ||
| refit_signature = inspect.signature(self.__class__.refit).parameters | ||
| for lazy_init_args in args_names: | ||
| if lazy_init_args in refit_signature: | ||
| default_val = refit_signature[lazy_init_args].default | ||
| current_val = locals().get(lazy_init_args) | ||
| is_default = current_val is default_val | ||
|
|
||
| if is_default: | ||
| locals()[lazy_init_args] = new_params.get(lazy_init_args, current_val) | ||
| new_params.pop(lazy_init_args, None) |
There was a problem hiding this comment.
| args_names = inspect.signature(Dataset._lazy_init).parameters.keys() | |
| refit_signature = inspect.signature(self.__class__.refit).parameters | |
| for lazy_init_args in args_names: | |
| if lazy_init_args in refit_signature: | |
| default_val = refit_signature[lazy_init_args].default | |
| current_val = locals().get(lazy_init_args) | |
| is_default = current_val is default_val | |
| if is_default: | |
| locals()[lazy_init_args] = new_params.get(lazy_init_args, current_val) | |
| new_params.pop(lazy_init_args, None) | |
| # 'categorical_feature' can end up in self.params when a Booster | |
| # is created from a model string or file... pre-process to ensure it's passed | |
| # via a keyword argument to the Dataset constructor instead of 'params'. | |
| if "categorical_feature" in new_params: | |
| cat_features_from_params = new_params.pop("categorical_feature") | |
| if categorical_feature == "auto" or cat_features_from_params == categorical_feature: | |
| categorical_feature = cat_features_from_params | |
| else: | |
| error_msg = ( | |
| "'categorical_feature' value passed to Booster.refit() is different from " | |
| "'categorical_feature' value found in Booster.params. " | |
| "Preferring the value passed via keyword argument. " | |
| "Using refit() to change which columns are treated as categorical is not supported. " | |
| "If you have a valid use case for this, please open an issue at https://github.com/lightgbm-org/LightGBM/issues." | |
| ) | |
| raise LightGBMError(error_msg) |
I've investigated this see #6793 (comment)
This code using locals() and inspect.signature() is quite complex, especially just to avoid a warning.
categorical_feature is the only value that conflicts between both parameters: in the model string and the signature of Dataset._lazy_init()... let's just add specific handling for that.
(note: we don't need to care about the parameter aliases like cat_feature, categorical_column, etc. listed at https://lightgbm.readthedocs.io/en/latest/Parameters.html#categorical_feature, because only the main non-alias parameters are written to model strings / files).
Proposing the following:
- use this code I've suggested above, which reconciles the
categorical_featurekeyword arg and entry inself.params - add a unit test called
test_refit_does_not_warn_about_categorical_featuresintest_basic.pytesting the following:- no warnings raised (use something like the reproducible example from [python-package]
Booster.refit()raises a misleading warning when using categorical features #6793 (comment)) Booster.params["categorical_feature"]is unchanged- the expected error message is raised if the keyword argument and value in
paramsdiffer refit()
- no warnings raised (use something like the reproducible example from [python-package]
Please let me know if you have any questions. If you don't have time or interest in continuing this, please let me know and I'll push the changes to your branch here...You've put so much effort into this already and your investigation helped us to find the root cause, so one way or another I want you to get credit for this commit to LightGBM.
There was a problem hiding this comment.
Hey @jameslamb - Seems doable and I have no doubts until now.
I'll make the changes and notify.
Thanks for alternative suggestion for the fix
There was a problem hiding this comment.
I can also write the test for this right?
There was a problem hiding this comment.
Hey @jameslamb,
I've implemented the workaround we discussed to handle the categorical_feature parameter when loading a model from a string/text file.
Alongside the fix, I've added a new test (test_refit_does_not_warn_about_categorical_features). The test verifies three key things:
- That the categorical_feature loaded from the text file matches the one passed during the initial Dataset creation.
- That passing a conflicting categorical_feature list during refit() correctly raises a LightGBMError with the expected message.
- That refit() functions normally (without warnings/errors) when relying on the default parameter fallback.
For the test setup, I made sure to use the existing rng fixture in conftest.py for seeded data generation, and utilized PyTest's builtin tmp_path fixture for automatic creation and cleanup of the sample_model.txt file.
Let me know if the coding style looks alright to you or if you need me to adjust anything!
There was a problem hiding this comment.
Oh and, the below command will run the test
pytest 'tests/python_package_test/test_basic.py::test_refit_does_not_warn_about_categorical_features'
There was a problem hiding this comment.
Thanks for your work on this!
I went to review tonight and realized there were some additional changes needed, mainly to handle LightGBM's "parameter aliases". See https://lightgbm.readthedocs.io/en/latest/Parameters.html#categorical_feature ... any of categorical_feature, cat_feature, categorical_column, cat_column, categorical_features found in params all mean the same thing, and that has to be accounted for.
There is also a surprising (and slightly inconsistent) behavior I stumbled over where lightgbm is manually updating params using an alias ("categorical_column"),
LightGBM/python-package/lightgbm/basic.py
Line 2170 in d14c4ba
...something we try hard not to do in the library, especially because the "main" (non-alias) parameter names are written to model strings / files.
The changes needed require a pretty deep understanding of LightGBM's internals, so I decided to just push them directly myself: 9500c4d
Please take a look when you have time, and ask any questions you have. If you don't see any issues, and if CI passes, I'll merge this. (I'd usually ask for a review from another maintainer, but the repo isn't very active right now and this is a small change in a relatively rarely-used part of the API, so I think it's ok).
There was a problem hiding this comment.
Makes sense to me! thank you for the change.
Rookie mistake, I did not consider the alias and column name to index resolution done by lgbm before thinking about the fix.
Test looks better now (I was bit concerned about using a temp dir, its fixed now, thank you)
There was a problem hiding this comment.
No problem, thanks for looking and for getting us this far!
It just happened that fixing this required going fairly deep into LightGBM... the "parameter alias" thing is complex. It's a frequent source of bugs, and I even found another one while working through this (will address it in a follow-up PR).
If you're interested in contributing more after this, I'd be happy to suggest some other areas that would help you gain familiarity with the project more gradually.
There was a problem hiding this comment.
Ofcourse James, I'll be open to helping and contributing!
Thank you
… add validation test case
|
One other note... in the future when you contribute to When this PR is merged, |
Fix Misleading Redundant Parameter Warnings in
Booster.refit()Fixes #6793
Problem
As discussed in #6793 , the
Booster.refit()method raises misleading warnings about redundant parameters being passed to the internalDatasetconstructor. The issue occurs when parameters likecategorical_featureorlabel_columnare stored in the Booster's internalparamsdictionary and then inadvertently passed both as keyword arguments and within theparamsdictionary toDataset(), triggering false-positive warnings.For additional context, see #6793 (comment)
Solution
This PR modifies
Booster.refit()to implement parameter routing:Parameter Inspection: Uses
inspect.signature()to dynamically identify which parameters belong to theDatasetconstructor by examiningDataset._lazy_init.Parameter Routing: Checks if a Dataset-related parameter (such as
categorical_featureorlabel_column) exists in the Booster's internalparamsbut has not been explicitly overridden by the user in therefit()call. When a parameter qualifies for routing (i.e., it's a Dataset parameter with a default value in therefit()signature), it is moved fromnew_paramsto the local variable scope for theDatasetconstructor. This ensures each parameter is passed exactly once to the internalDatasetobject, eliminating the redundant warning.I've tried to resolve the warning issue without changing the actual behavior of the refitting process.