Skip to content

[python-package] fix misleading redundant parameter warnings in Booster.refit()#7124

Open
arjunprakash027 wants to merge 20 commits intolightgbm-org:masterfrom
arjunprakash027:master
Open

[python-package] fix misleading redundant parameter warnings in Booster.refit()#7124
arjunprakash027 wants to merge 20 commits intolightgbm-org:masterfrom
arjunprakash027:master

Conversation

@arjunprakash027
Copy link
Copy Markdown

@arjunprakash027 arjunprakash027 commented Jan 11, 2026

Fix Misleading Redundant Parameter Warnings in Booster.refit()

Fixes #6793

Problem

As discussed in #6793 , the Booster.refit() method raises misleading warnings about redundant parameters being passed to the internal Dataset constructor. The issue occurs when parameters like categorical_feature or label_column are stored in the Booster's internal params dictionary and then inadvertently passed both as keyword arguments and within the params dictionary to Dataset(), triggering false-positive warnings.

For additional context, see #6793 (comment)

Solution

This PR modifies Booster.refit() to implement parameter routing:

  1. Parameter Inspection: Uses inspect.signature() to dynamically identify which parameters belong to the Dataset constructor by examining Dataset._lazy_init.

  2. Parameter Routing: Checks if a Dataset-related parameter (such as categorical_feature or label_column) exists in the Booster's internal params but has not been explicitly overridden by the user in the refit() call. When a parameter qualifies for routing (i.e., it's a Dataset parameter with a default value in the refit() signature), it is moved from new_params to the local variable scope for the Dataset constructor. This ensures each parameter is passed exactly once to the internal Dataset object, eliminating the redundant warning.

I've tried to resolve the warning issue without changing the actual behavior of the refitting process.

Copy link
Copy Markdown
Member

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the investigation you've done so far! Leaving a blocking review, as I'd like the opportunity to look into this and suggest a different fix.

This patch with a double-nested if block inside a for loop and uses of inspect and locals() looks quite complex, and I'm worried it'd be difficult to modify correctly in the future.

@arjunprakash027
Copy link
Copy Markdown
Author

Thanks @jameslamb!
I agree, The complexity of code I wrote is overkill for this fix. I can try and write a simple fix, and also investigate if the problem is booster.refit() specific or if its an general problem.
Or, I can wait for your suggestion for a different fix.

Copy link
Copy Markdown
Member

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the long time to review... I'm ready now to work with you on this. Please see my suggestions and let me know if you have any questions.

Comment on lines +4910 to +4920
args_names = inspect.signature(Dataset._lazy_init).parameters.keys()
refit_signature = inspect.signature(self.__class__.refit).parameters
for lazy_init_args in args_names:
if lazy_init_args in refit_signature:
default_val = refit_signature[lazy_init_args].default
current_val = locals().get(lazy_init_args)
is_default = current_val is default_val

if is_default:
locals()[lazy_init_args] = new_params.get(lazy_init_args, current_val)
new_params.pop(lazy_init_args, None)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
args_names = inspect.signature(Dataset._lazy_init).parameters.keys()
refit_signature = inspect.signature(self.__class__.refit).parameters
for lazy_init_args in args_names:
if lazy_init_args in refit_signature:
default_val = refit_signature[lazy_init_args].default
current_val = locals().get(lazy_init_args)
is_default = current_val is default_val
if is_default:
locals()[lazy_init_args] = new_params.get(lazy_init_args, current_val)
new_params.pop(lazy_init_args, None)
# 'categorical_feature' can end up in self.params when a Booster
# is created from a model string or file... pre-process to ensure it's passed
# via a keyword argument to the Dataset constructor instead of 'params'.
if "categorical_feature" in new_params:
cat_features_from_params = new_params.pop("categorical_feature")
if categorical_feature == "auto" or cat_features_from_params == categorical_feature:
categorical_feature = cat_features_from_params
else:
error_msg = (
"'categorical_feature' value passed to Booster.refit() is different from "
"'categorical_feature' value found in Booster.params. "
"Preferring the value passed via keyword argument. "
"Using refit() to change which columns are treated as categorical is not supported. "
"If you have a valid use case for this, please open an issue at https://github.com/lightgbm-org/LightGBM/issues."
)
raise LightGBMError(error_msg)

I've investigated this see #6793 (comment)

This code using locals() and inspect.signature() is quite complex, especially just to avoid a warning.

categorical_feature is the only value that conflicts between both parameters: in the model string and the signature of Dataset._lazy_init()... let's just add specific handling for that.

(note: we don't need to care about the parameter aliases like cat_feature, categorical_column, etc. listed at https://lightgbm.readthedocs.io/en/latest/Parameters.html#categorical_feature, because only the main non-alias parameters are written to model strings / files).

Proposing the following:

  • use this code I've suggested above, which reconciles the categorical_feature keyword arg and entry in self.params
  • add a unit test called test_refit_does_not_warn_about_categorical_features in test_basic.py testing the following:

Please let me know if you have any questions. If you don't have time or interest in continuing this, please let me know and I'll push the changes to your branch here...You've put so much effort into this already and your investigation helped us to find the root cause, so one way or another I want you to get credit for this commit to LightGBM.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @jameslamb - Seems doable and I have no doubts until now.
I'll make the changes and notify.
Thanks for alternative suggestion for the fix

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can also write the test for this right?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @jameslamb,

I've implemented the workaround we discussed to handle the categorical_feature parameter when loading a model from a string/text file.

Alongside the fix, I've added a new test (test_refit_does_not_warn_about_categorical_features). The test verifies three key things:

  1. That the categorical_feature loaded from the text file matches the one passed during the initial Dataset creation.
  2. That passing a conflicting categorical_feature list during refit() correctly raises a LightGBMError with the expected message.
  3. That refit() functions normally (without warnings/errors) when relying on the default parameter fallback.

For the test setup, I made sure to use the existing rng fixture in conftest.py for seeded data generation, and utilized PyTest's builtin tmp_path fixture for automatic creation and cleanup of the sample_model.txt file.

Let me know if the coding style looks alright to you or if you need me to adjust anything!

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh and, the below command will run the test
pytest 'tests/python_package_test/test_basic.py::test_refit_does_not_warn_about_categorical_features'

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work on this!

I went to review tonight and realized there were some additional changes needed, mainly to handle LightGBM's "parameter aliases". See https://lightgbm.readthedocs.io/en/latest/Parameters.html#categorical_feature ... any of categorical_feature, cat_feature, categorical_column, cat_column, categorical_features found in params all mean the same thing, and that has to be accounted for.

There is also a surprising (and slightly inconsistent) behavior I stumbled over where lightgbm is manually updating params using an alias ("categorical_column"),

params["categorical_column"] = sorted(categorical_indices)

...something we try hard not to do in the library, especially because the "main" (non-alias) parameter names are written to model strings / files.

The changes needed require a pretty deep understanding of LightGBM's internals, so I decided to just push them directly myself: 9500c4d

Please take a look when you have time, and ask any questions you have. If you don't see any issues, and if CI passes, I'll merge this. (I'd usually ask for a review from another maintainer, but the repo isn't very active right now and this is a small change in a relatively rarely-used part of the API, so I think it's ok).

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me! thank you for the change.
Rookie mistake, I did not consider the alias and column name to index resolution done by lgbm before thinking about the fix.
Test looks better now (I was bit concerned about using a temp dir, its fixed now, thank you)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, thanks for looking and for getting us this far!

It just happened that fixing this required going fairly deep into LightGBM... the "parameter alias" thing is complex. It's a frequent source of bugs, and I even found another one while working through this (will address it in a follow-up PR).

If you're interested in contributing more after this, I'd be happy to suggest some other areas that would help you gain familiarity with the project more gradually.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ofcourse James, I'll be open to helping and contributing!
Thank you

@jameslamb
Copy link
Copy Markdown
Member

One other note... in the future when you contribute to git-based projects, create a new branch for each of your contributions instead of using the default branch (in this case, master).

When this PR is merged, master here and master on your fork will have incompatible histories, and you'll probably want to delete your fork and create a new one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[python-package] Booster.refit() raises a misleading warning when using categorical features

2 participants