missing values during prediction should throw an Exception if missing data wasn't present during training #4040
Description
Summary
When a model that was trained on data without missing values encounters missing values during prediction, it predicts those examples without logging a warning or raising an exception (see #2921):
- Missing numerical values are set to zero
- Missing categorical values are sent to the right leaf
As a result, several or all features could be missing and the model would still return a prediction (of unknown quality).
I propose to at least log a warning and allow the model to be configured in a strict mode where unexpected missing values lead to an exception (I would argue this should be the default, but it might not work for compatibility reasons).
Motivation
Changing the current behavior is important for using LightGBM in production. With a train-test split, missing data in the test set is easy to spot, and the test set is less likely to differ from the training set than production data is.
In production, data or code bugs can lead to one or multiple features being missing. In my experience, bugs that change the data happen as commonly as other bugs.
The current behavior would silently impute them to zero (numerical case) or assign them to an existing leaf (categorical case). The model would silently misbehave and it could be hard to detect, especially if the bug is only on the inference side, but not on the training data (which is typical when the data is not coming from a common feature store).
Description
Throw an exception when a feature has missing values during inference but had none during training. Imputation should probably be done before calling the model, so I propose making the exception the default behaviour.
You might not agree (hence the current implementation), so maybe it could be an option to log a warning?
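To make the proposal concrete, here is a minimal sketch of the check as a wrapper around an arbitrary predict function. This is not LightGBM code: `StrictMissingGuard`, `UnexpectedMissingValueError`, `columns_with_missing` and the `strict` switch are hypothetical names for illustration, and the sketch assumes rows are plain sequences in which `None` or NaN marks a missing value.

```python
import math
import warnings


class UnexpectedMissingValueError(ValueError):
    """A feature is missing at inference but was never missing in training."""


def _is_missing(value):
    # Assumption: None and float NaN are the only encodings of "missing".
    return value is None or (isinstance(value, float) and math.isnan(value))


def columns_with_missing(rows):
    """Return the set of column indices that contain at least one missing value."""
    return {j for row in rows for j, value in enumerate(row) if _is_missing(value)}


class StrictMissingGuard:
    """Hypothetical wrapper (not part of LightGBM) around any predict callable."""

    def __init__(self, predict_fn, train_rows, strict=True):
        self.predict_fn = predict_fn
        self.strict = strict
        # Remember which columns already had missing values during training.
        self.missing_in_training = columns_with_missing(train_rows)

    def predict(self, rows):
        # Columns that are missing now but were always present in training.
        unexpected = columns_with_missing(rows) - self.missing_in_training
        if unexpected:
            message = (
                "missing values in columns that had none during training: "
                f"{sorted(unexpected)}"
            )
            if self.strict:
                raise UnexpectedMissingValueError(message)
            warnings.warn(message, RuntimeWarning)
        return self.predict_fn(rows)
```

With `strict=True` the wrapper raises before the model is called; with `strict=False` it only warns and then predicts as today. Columns that already contained missing values in the training data are always allowed through, since the model has seen that case.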
It would also be great to improve the documentation on the use_missing=false flag:
"set this to false to disable the special handle of missing value"
The docstring doesn't explain what happens instead, during training and during inference, when the special handling of missing values is disabled.
References
#2921
https://lightgbm.readthedocs.io/en/latest/Parameters.html