Add random uniform sentinels to avoid overfitting #4622
Description
Preface
Sorry if this feature already exists or is being added to the framework. I looked everywhere and couldn't find it, but it's hard to search for because searching for "random + tree" leads to random forest 99.99% of the time.
Summary
Instead of using the traditional hyperparameters to control overfitting (like max_depth), add random uniform feature variables that act as sentinels to check whether splitting a node is going to lead to an overfitted tree.
Motivation and Description
Create N random (and therefore uncorrelated with the target) uniform feature variables between 0 and 1 and add them to the dataset. If, while constructing one of the trees, one of these sentinel features is selected over the real features of the dataset as the best feature to split a node, that node shouldn't be split: the algorithm found a spurious correlation that's better than any split on the real features. If this happens at the root, stop creating trees.
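To make the idea concrete, here is a minimal sketch in plain Python. It is not based on the framework's code: `add_sentinels`, `best_split` and `should_split` are hypothetical names, and the split criterion (reduction in squared error at a single node) is an assumption chosen for brevity rather than the gain the framework actually uses.

```python
import random


def add_sentinels(rows, n_sentinels, rng):
    """Append n_sentinels uniform(0, 1) columns to every row (hypothetical helper)."""
    return [list(row) + [rng.random() for _ in range(n_sentinels)] for row in rows]


def best_split(rows, targets):
    """Return (feature, threshold, gain) for the best squared-error split.

    Returns (None, None, 0.0) when no split has positive gain.
    """
    n = len(rows)
    total = sum(targets)
    total_sq = sum(t * t for t in targets)
    parent_sse = total_sq - total * total / n
    best = (None, None, 0.0)
    for j in range(len(rows[0])):
        order = sorted(range(n), key=lambda i: rows[i][j])
        left_sum = left_sq = 0.0
        for k in range(n - 1):
            i, nxt = order[k], order[k + 1]
            left_sum += targets[i]
            left_sq += targets[i] ** 2
            if rows[i][j] == rows[nxt][j]:
                continue  # cannot place a threshold between equal values
            n_left, n_right = k + 1, n - k - 1
            right_sum, right_sq = total - left_sum, total_sq - left_sq
            sse = (left_sq - left_sum ** 2 / n_left) + (right_sq - right_sum ** 2 / n_right)
            gain = parent_sse - sse
            if gain > best[2]:
                best = (j, (rows[i][j] + rows[nxt][j]) / 2, gain)
    return best


def should_split(rows, targets, n_real):
    """Refuse the split when the winning feature is a sentinel.

    Columns with index >= n_real are assumed to be sentinels.
    """
    feature, _threshold, _gain = best_split(rows, targets)
    if feature is None or feature >= n_real:
        return False, feature
    return True, feature


# Usage: one real feature plus three sentinels.
rng = random.Random(0)
xs = [rng.random() for _ in range(200)]
rows = add_sentinels([[x] for x in xs], n_sentinels=3, rng=rng)
targets = [1.0 if x > 0.5 else 0.0 for x in xs]
split_ok, chosen_feature = should_split(rows, targets, n_real=1)
```

In a real implementation the same check would run at every node, and a sentinel winning at the root would end boosting altogether.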
Alternatives
Allow user-defined predicate callbacks (with access to the environment) that run before a split happens and before a new tree is created, so that user-defined behaviour can stop node splitting and tree creation.
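As a rough illustration of what such a callback could look like, the sketch below grows a toy regression tree and consults a user-supplied predicate before every split. Everything here is hypothetical: `SplitEnv`, `grow` and the `before_split` signature are invented for this example and do not correspond to any existing API in the framework. Only the per-split callback is shown; a per-tree callback would follow the same pattern.

```python
from dataclasses import dataclass


@dataclass
class SplitEnv:
    """Information handed to the callback (hypothetical structure)."""
    depth: int
    n_samples: int
    feature: int
    threshold: float
    gain: float


def sse(values):
    """Sum of squared errors around the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def find_split(rows, targets):
    """Exhaustive search for the best (feature, threshold, gain), or None."""
    best = None
    parent = sse(targets)
    for j in range(len(rows[0])):
        # Skip the largest value so both children are non-empty.
        for thr in sorted({r[j] for r in rows})[:-1]:
            left = [t for r, t in zip(rows, targets) if r[j] <= thr]
            right = [t for r, t in zip(rows, targets) if r[j] > thr]
            gain = parent - sse(left) - sse(right)
            if best is None or gain > best[2]:
                best = (j, thr, gain)
    return best


def grow(rows, targets, before_split, depth=0):
    """Grow a tree, asking before_split(env) for permission at each node."""
    leaf = {"value": sum(targets) / len(targets)}
    split = find_split(rows, targets)
    if split is None or split[2] <= 0:
        return leaf
    feature, thr, gain = split
    env = SplitEnv(depth, len(rows), feature, thr, gain)
    if not before_split(env):
        return leaf  # the user callback vetoed this split
    left = [(r, t) for r, t in zip(rows, targets) if r[feature] <= thr]
    right = [(r, t) for r, t in zip(rows, targets) if r[feature] > thr]
    return {
        "feature": feature,
        "threshold": thr,
        "left": grow([r for r, _ in left], [t for _, t in left], before_split, depth + 1),
        "right": grow([r for r, _ in right], [t for _, t in right], before_split, depth + 1),
    }


# Usage: the sentinel rule expressed as a callback, assuming columns with
# index >= n_real are sentinels.
n_real = 1
tree = grow(
    [[0.1, 0.7], [0.2, 0.3], [0.8, 0.9], [0.9, 0.1]],
    [0.0, 0.0, 1.0, 1.0],
    before_split=lambda env: env.feature < n_real,
)
```

With a hook like this, the sentinel check becomes one predicate among many that users could write themselves, such as a minimum gain or a depth limit.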
References
https://www.kdnuggets.com/2019/10/feature-selection-beyond-feature-importance.html -> Feature Importance + Random Features section.