Add random uniform sentinels to avoid overfitting #4622

@Shulito

Preface

Sorry if this feature already exists or is already being added to the framework. I looked everywhere and it doesn't seem so, but it's hard to search for: a query like "random + tree" almost always leads to random forest results.

Summary

Instead of using the traditional hyperparameters to control overfitting (such as max_depth), add random uniform feature variables that act as sentinels, signalling when splitting a node would lead to an overfitted tree.

Motivation and Description

Create N random (and therefore uncorrelated with the target) uniform feature variables between 0 and 1 and add them to the dataset. If, while constructing a tree, one of these sentinel features is selected as the best feature to split a node, ahead of every real feature in the dataset, the node shouldn't be split: the best available split is a spurious correlation that beats any split on the real features. If this happens at the root, stop creating trees.
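To make the idea concrete, here is a minimal sketch in plain NumPy. It is not tied to the framework's internals: the helper names (`add_sentinels`, `best_split_feature`, `should_split`) are hypothetical, and the split criterion (reduction in squared error on a single node) is just an assumption picked for illustration.

```python
import numpy as np


def add_sentinels(X, n_sentinels, rng):
    """Append n_sentinels columns drawn from Uniform(0, 1) to X."""
    sentinels = rng.uniform(0.0, 1.0, size=(X.shape[0], n_sentinels))
    return np.hstack([X, sentinels])


def best_split_feature(X, y):
    """Return (feature index, gain) of the split that most reduces squared error.

    Returns (-1, 0.0) when no split improves on the unsplit node.
    """
    n = len(y)
    total_sse = np.sum((y - y.mean()) ** 2)
    best_feature, best_gain = -1, 0.0
    for j in range(X.shape[1]):
        order = np.argsort(X[:, j], kind="stable")
        xs, ys = X[order, j], y[order]
        csum, csq = np.cumsum(ys), np.cumsum(ys ** 2)
        for i in range(1, n):  # left child = first i sorted rows
            if xs[i] == xs[i - 1]:
                continue  # cannot place a threshold between equal values
            left_sse = csq[i - 1] - csum[i - 1] ** 2 / i
            right_sse = (csq[-1] - csq[i - 1]) - (csum[-1] - csum[i - 1]) ** 2 / (n - i)
            gain = total_sse - left_sse - right_sse
            if gain > best_gain:  # strict: ties keep the earlier (real) feature
                best_feature, best_gain = j, gain
    return best_feature, best_gain


def should_split(X_aug, y, n_real):
    """Split only if the winning feature is a real one (index < n_real)."""
    feature, _ = best_split_feature(X_aug, y)
    return 0 <= feature < n_real


# Usage: one real feature carrying a clear signal, plus three sentinels.
rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 40).reshape(-1, 1)
y = np.where(X[:, 0] > 0.5, 10.0, 0.0)
X_aug = add_sentinels(X, n_sentinels=3, rng=rng)
print(should_split(X_aug, y, n_real=X.shape[1]))
```

In a real tree learner the same check would run at every node on the rows that reach it, and a `False` result at the root would be the signal to stop creating trees.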

Alternatives

Allow user-defined predicate callbacks (with access to the environment) that run before a node is split and before a new tree is created, so that users can implement their own logic for stopping node splitting and tree creation.
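A rough sketch of what such a predicate callback could look like is below. Everything here is hypothetical: `SplitCandidate`, `make_sentinel_predicate` and `split_allowed` are made-up names for illustration, not an existing API, and the "environment" is reduced to a few fields.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Sequence


@dataclass(frozen=True)
class SplitCandidate:
    """Hypothetical view of the environment passed to a predicate."""
    tree_index: int
    depth: int
    feature_index: int
    gain: float


# A predicate returns True to allow the split and False to veto it.
SplitPredicate = Callable[[SplitCandidate], bool]


def make_sentinel_predicate(sentinel_indices: Iterable[int]) -> SplitPredicate:
    """Build a predicate that vetoes any split made on a sentinel feature."""
    blocked = frozenset(sentinel_indices)

    def allow(candidate: SplitCandidate) -> bool:
        return candidate.feature_index not in blocked

    return allow


def split_allowed(candidate: SplitCandidate, predicates: Sequence[SplitPredicate]) -> bool:
    """A split goes ahead only if every registered predicate allows it."""
    return all(predicate(candidate) for predicate in predicates)


# Usage: features 5 and 6 are sentinels; a split on feature 5 is vetoed.
predicates = [make_sentinel_predicate([5, 6])]
candidate = SplitCandidate(tree_index=0, depth=0, feature_index=5, gain=2.0)
print(split_allowed(candidate, predicates))
```

With a hook like this, the sentinel behaviour becomes one predicate among many, and applying the same predicates to the root candidate of a new tree could decide whether tree creation stops.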

References

https://www.kdnuggets.com/2019/10/feature-selection-beyond-feature-importance.html -> Feature Importance + Random Features section.
