-
A classification tree answers a problem by asking a series of questions. It takes the values of all the features of a point and asks a sequence of questions about them to classify the point.
-
Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as decision-tree methods.
-
Tree-based approaches are simple, but a single tree is usually less accurate than state-of-the-art approaches. Hence we have approaches like bagging, boosting, and random forests, which use the consensus of many trees to arrive at a solution.
-
The idea behind building a decision tree is to divide the overall predictor space into regions defined by a set of inequalities between the features and threshold values.
-
It is computationally infeasible to consider every possible sequence of splits, so we use a greedy strategy: at each step we take the best decision available at that step, choosing the split that reduces the MSE the most, without looking ahead to later splits.
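The greedy step can be sketched in a few lines of plain Python. This is a minimal illustration for a regression target, not a full tree builder: the names `sse` and `best_split` and the toy data are made up for this example, and candidate thresholds are assumed to be the midpoints between consecutive distinct feature values. Minimizing the summed squared error of the two child regions is the same as maximizing the reduction in MSE for the node.

```python
from typing import List, Optional, Tuple


def sse(values: List[float]) -> float:
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(
    X: List[List[float]], y: List[float]
) -> Optional[Tuple[int, float, float]]:
    """Return (feature index, threshold, total SSE) of the best single split.

    Rows with x[feature] <= threshold go left, the rest go right.
    Returns None when no feature has two distinct values.
    """
    best: Optional[Tuple[int, float, float]] = None
    for j in range(len(X[0])):
        distinct = sorted({row[j] for row in X})
        # Assumed candidate thresholds: midpoints between neighbouring values.
        for lo, hi in zip(distinct, distinct[1:]):
            threshold = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= threshold]
            right = [yi for row, yi in zip(X, y) if row[j] > threshold]
            total = sse(left) + sse(right)
            if best is None or total < best[2]:
                best = (j, threshold, total)
    return best


# Hypothetical toy data: feature 0 separates low and high responses cleanly.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [10.0, 4.0], [11.0, 7.0], [12.0, 6.0]]
y = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]
result = best_split(X, y)
print(result)
```

A full tree would apply `best_split` recursively to each resulting region until a stopping rule is met, such as a minimum number of points per region.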
-
But these methods can lead to overfitting, as they can fit even the noise in the data.
-
Hence, we do tree pruning.
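One common way to prune is cost-complexity pruning, shown here as an illustration rather than as the only option. The symbols are introduced for this sketch and do not appear in the notes above: $T$ is a subtree of the large tree, $|T|$ is its number of terminal nodes, $R_m$ is the region of the $m$-th terminal node, $\hat{y}_{R_m}$ is the mean response in that region, and $\alpha \ge 0$ is a tuning parameter, usually chosen by cross-validation. For each value of $\alpha$ we pick the subtree that minimizes

$$
\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \left( y_i - \hat{y}_{R_m} \right)^2 + \alpha \, |T|
$$

With $\alpha = 0$ the full tree is kept. Larger values of $\alpha$ penalize the number of terminal nodes more heavily and therefore select smaller trees.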
- Can handle large datasets
- Can handle discrete or continuous data well
- Ignores redundant variables
- Can handle missing values
- Poor predictions in high dimensions, due to overfitting or high variance.
- Large trees are hard to interpret.
- An individual tree is very noisy, so we can aggregate many trees.
- All the trees are trying to solve the same problem, so we can shake the data a little for each one by training it on a random sample of the data.
- We then average the trees; this helps because the trees differ, so the average has lower variance.
- Each tree gives a probability for the point we want to predict, and we can average these probabilities across the trees to get the answer.
- An example d_i is assigned to the class that the majority of the individual classifiers vote for.
- The bagging method creates a sequence of classifiers $H_m$, $m = 1, \dots, M$, each trained on a modified version of the training set. These classifiers are combined into a compound classifier. The prediction of the compound classifier is given as a weighted combination of the individual classifier predictions: $H(d_i) = \operatorname{sign}\left( \sum_{m=1}^{M} \alpha_m H_m(d_i) \right)$, where the more important (more precise) classifiers have larger weights $\alpha_m$.
- Easier to implement
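- Better than bagging.
- Builds on bagging by making the trees uncorrelated: if the trees are uncorrelated, averaging them reduces the variance by a factor of 1/(number of trees), whereas with bagging the reduction depends on how correlated the trees are. For $B$ trees, each with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of the average is $\rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2$, which equals $\sigma^2 / B$ only when $\rho = 0$.
- Dominates bagging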
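The bagging steps in the list above (train each classifier on a random sample of the data, then let the classifiers vote) can be sketched as follows. This is a minimal illustration under stated assumptions: it uses one-split "stumps" instead of full trees to stay short, draws each sample with replacement (a bootstrap sample), and gives every classifier an equal vote. The names `fit_stump`, `fit_bagged`, `predict_bagged`, and the toy data are hypothetical.

```python
import random
from collections import Counter
from typing import List, Optional, Tuple

# (feature index, threshold, label if x <= threshold, label otherwise)
Stump = Tuple[int, float, str, str]


def majority(labels: List[str]) -> str:
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X: List[List[float]], y: List[str]) -> Stump:
    """Fit a single-split classifier that minimizes training errors."""
    best: Optional[Stump] = None
    best_errors: Optional[int] = None
    for j in range(len(X[0])):
        distinct = sorted({row[j] for row in X})
        # Midpoints between neighbouring values; fall back to the only value.
        thresholds = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
        for t in thresholds or [distinct[0]]:
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            left_label = majority(left)
            right_label = majority(right) if right else left_label
            errors = sum(lab != left_label for lab in left)
            errors += sum(lab != right_label for lab in right)
            if best_errors is None or errors < best_errors:
                best, best_errors = (j, t, left_label, right_label), errors
    assert best is not None
    return best


def predict_stump(stump: Stump, x: List[float]) -> str:
    feature, threshold, left_label, right_label = stump
    return left_label if x[feature] <= threshold else right_label


def fit_bagged(X: List[List[float]], y: List[str],
               n_trees: int, seed: int = 0) -> List[Stump]:
    """Train n_trees stumps, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(X)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return stumps


def predict_bagged(stumps: List[Stump], x: List[float]) -> str:
    """Unweighted majority vote over the individual classifiers."""
    return majority([predict_stump(s, x) for s in stumps])


# Hypothetical toy data: two well-separated classes on one feature.
X = [[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]]
y = ["a"] * 4 + ["b"] * 4
stumps = fit_bagged(X, y, n_trees=25)
print(predict_bagged(stumps, [1.5]), predict_bagged(stumps, [11.5]))
```

An odd number of classifiers avoids tied votes in the two-class case. Replacing the equal votes with per-classifier weights would give the weighted combination described in the list, and restricting each split to a random subset of features is one way to reduce the correlation between trees.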
[https://lagunita.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/trees.pdf]