Random Forest (RF)

Ensemble Machine Learning

Authors
Affiliations

Doctor of Physical Therapy

B.S. in Kinesiology

Doctor of Physical Therapy

B.A. in Neuroscience

Keywords

Random forest, Random Forest Intuition, Random forest regression, Ensemble learning

Random forest is a form of “ensemble learning.”

RF is a further refinement of classification and regression tree (CART) models1. It is considered an improvement over a single-tree CART model on the grounds that a large group of randomized trees (a forest) can outperform a single “best” tree in a classification task1.

Process

  1. Pick K data points at random from the dataset
  2. Build a decision tree on those K data points
  3. Choose the number of trees (Ntree) you want to build and repeat steps 1 and 2 for each tree
  4. For a new data point, have each of your Ntree trees predict the value of Y and assign the new data point the average of those predicted Y-values (see the sketch after this list)
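
As a minimal illustration of these four steps, here is a hand-rolled forest in Python built on scikit-learn's DecisionTreeRegressor. The function names (fit_forest, predict_forest) and the choice to sample with replacement are our own assumptions; the cited source does not prescribe an implementation, and X and y are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def fit_forest(X, y, n_trees=500, k=None):
    """Steps 1-3: build n_trees trees, each on K points drawn at random."""
    k = k or len(X)  # default K: a sample the size of the dataset (assumption)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=k)   # step 1: pick K points at random
        forest.append(DecisionTreeRegressor().fit(X[idx], y[idx]))  # step 2
    return forest                               # step 3: repeated n_trees times

def predict_forest(forest, X_new):
    """Step 4: average the Y predicted by every tree in the forest."""
    return np.mean([tree.predict(X_new) for tree in forest], axis=0)
```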

Other

As a rule of thumb, a forest should have a minimum of 500 trees.

Multiple trees create a “forest,” and the algorithm averages the results across the forest, which reduces the likelihood of error.

Contraindications

  • Determining factors that impact “time-to-event” outcomes is not well suited to a random forest and is better handled by a Cox proportional hazards model1

Strengths

  • The greatest strength of RF is its ability to find meaningful interactions and non-linear effects among the predictors1.

Application

Once the RF has been fit, it can be used for prediction1.
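
For instance, with scikit-learn (an illustrative library choice, not one made by the cited source), fitting and predicting look like this:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Fit a forest on synthetic data, then predict Y for new data points.
X, y = make_regression(n_samples=300, n_features=4, random_state=0)
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
print(rf.predict(X[:3]))  # averaged prediction across the forest
```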

Interpretation

The RF refines the CART model by increasing the number of trees, and therefore the complexity, of the model, but this increased complexity makes interpretation difficult1. Most would find it hard, or at the very least unintuitive, to determine which variables (x) most strongly influence the predictions1.

To remedy this, one must use “variable importance measures” to gauge the impact of each variable in the model1. Some of these methods use a measure similar to the area under the curve, known as a “Gini coefficient,” to achieve this1. Alternatively, one could re-test the model with individual variables left out to see how the error rate is affected1.
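
As an illustration, scikit-learn offers both flavors: impurity-based (Gini-style) importances, and a permutation importance that shuffles each variable rather than literally refitting without it, a close stand-in for the leave-one-out idea above. The example below is our sketch, not the cited source's method:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

# Impurity-based importances (a Gini-style measure for each variable x).
print(rf.feature_importances_)

# Permutation importance: shuffle each x in turn and watch the error change,
# a stand-in for re-testing the model with each variable left out.
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```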

Parameters

Number of outcomes

Ideally, the dataset fed into the RF contains an equal number of individuals who experienced the outcome (y) and right-censored cases (cases where the outcome has not occurred)1. In survival data, the dataset generally contains more survivors (right-censored) than deaths (outcome events), so it may be useful to use a sub-sample of the survivors1.
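
A minimal sketch of that sub-sampling step, assuming a pandas DataFrame with a hypothetical “event” column (1 = outcome occurred, 0 = right-censored); the file and column names are illustrative only:

```python
import pandas as pd

# Hypothetical survival dataset; "event" marks whether the outcome occurred.
df = pd.read_csv("survival_data.csv")
events = df[df["event"] == 1]     # experienced the outcome
censored = df[df["event"] == 0]   # right-censored survivors

# Down-sample the (typically more numerous) censored cases to match events.
balanced = pd.concat([
    events,
    censored.sample(n=len(events), random_state=0),
])
```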

Number of variables (mtry)

One must also choose the number of variables (x) to consider at each split. By default, the mtry parameter is \(\sqrt{\textrm{# of predictors}}\)1.

Number of Trees (ntree)

The value of ntree directly affects computational intensity1. Generally, 100-1,000 trees is standard1. Smaller values of ntree can be used and may still perform well1.

Maximum splits (nsplit)

nsplit refers to the maximum number of splits to try for each variable1. Limiting nsplit is particularly important when working with continuous variables, since the algorithm would otherwise attempt to evaluate every possible split, which could require an extended amount of computing time1. Generally, setting nsplit to 10, 50, or 100 can improve the computing time significantly1.
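
For orientation, here is how these parameters map onto scikit-learn's regressor (our gloss: ntree corresponds to n_estimators and mtry to max_features; nsplit comes from R's randomForestSRC package and has no direct scikit-learn equivalent, though min_samples_leaf similarly limits how finely continuous variables are split):

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=500,     # ntree: number of trees in the forest
    max_features="sqrt",  # mtry: sqrt(# of predictors) considered per split
    min_samples_leaf=5,   # no direct nsplit analogue; coarser leaves likewise
                          # cap how finely continuous variables are split
    random_state=0,
)
```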

Variable importance measures

Finally, variable importance measures rely on multiple iterations of the forest with and without each variable, so collecting them is computationally intensive. One approach is to collect these measures during initial model construction and to leave them out thereafter.

References

1. Rigatti SJ. Random Forest. Journal of Insurance Medicine (New York, NY). 2017;47(1):31-39. doi:10.17849/insm-47-01-31-39
