Random Forest (RF)

Ensemble Machine Learning

Authors
Affiliations

Doctor of Physical Therapy

B.S. in Kinesiology

Doctor of Physical Therapy

B.A. in Neuroscience

Keywords

Random forest, Random Forest Intuition, Random forest regression, Ensemble learning

Random forest is a form of “ensemble learning.”

RF is a further refinement of classification and regression tree (CART) models1. It is considered an improvement over a single-tree CART model on the grounds that a large group of randomized trees (a forest) can outperform a single “best” tree in a classification task1.

Process

  1. Pick K data points at random from the dataset
  2. Build a decision tree on those K data points
  3. Choose the number of trees (Ntree) you want to build and repeat steps 1 and 2 for each tree
  4. For a new data point, have each of your Ntree trees predict the value of Y and assign the new data point the average of those predicted Y-values (see the sketch after this list)
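
As a minimal illustration of these four steps, here is a hand-rolled forest in Python built on scikit-learn's DecisionTreeRegressor. The function names (fit_forest, predict_forest) and the choice to sample with replacement are our own assumptions; the cited source does not prescribe an implementation, and X and y are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def fit_forest(X, y, n_trees=500, k=None):
    """Steps 1-3: build n_trees trees, each on K points drawn at random."""
    k = k or len(X)  # default K: a sample the size of the dataset (assumption)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=k)   # step 1: pick K points at random
        forest.append(DecisionTreeRegressor().fit(X[idx], y[idx]))  # step 2
    return forest                               # step 3: repeated n_trees times

def predict_forest(forest, X_new):
    """Step 4: average the Y predicted by every tree in the forest."""
    return np.mean([tree.predict(X_new) for tree in forest], axis=0)
```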

Other

As a rule of thumb, a forest should have a minimum of 500 trees.

Multiple trees create a “forest,” and the algorithm averages the results across the forest, which reduces the likelihood of error.

Contraindications

  • Determining factors that impact “time-to-event” outcomes is not well suited to a random forest and is better handled by a Cox proportional hazards model1

Strengths

  • The greatest strength of RF is its ability to find meaningful interactions and non-linear effects among the predictors1.

Application

Once the RF has been fit, it can be used for prediction1.
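
For instance, with scikit-learn (an illustrative library choice, not one made by the cited source), fitting and predicting look like this:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Fit a forest on synthetic data, then predict Y for new data points.
X, y = make_regression(n_samples=300, n_features=4, random_state=0)
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
print(rf.predict(X[:3]))  # averaged prediction across the forest
```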

Interpretation

The RF refines the CART model by increasing the number of trees, and therefore the complexity, of the model, but this increased complexity makes interpretation difficult1. Most would find it hard, or at the very least unintuitive, to determine which variables (x) most strongly influence the predictions1.

To remedy this, one must use “variable importance measures” to gauge the impact of each variable in the model1. Some of these methods use a measure similar to the area under the curve, known as a “Gini coefficient,” to achieve this1. Alternatively, one could re-test the model with individual variables left out to see how the error rate is affected1.
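
As an illustration, scikit-learn offers both flavors: impurity-based (Gini-style) importances, and a permutation importance that shuffles each variable rather than literally refitting without it, a close stand-in for the leave-one-out idea above. The example below is our sketch, not the cited source's method:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

# Impurity-based importances (a Gini-style measure for each variable x).
print(rf.feature_importances_)

# Permutation importance: shuffle each x in turn and watch the error change,
# a stand-in for re-testing the model with each variable left out.
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```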

Parameters

Number of outcomes

Ideally, the dataset fed into the RF contains an equal number of individuals who experienced the outcome (y) and right-censored cases (cases where the outcome has not occurred)1. In survival data, the dataset generally contains more survivors (right-censored) than deaths (outcome events), so it may be useful to use a sub-sample of the survivors1.
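
A minimal sketch of that sub-sampling step, assuming a pandas DataFrame with a hypothetical “event” column (1 = outcome occurred, 0 = right-censored); the file and column names are illustrative only:

```python
import pandas as pd

# Hypothetical survival dataset; "event" marks whether the outcome occurred.
df = pd.read_csv("survival_data.csv")
events = df[df["event"] == 1]     # experienced the outcome
censored = df[df["event"] == 0]   # right-censored survivors

# Down-sample the (typically more numerous) censored cases to match events.
balanced = pd.concat([
    events,
    censored.sample(n=len(events), random_state=0),
])
```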

Number of variables (mtry)

One must also choose the number of variables (x) to consider at each split. By default, the mtry parameter is \(\sqrt{\textrm{# of predictors}}\)1.

Number of Trees (ntree)

The value of ntree directly affects computational intensity1. Generally, 100-1,000 trees is standard1. Smaller values of ntree can be used and may still perform well1.

Maximum splits (nsplit)

nsplit refers to the maximum number of splits to try for each variable1. Limiting nsplit is particularly important when working with continuous variables, since the algorithm would otherwise attempt to evaluate every possible split, which could require an extended amount of computing time1. Generally, setting nsplit to 10, 50, or 100 can improve the computing time significantly1.
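
For orientation, here is how these parameters map onto scikit-learn's regressor (our gloss: ntree corresponds to n_estimators and mtry to max_features; nsplit comes from R's randomForestSRC package and has no direct scikit-learn equivalent, though min_samples_leaf similarly limits how finely continuous variables are split):

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=500,     # ntree: number of trees in the forest
    max_features="sqrt",  # mtry: sqrt(# of predictors) considered per split
    min_samples_leaf=5,   # no direct nsplit analogue; coarser leaves likewise
                          # cap how finely continuous variables are split
    random_state=0,
)
```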

Variable importance measures

Finally, variable importance measures rely on multiple iterations of the forest with and without each variable, so collecting them is computationally intensive. One approach is to collect these measures during initial model construction and to leave them out thereafter.

References

1. Rigatti SJ. Random Forest. Journal of Insurance Medicine (New York, NY). 2017;47(1):31-39. doi:10.17849/insm-47-01-31-39
