Random Forest

Pavini Jain
2 min read · Dec 14, 2021


Intuition

  1. Build a Random Forest:
  • Repeat the following hundreds of times to make many decision trees:

▹Create a bootstrapped dataset:

✓ Randomly select samples from the original dataset.

✓ The same sample can be picked more than once.

▹Create a Decision Tree from the bootstrapped dataset, but only consider a random subset of variables (columns) at each step.

✓ E.g., consider only 2 variables at each step.

✓ Build the decision tree.

  • To classify a new sample, run it through all the decision trees and see which option gets the most votes; that is the answer. Bootstrapping the data plus using the aggregate (the votes) to make a decision is called Bagging. A minimal sketch of this build-and-vote loop follows below.
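Here is a minimal sketch of that loop in Python, using scikit-learn's DecisionTreeClassifier as the base learner. The Iris dataset, the 100-tree count, and the max_features=2 setting are illustrative assumptions, not fixed choices:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # placeholder dataset
rng = np.random.default_rng(0)

trees = []
for _ in range(100):  # build many decision trees
    # Bootstrapped dataset: sample rows with replacement,
    # so the same sample can be picked more than once.
    idx = rng.integers(0, len(X), size=len(X))
    # max_features=2 makes the tree consider only a random
    # subset of 2 variables at each split.
    tree = DecisionTreeClassifier(max_features=2, random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# Aggregate: every tree votes on a new sample; majority wins (bagging).
votes = np.array([t.predict(X[:1])[0] for t in trees]).astype(int)
print("Predicted class:", np.bincount(votes).argmax())
```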

2. How to check the accuracy:

  • Typically, about 1/3rd of the samples do not end up in a given bootstrapped dataset. Those samples form the Out-of-Bag dataset.
  • Run each Out-of-Bag sample through all the trees that were built without it and see whether they classify it correctly.
  • Count how many Out-of-Bag samples were correctly labeled.
  • Ultimately, we measure how accurate our random forest is by the proportion of Out-of-Bag samples that were correctly classified. The proportion of Out-of-Bag samples that were incorrectly classified is the Out-of-Bag Error. A sketch of this check follows below.
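A minimal sketch of that check, again assuming scikit-learn and the Iris dataset as placeholders. Passing oob_score=True asks the library to score each sample using only the trees whose bootstrapped datasets did not include it:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# bootstrap=True (the default) builds each tree on a bootstrapped dataset;
# oob_score=True evaluates every sample only on the trees that never saw it.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

print("Out-of-Bag accuracy:", forest.oob_score_)   # proportion correctly classified
print("Out-of-Bag error:", 1 - forest.oob_score_)  # proportion misclassified
```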

3. Change the number of variables used per step, repeat all the steps a bunch of times, and choose the setting that gives the most accurate random forest.

  • Typically, we start by using the square root of the number of variables and then try a few settings above and below that value, as in the sketch below.
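A small sketch of that tuning loop, with scikit-learn and the Iris dataset as illustrative assumptions; max_features controls how many variables each split considers:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
start = int(np.sqrt(X.shape[1]))  # square root of the number of variables

# Try a few settings around the square-root default and keep the one
# with the lowest Out-of-Bag error.
for m in range(max(1, start - 1), min(X.shape[1], start + 1) + 1):
    forest = RandomForestClassifier(n_estimators=100, max_features=m,
                                    oob_score=True, random_state=0)
    forest.fit(X, y)
    print(f"max_features={m}: OOB error = {1 - forest.oob_score_:.3f}")
```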
