1. Build a Random Forest:
- Repeat the following hundreds of times to build many decision trees:
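▹ Create a bootstrapped dataset:
✓ Randomly select samples from the original dataset until the bootstrapped dataset is the same size as the original.
✓ The same sample can be picked more than once (sampling with replacement).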
▹ Create a Decision Tree using the bootstrapped dataset, but consider only a random subset of variables (columns) at each step.
✓ E.g., consider only 2 randomly chosen variables as candidates at each step.
✓ Build the rest of the decision tree the same way.
- To classify a new sample, run it through all the decision trees and see which option gets the most votes. That option is the forest's answer. Bootstrapping the data and then using the aggregate to make a decision is called Bagging.
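The three ingredients of step 1 (bootstrapping, a random subset of variables, and majority voting) can be sketched in plain Python. This is only a rough illustration, not a full random forest: the tree-building itself is left out, and the helper names `bootstrap_indices`, `random_feature_subset`, and `majority_vote` are made up for this example.

```python
import random
from collections import Counter

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Rows that were never drawn are left out of this bootstrapped dataset.
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag

def random_feature_subset(n_features, k, rng):
    """Pick k distinct columns to consider at one step of tree building."""
    return rng.sample(range(n_features), k)

def majority_vote(predictions):
    """Return the option that received the most votes across the trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)                        # fixed seed so the run is repeatable
in_bag, out_of_bag = bootstrap_indices(10, rng)
columns = random_feature_subset(4, 2, rng)    # e.g. 2 of 4 variables
vote = majority_vote(["yes", "no", "yes"])    # "yes" wins 2 to 1
print(in_bag, out_of_bag, columns, vote)
```

Because `in_bag` is drawn with replacement, it usually contains repeated indices, and whatever indices are missing from it show up in `out_of_bag`.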
2. How to check the accuracy:
- Typically, about 1/3rd of the original samples do not end up in a given bootstrapped dataset. These left-out samples are called the Out-of-Bag dataset for that tree.
- Run each Out-of-Bag sample through all the trees that were built without it, and see whether the majority vote classifies it correctly.
- Calculate how many Out-of-Bag samples were correctly labeled.
- Ultimately, we can measure how accurate our random forest is by the proportion of Out-of-Bag samples that were correctly classified by the random forest. The proportion of Out-of-Bag samples that were incorrectly classified is the Out-of-Bag Error.
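The Out-of-Bag check above can be sketched as follows. It assumes each tree is stored as a pair of (set of in-bag row indices, prediction function); that representation, the function name `oob_error`, and the hand-written threshold rules standing in for trees are all assumptions made for this example.

```python
from collections import Counter

def oob_error(X, y, trees):
    """trees: list of (in_bag_indices, predict) pairs, with predict(row) -> label."""
    wrong = 0
    scored = 0
    for i, (row, label) in enumerate(zip(X, y)):
        # Only trees whose bootstrapped dataset did NOT contain sample i may vote.
        votes = [predict(row) for in_bag, predict in trees if i not in in_bag]
        if not votes:
            continue  # sample was in every bootstrapped dataset, so it has no vote
        prediction = Counter(votes).most_common(1)[0][0]
        scored += 1
        wrong += prediction != label
    return wrong / scored

# Toy data: one variable; the label is 1 when the value exceeds 5.
X = [[1], [3], [6], [8]]
y = [0, 0, 1, 1]

# Hypothetical stand-ins for trees: single-threshold rules plus their in-bag rows.
trees = [
    ({0, 1, 2}, lambda row: int(row[0] > 5)),  # sample 3 is out-of-bag
    ({1, 2, 3}, lambda row: int(row[0] > 0)),  # sample 0 is out-of-bag (poor rule)
    ({0, 2, 3}, lambda row: int(row[0] > 5)),  # sample 1 is out-of-bag
    ({0, 1, 3}, lambda row: int(row[0] > 5)),  # sample 2 is out-of-bag
]

error = oob_error(X, y, trees)
print(error)  # 0.25: only sample 0 is misclassified by its out-of-bag vote
```

Accuracy is then simply `1 - error`, the proportion of Out-of-Bag samples that were classified correctly.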
3. Change the number of variables used per step. Repeat all the steps above for each setting, then choose the setting that is most accurate (lowest Out-of-Bag Error).
- Typically, we start by using the square root of the number of variables and then try a few settings above and below that value.
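The search in step 3 can be sketched as below. The helper names `candidate_settings` and `pick_best` and the `spread` parameter are invented for this example, and the error values are placeholders for illustration, not measured results. In practice each one would come from building a forest with that setting and computing its Out-of-Bag Error.

```python
import math

def candidate_settings(n_features, spread=2):
    """Settings to try: sqrt(n_features) plus a few values above and below."""
    start = max(1, round(math.sqrt(n_features)))
    return [k for k in range(start - spread, start + spread + 1)
            if 1 <= k <= n_features]

def pick_best(candidates, oob_error_for):
    """Return the setting whose forest has the lowest Out-of-Bag Error."""
    return min(candidates, key=oob_error_for)

candidates = candidate_settings(16)   # sqrt(16) = 4, so try 2 through 6

# Placeholder numbers for illustration only, NOT measured results.
placeholder_errors = {2: 0.21, 3: 0.18, 4: 0.17, 5: 0.19, 6: 0.22}
best = pick_best(candidates, placeholder_errors.get)
print(candidates, best)
```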