Multi-Class Classification Model for Forest Cover Type Using R Take 2

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Forest Cover Type dataset presents a multi-class classification problem in which we try to predict one of seven possible outcomes.

INTRODUCTION: This experiment tries to predict forest cover type from cartographic variables only. The study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so existing forest cover types are more a result of ecological processes than of forest management practices.

The actual forest cover type for a given observation (30 x 30-meter cell) was determined from the US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from the US Geological Survey (USGS) and the USFS. The data is in raw form (not scaled) and contains binary (0 or 1) columns for the qualitative independent variables (wilderness areas and soil types).
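To make the binary-column layout concrete, here is a minimal Python sketch that collapses one group of 0/1 indicator columns back into a single category number. The function name, the sample row, and the assumption that exactly one flag is set per group are illustrative only; they are not taken from the project's R code.

```python
def decode_indicator(flags):
    """Return the 1-based position of the single 1 in a list of 0/1 flags.

    Assumes (for illustration) that exactly one flag in the group is set.
    """
    if any(f not in (0, 1) for f in flags):
        raise ValueError("flags must be 0 or 1")
    if sum(flags) != 1:
        raise ValueError("expected exactly one active flag in the group")
    return flags.index(1) + 1


# Hypothetical observation: one indicator column per wilderness area.
wilderness_flags = [0, 0, 1, 0]
wilderness_area = decode_indicator(wilderness_flags)
print(wilderness_area)  # 3
```

The same helper would apply to the soil type columns, given the same one-flag-per-group assumption.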

In iteration Take1, we established the baseline accuracy for comparison with future rounds of modeling.

In iteration Take2, we plan to examine the feature selection technique of attribute importance ranking by using the Gradient Boosting algorithm. By selecting the most important attributes, we hope to decrease the modeling time and still maintain a similar level of accuracy when compared to the baseline model.
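One way to picture attribute importance ranking is as a cumulative cutoff: rank the attributes by importance score and keep the top ones until their combined share of the total importance reaches a chosen level. The Python sketch below illustrates that idea with made-up scores. The function name, the attribute names, and the cumulative-share reading of the cutoff are assumptions for illustration; the project itself takes its scores from a Gradient Boosting model in R and may apply the threshold differently.

```python
def select_by_cumulative_importance(importance, cutoff=0.99):
    """Keep top-ranked attributes until their cumulative share reaches `cutoff`."""
    if not importance:
        return []
    total = sum(importance.values())
    if total <= 0:
        raise ValueError("importance scores must sum to a positive value")

    # Rank attributes from most to least important.
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)

    kept, running = [], 0
    for name, score in ranked:
        kept.append(name)
        running += score
        if running / total >= cutoff:
            break
    return kept


# Hypothetical importance scores (not results from this project).
scores = {"attr_a": 600, "attr_b": 250, "attr_c": 145, "attr_d": 4, "attr_e": 1}
selected = select_by_cumulative_importance(scores, cutoff=0.99)
print(selected)  # ['attr_a', 'attr_b', 'attr_c']
```

In this toy case the first three attributes carry 99.5% of the total importance, so the last two are dropped before modeling.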

ANALYSIS: In iteration Take1, the machine learning algorithms achieved an average baseline accuracy of 78.04%. Two algorithms (Random Forest and Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result and achieved an accuracy metric of 85.48%. Using the optimized parameters, the Random Forest algorithm processed the testing dataset with an accuracy of 86.07%, slightly higher than its accuracy on the training data.

In the current iteration, the machine learning algorithms achieved an average accuracy of 74.27%. Random Forest achieved an accuracy metric of 85.47% with the training data and processed the testing dataset with an accuracy of 85.85%, slightly higher than its accuracy on the training data. At the importance level of 99%, the attribute importance technique eliminated 22 of 54 total attributes. The remaining 32 attributes produced a model with accuracy comparable to the baseline model. The modeling time went from 1 hour 19 minutes down to 58 minutes, a reduction of 26.6% (21 of 79 minutes).
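The attribute counts and run times above can be checked with a few lines of arithmetic. The Python sketch below uses only the figures reported in this section; the variable names are illustrative. It prints the time saving both as a share of the baseline run and as a share of the new run, since the two bases give noticeably different percentages.

```python
# Figures reported in the analysis above.
total_attributes = 54
eliminated = 22
remaining = total_attributes - eliminated

baseline_minutes = 1 * 60 + 19   # 1 hour 19 minutes
reduced_minutes = 58
saved_minutes = baseline_minutes - reduced_minutes

share_eliminated = eliminated / total_attributes
reduction_vs_baseline = saved_minutes / baseline_minutes   # relative to the original run
saving_vs_new_time = saved_minutes / reduced_minutes       # relative to the shorter run

print(f"{remaining} attributes kept, {share_eliminated:.1%} eliminated")
print(f"{saved_minutes} minutes saved, {reduction_vs_baseline:.1%} of the baseline run time")
print(f"the same saving equals {saving_vs_new_time:.1%} of the new run time")
```

Measured against the 79-minute baseline, the 21-minute saving is about 26.6%; measured against the 58-minute run, it is about 36.2%.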

CONCLUSION: For this iteration, the Random Forest algorithm achieved the best overall results using the training and testing datasets. For this dataset, Random Forest should be considered for further modeling.

Dataset Used: Covertype Data Set

Dataset ML Model: Multi-Class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Covertype

One source of potential performance benchmarks: https://www.kaggle.com/c/forest-cover-type-prediction/overview

The HTML formatted report can be found here on GitHub.