Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Ames, Iowa housing prices dataset presents a regression problem in which we try to predict the value of a continuous variable.
INTRODUCTION: Many factors can influence a home’s purchase price. This Ames Housing dataset contains 79 explanatory variables describing every aspect of residential homes in Ames, Iowa. The goal is to predict the final price of each home.
In iteration Take1, we established the baseline root mean squared error (RMSE) for further takes of modeling.
In iteration Take2, we converted some of the categorical variables from nominal to ordinal and observed the effects of the change.
In iteration Take3, we examined the feature selection technique of attribute importance ranking by using the Gradient Boosting algorithm. By selecting only the most important attributes, we decreased the processing time and maintained a similar level of RMSE compared to the baseline.
In iteration Take4, we examined the feature selection technique of recursive feature elimination (RFE) by using the Gradient Boosting algorithm. By selecting up to 100 attributes, we decreased the processing time and maintained a similar level of RMSE compared to the baseline.
In iteration Take5, we constructed several Multilayer Perceptron (MLP) models with one, two, and three hidden layers. We also observed how the different model architectures affect the RMSE metric.
In this Take6 iteration, we will add Dropout layers to our Multilayer Perceptron (MLP) models. We will observe how the Dropout layers affect the RMSE metric.
ANALYSIS: In iteration Take1, the baseline performance of the machine learning algorithms achieved an average RMSE of 31,172. Two algorithms (Ridge Regression and Gradient Boosting) achieved the lowest RMSE values after the first round of modeling. After a series of tuning trials, Gradient Boosting turned in the best overall result and achieved an RMSE of 24,165 with the training dataset. Using the optimized parameters, the Gradient Boosting algorithm processed the test dataset with an RMSE of 21,067, which was even lower than the RMSE from the training data.
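Because every iteration is compared on RMSE, a minimal sketch of the metric may help. The function name and the sample prices below are hypothetical and are not taken from the project.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical sale prices and predictions, for illustration only
actual = [200000, 150000, 310000]
predicted = [210000, 140000, 300000]

# Every prediction is off by exactly 10,000, so the RMSE is 10,000
example_rmse = rmse(actual, predicted)
print(example_rmse)
```

Since RMSE is expressed in the same unit as the target, an RMSE of 21,067 can be read roughly as a typical error of about 21,000 dollars per home.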
In iteration Take2, Gradient Boosting achieved an RMSE metric of 23,612 with the training dataset and processed the test dataset with an RMSE of 21,130. Converting the nominal variables to ordinal did not have a material impact on the prediction accuracy in either direction.
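The nominal-to-ordinal conversion can be sketched as a simple mapping from category codes to ranked integers. The column name, the quality codes, and their ordering below are assumptions chosen for illustration; the source does not list which variables were converted or how.

```python
import pandas as pd

# Hypothetical sample; the column name and codes are assumed for illustration
df = pd.DataFrame({"ExterQual": ["Gd", "TA", "Ex", "Fa", "TA"]})

# Assumed ranking from worst (1) to best (5)
quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Replace each category with its rank so the model sees an ordered scale
df["ExterQual_ord"] = df["ExterQual"].map(quality_order)
ordinal_values = df["ExterQual_ord"].tolist()
print(ordinal_values)
```

Compared with one-hot encoding, this produces a single numeric column per variable, which is one plausible reason the attribute count and the accuracy changed so little.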
In iteration Take3, Gradient Boosting achieved an RMSE metric of 24,045 with the training dataset and processed the test dataset with an RMSE of 21,994. At the importance level of 99%, the attribute importance technique eliminated 222 of 258 total attributes. The remaining 36 attributes produced a model that achieved a comparable RMSE to the baseline model. The processing time for Take3 was also reduced by 67.90% compared to the Take1 iteration.
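One way to read the 99% importance level is as a cumulative threshold: rank the attributes by importance and keep the smallest set whose importances sum to at least 0.99. The sketch below follows that reading on synthetic data; the interpretation, the dataset, and all parameter values are assumptions, not the project's actual code.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the housing data (assumption)
X, y = make_regression(n_samples=300, n_features=30, n_informative=5,
                       noise=10.0, random_state=42)

model = GradientBoostingRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# Rank features from most to least important and accumulate their importance
importances = model.feature_importances_
order = np.argsort(importances)[::-1]
cumulative = np.cumsum(importances[order])

# Smallest number of top-ranked features whose importance reaches 99%
n_keep = min(int(np.searchsorted(cumulative, 0.99)) + 1, len(order))
selected = order[:n_keep]
X_reduced = X[:, selected]
print(f"Kept {n_keep} of {X.shape[1]} features")
```

Training on fewer columns is what shortens the processing time in later fits, at the cost of one extra fit to obtain the ranking.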
In iteration Take4, Gradient Boosting achieved an RMSE metric of 23,825 with the training dataset and processed the test dataset with an RMSE of 21,898. The RFE technique eliminated 208 of 258 total attributes. The remaining 50 attributes produced a model that achieved a comparable RMSE to the baseline model. The processing time for Take4 was also reduced by 1.8% compared to the Take1 iteration.
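Recursive feature elimination repeatedly fits the estimator and drops the weakest attributes until a target count remains. The following is a minimal sketch with scikit-learn's RFE on synthetic data; the dataset size, the target of 5 features, and the step size are assumptions for illustration, not the settings used in the project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFE

# Synthetic stand-in for the housing data (assumption)
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=42)

estimator = GradientBoostingRegressor(n_estimators=50, random_state=42)

# Drop 5 features per round until only 5 remain (20 -> 15 -> 10 -> 5)
selector = RFE(estimator, n_features_to_select=5, step=5)
selector.fit(X, y)

X_reduced = selector.transform(X)
kept_mask = selector.support_  # boolean mask of the retained features
print(X_reduced.shape)
```

Because RFE refits the model at every elimination round, the selection step itself is expensive, which may explain why the overall time saving is much smaller than with a single importance ranking.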
In iteration Take5, all models processed the test dataset and produced an RMSE of around 23,000. The two-layer model with 128 and 64 nodes (Model 2C) achieved the best RMSE of 22,708 using the test dataset. All models eventually overfit, and the models with more layers overfit much faster than the simpler models.
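As a rough illustration of a two-hidden-layer network with 128 and 64 nodes, the sketch below uses scikit-learn's MLPRegressor as a stand-in. The source does not state which framework was used, so the library, the synthetic data, and every hyperparameter other than the layer sizes are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (assumption)
X, y = make_regression(n_samples=400, n_features=10, noise=15.0, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=7)

# Neural networks train more reliably on standardized inputs
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Two hidden layers with 128 and 64 nodes
mlp = MLPRegressor(hidden_layer_sizes=(128, 64), max_iter=300, random_state=7)
mlp.fit(X_train_s, y_train)

train_rmse = float(np.sqrt(mean_squared_error(y_train, mlp.predict(X_train_s))))
test_rmse = float(np.sqrt(mean_squared_error(y_test, mlp.predict(X_test_s))))
layer_shapes = [w.shape for w in mlp.coefs_]
print(layer_shapes, train_rmse, test_rmse)
```

A test RMSE that stops improving or rises while the training RMSE keeps falling is the usual sign of the overfitting described above.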
In this Take6 iteration, all models again processed the test dataset and produced an RMSE of around 23,000. All models eventually overfit, but the Dropout layers can help reduce the overfitting.
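To show what a Dropout layer does to a hidden layer's activations, here is a minimal NumPy sketch of inverted dropout. It is not the project's model code: deep learning frameworks typically provide dropout as a built-in layer, and the function name, the rate of 0.25, and the array shape below are assumptions for illustration.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of units, rescale the rest."""
    if not training or rate == 0.0:
        # At inference time the activations pass through unchanged
        return activations
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation the same as without dropout
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((1000, 64))  # hypothetical hidden-layer output

dropped = dropout(hidden, rate=0.25, rng=rng)
inference = dropout(hidden, rate=0.25, rng=rng, training=False)
print(dropped.mean(), inference.mean())
```

Because a different random subset of units is silenced on each training step, the network cannot rely on any single unit, which is why dropout tends to slow the onset of overfitting.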
CONCLUSION: For this iteration, the addition of Dropout layers produced similar RMSEs for all models. For this dataset, we should consider experimenting with more regularization techniques.
Dataset Used: Kaggle Competition – House Prices: Advanced Regression Techniques
Dataset ML Model: Regression with numerical and categorical attributes
Dataset Reference: https://ww2.amstat.org/publications/jse/v19n3/decock.pdf
One potential source of performance benchmarks: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
The HTML formatted report can be found here on GitHub.