Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
SUMMARY: The purpose of this project is to construct a
prediction model using various machine learning algorithms and to document the
end-to-end steps using a template. The Ames, Iowa Housing Prices dataset presents a
regression problem in which we are trying to predict the value of a continuous variable.
INTRODUCTION: Many factors can influence a home’s purchase
price. This Ames Housing dataset contains 79 explanatory variables describing
every aspect of residential homes in Ames, Iowa. The goal is to predict the
final price of each home.
In iteration Take1, we established the baseline root mean squared
error (RMSE) for the later modeling iterations.
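The RMSE figures quoted in this post are the square root of the mean squared error, so they are expressed in the same units as the sale price. A minimal sketch of the calculation follows; the prices are made-up numbers for illustration, not values from the Ames dataset.

```python
import numpy as np

# Hypothetical sale prices and predictions (illustrative values only).
y_true = np.array([200000.0, 150000.0, 300000.0])
y_pred = np.array([210000.0, 140000.0, 320000.0])

# Mean squared error: average of the squared residuals.
mse = np.mean((y_true - y_pred) ** 2)

# RMSE: square root of the MSE, back in price units.
rmse = np.sqrt(mse)
print(f"MSE = {mse:.0f}, RMSE = {rmse:.2f}")
```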
In iteration Take2, we converted some of the categorical
variables from nominal to ordinal and observed the effects of the change.
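As a rough illustration of the nominal-to-ordinal conversion, the sketch below encodes one quality rating both ways with pandas. The choice of the ExterQual column and the ranking Po < Fa < TA < Gd < Ex are assumptions made for the example; the post does not list which variables were converted or how they were ranked.

```python
import pandas as pd

# Hypothetical sample of a quality rating column (illustrative rows only).
df = pd.DataFrame({"ExterQual": ["TA", "Gd", "Ex", "Fa", "TA"]})

# Assumed ranking from worst to best: Po < Fa < TA < Gd < Ex.
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]
quality_rank = {level: rank for rank, level in enumerate(quality_order)}

# Ordinal encoding: a single integer column that preserves the ranking.
df["ExterQual_ordinal"] = df["ExterQual"].map(quality_rank)

# Nominal (one-hot) encoding for comparison: one column per observed level.
one_hot = pd.get_dummies(df["ExterQual"], prefix="ExterQual")

print(df)
print(one_hot.columns.tolist())
```

The ordinal form uses one column per variable instead of one column per level, which is one reason the total attribute count can change between iterations.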
In iteration Take3, we examined the feature selection
technique of attribute importance ranking by using the Gradient Boosting
algorithm. By selecting only the most important attributes, we decreased the
processing time and maintained a similar level of RMSE compared to the baseline.
In this iteration (Take4), we will examine the feature selection
technique of recursive feature elimination (RFE) by using the Gradient Boosting
algorithm. By selecting up to 50 attributes, we hope to decrease the processing
time and maintain a similar level of RMSE compared to the baseline.
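For readers unfamiliar with RFE, here is a minimal sketch using scikit-learn's `RFE` wrapper around `GradientBoostingRegressor`. It runs on synthetic data from `make_regression` rather than the housing data, and the feature counts, `step` size, and estimator settings are assumptions chosen to keep the example small; they are not the settings used in this project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFE

# Synthetic stand-in for the encoded housing data (not the real dataset).
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=7)

# RFE repeatedly fits the estimator and drops the least important features
# (here 5 per round) until only n_features_to_select remain.
estimator = GradientBoostingRegressor(n_estimators=50, random_state=7)
selector = RFE(estimator, n_features_to_select=5, step=5)
selector.fit(X, y)

# Keep only the surviving columns and list their indices.
X_reduced = selector.transform(X)
selected_columns = [i for i, keep in enumerate(selector.support_) if keep]
print(X_reduced.shape, selected_columns)
```

Because RFE refits the model once per elimination round, the selection step itself can be expensive, which is worth keeping in mind when comparing processing times across iterations.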
ANALYSIS: The baseline performance of the machine learning
algorithms achieved an average RMSE of 31,172. Two algorithms (Ridge Regression
and Gradient Boosting) achieved the lowest RMSE values after the first round of
modeling. After a series of tuning trials, Gradient Boosting turned in the best
overall result and achieved an RMSE of 24,165. Using the optimized
parameters, the Gradient Boosting algorithm processed the test dataset with an
RMSE of 21,067, which was even better than the result from the training dataset.
From iteration Take2, Gradient Boosting achieved an RMSE
metric of 23,612 with the training dataset and processed the test dataset with
an RMSE of 21,130. Converting the nominal variables to ordinal did not have a
material impact on the prediction accuracy in either direction.
From iteration Take3, Gradient Boosting achieved an RMSE
metric of 24,045 with the training dataset and processed the test dataset with
an RMSE of 21,994. At the importance level of 99%, the attribute importance
technique eliminated 222 of 258 total attributes. The remaining 36 attributes
produced a model that achieved a comparable RMSE to the baseline model. The processing
time for Take3 was also reduced by 67.90% compared to the Take1 iteration.
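One way to read "importance level of 99%" is as a cumulative threshold: rank the attributes by importance and keep the smallest set whose importances add up to at least 99% of the total. The sketch below shows that interpretation on synthetic data; the interpretation itself, the data, and the estimator settings are assumptions, not a record of the exact procedure used in Take3.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the encoded housing data (not the real dataset).
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=7)

model = GradientBoostingRegressor(n_estimators=50, random_state=7)
model.fit(X, y)

# Importances are normalized to sum to 1.0; sort them from high to low.
importances = model.feature_importances_
order = np.argsort(importances)[::-1]
cumulative = np.cumsum(importances[order])

# Keep the smallest prefix whose cumulative importance reaches the threshold.
threshold = 0.99
n_keep = min(int(np.searchsorted(cumulative, threshold)) + 1, len(order))
keep_idx = np.sort(order[:n_keep])
X_reduced = X[:, keep_idx]

print(f"Kept {n_keep} of {X.shape[1]} attributes")
```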
From iteration Take4, Gradient Boosting achieved an RMSE
metric of 23,825 with the training dataset and processed the test dataset with
an RMSE of 21,898. The RFE technique eliminated 208 of 258 total attributes.
The remaining 50 attributes produced a model that achieved a comparable RMSE to
the baseline model. The processing time for Take4 was also reduced by 1.8% compared
to the Take1 iteration.
CONCLUSION: For this iteration, the Gradient Boosting
algorithm achieved the best overall results using the training and testing
datasets. For this dataset, Gradient Boosting should be considered for further modeling.
Dataset Used: Kaggle Competition – House Prices: Advanced Regression Techniques
Dataset ML Model: Regression with numerical and categorical attributes
One potential source of performance benchmarks:
The HTML formatted report can be found here on GitHub.