Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Superconductivity Temperature dataset presents a regression problem, where the goal is to predict the value of a continuous variable.
INTRODUCTION: The research team wishes to create a statistical model for predicting the superconducting critical temperature based on the features extracted from the superconductor’s chemical formula. The study also seeks to identify the features that contribute the most to the model’s predictive accuracy.
In iteration Take1, we established the baseline root mean squared error (RMSE) for comparison with future rounds of modeling.
In iteration Take2, we examined the feature selection technique of attribute importance ranking by using the Gradient Boosting algorithm. By selecting only the most important attributes, we decreased the modeling time and still maintained a similar level of RMSE compared to the baseline model.
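The attribute importance ranking described above can be sketched as follows. This is a minimal illustration, not the project's actual script: it assumes scikit-learn, uses synthetic data in place of the superconductivity dataset, and interprets an "importance level" as the cumulative share of total importance to retain. The helper name `select_by_cumulative_importance` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor


def select_by_cumulative_importance(importances, level=0.99):
    """Return sorted indices of the top-ranked features whose
    importances add up to at least `level` of the total."""
    importances = np.asarray(importances, dtype=float)
    order = np.argsort(importances)[::-1]  # most important first
    cumulative = np.cumsum(importances[order]) / importances.sum()
    # First position where the running share reaches the level.
    n_keep = int(np.searchsorted(cumulative, level)) + 1
    n_keep = min(n_keep, len(order))  # guard against float round-off
    return np.sort(order[:n_keep])


# Synthetic stand-in for the real data (assumption for illustration).
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=5.0, random_state=7)

model = GradientBoostingRegressor(n_estimators=50, random_state=7).fit(X, y)
kept = select_by_cumulative_importance(model.feature_importances_, level=0.99)
X_reduced = X[:, kept]  # only the retained attributes go on to modeling
print(f"kept {len(kept)} of {X.shape[1]} attributes")
```

Ranking by cumulative importance keeps the cut-off data-driven: the number of attributes retained depends on how concentrated the importances are rather than on a fixed count.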
In this iteration, we will examine the feature selection technique of recursive feature elimination (RFE) by using the Bagged Trees algorithm. By selecting no more than 50 attributes, we hope to maintain a similar level of RMSE compared to the baseline model.
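A rough sketch of recursive feature elimination with a bagged-trees estimator is shown below. It assumes scikit-learn (version 0.24 or later, for the `importance_getter` argument of `RFE`) and synthetic data; the feature counts are scaled down for speed and are not the project's settings. Because `BaggingRegressor` does not expose `feature_importances_` directly, the hypothetical helper `bagged_importances` averages the importances of the individual trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.feature_selection import RFE


def bagged_importances(fitted_bagger):
    # Average the per-tree importances. This is valid while every tree
    # sees all features (the BaggingRegressor default, max_features=1.0).
    return np.mean(
        [tree.feature_importances_ for tree in fitted_bagger.estimators_],
        axis=0,
    )


# Synthetic stand-in for the real data (assumption for illustration).
X, y = make_regression(n_samples=200, n_features=12, n_informative=4,
                       noise=5.0, random_state=7)

# The default base learner of BaggingRegressor is a decision tree.
bagger = BaggingRegressor(n_estimators=10, random_state=7)

# Drop the two weakest features per round until six remain.
selector = RFE(bagger, n_features_to_select=6, step=2,
               importance_getter=bagged_importances)
selector.fit(X, y)

kept = np.flatnonzero(selector.support_)
X_reduced = selector.transform(X)
print("retained feature indices:", kept.tolist())
```

For the scenario described above, `n_features_to_select` would be set to the attribute cap (for example, 50) on the 81-attribute dataset.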
ANALYSIS: The baseline performance of the machine learning algorithms achieved an average RMSE of 16.33. Two algorithms (Random Forest and Gradient Boosting) achieved the lowest RMSE values after the first round of modeling. After a series of tuning trials, Random Forest turned in the best overall result and achieved an RMSE of 9.72. Using the optimized parameters, the Random Forest algorithm processed the test dataset with an RMSE of 9.40, which was slightly better than the result on the training data.
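For reference, the RMSE metric quoted throughout is the square root of the mean squared difference between actual and predicted values. The sketch below uses only the standard library; the sample values are made up for illustration and are not drawn from the dataset.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical critical temperatures and predictions.
actual = [10.0, 20.0, 30.0]
predicted = [13.0, 16.0, 30.0]
print(f"RMSE: {rmse(actual, predicted):.3f}")
```

Because RMSE is expressed in the same units as the target, an RMSE of 9.40 can be read as a typical prediction error on the scale of the critical temperature itself.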
From iteration Take2, the average performance of the machine learning algorithms achieved an RMSE of 16.40. Random Forest achieved an RMSE metric of 9.73 with the training dataset and processed the test dataset with an RMSE of 9.39. At the importance level of 99%, the attribute importance technique eliminated 10 of 81 total attributes. The remaining 71 attributes produced a model that achieved a comparable RMSE to the baseline model. The modeling time went from 6 hours 26 minutes down to 5 hours 50 minutes, a saving of 9.3%.
From iteration Take3, the average performance of the machine learning algorithms achieved an RMSE of 16.63. Random Forest achieved an RMSE metric of 9.77 with the training dataset and processed the test dataset with an RMSE of 9.43. With the limit of 50 attributes, the RFE technique eliminated 35 of 81 total attributes. The remaining 46 attributes produced a model that achieved a comparable RMSE to the baseline model. The modeling time went from 6 hours 26 minutes down to 3 hours 51 minutes, a saving of 40.2%.
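The time savings quoted for Take2 and Take3 follow from simple arithmetic on the reported modeling times. A small sketch, with hypothetical helper names, makes the calculation explicit:

```python
def to_minutes(hours, minutes):
    """Convert an hours-and-minutes duration to total minutes."""
    return hours * 60 + minutes


def percent_saving(baseline_minutes, new_minutes):
    """Relative reduction in run time, as a percentage of the baseline."""
    return 100.0 * (baseline_minutes - new_minutes) / baseline_minutes


baseline = to_minutes(6, 26)  # Take1 modeling time
take2 = to_minutes(5, 50)     # after attribute importance ranking
take3 = to_minutes(3, 51)     # after recursive feature elimination

print(f"Take2 saving: {percent_saving(baseline, take2):.1f}%")
print(f"Take3 saving: {percent_saving(baseline, take3):.1f}%")
```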
CONCLUSION: For this iteration, the Random Forest algorithm achieved the best overall results using the training and testing datasets. For this dataset, Random Forest should be considered for further modeling.
Dataset Used: Superconductivity Data Set
Dataset ML Model: Regression with numerical attributes
Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data
One potential source of performance benchmarks: https://doi.org/10.1016/j.commatsci.2018.07.052
The HTML-formatted report is available on GitHub.