Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
Dataset Used: Bank Marketing Dataset
Dataset ML Model: Binary classification with numerical and categorical attributes
Dataset Reference: http://archive.ics.uci.edu/ml/datasets/bank+marketing
One source of potential performance benchmarks: https://www.kaggle.com/rouseguy/bankbalanced
INTRODUCTION: The Bank Marketing dataset involves predicting whether bank clients will subscribe (yes/no) to a term deposit (the target variable). It is a binary (2-class) classification problem. There are over 45,000 observations with 16 input variables and 1 output variable. There are no missing values in the dataset.
CONCLUSION: The take No.2 version of this banking dataset project tests how removing one attribute from the dataset affects model performance. You can see the results from the take No.1 here on GitHub.
The attribute removed was “duration”. According to the dataset documentation, this attribute highly affects the output target (e.g., if duration=0 then y=“no”). However, the duration is not known before a call is performed, and once the call ends the target value is already known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
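Dropping the attribute before modeling could look like the minimal sketch below. It assumes a semicolon-delimited CSV with a header row, as in the UCI distribution of the dataset. The three sample rows (showing only a subset of the columns) and the helper name `drop_attribute` are made up for illustration and are not taken from the project code.

```python
import csv
import io

# Hypothetical sample rows with a subset of the columns;
# the real file has 16 input attributes plus the target "y".
SAMPLE = """age;job;duration;campaign;y
58;management;261;1;no
44;technician;151;1;no
33;entrepreneur;0;2;no
"""


def drop_attribute(rows, name):
    """Return copies of the row dicts without the given attribute."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]


# Parse the sample text as if it were the dataset file.
rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter=";"))

# Remove "duration" so the model only sees inputs known before the call.
reduced = drop_attribute(rows, "duration")

print(list(reduced[0].keys()))  # ['age', 'job', 'campaign', 'y']
```

The original rows are left untouched, so the same data can still be used with “duration” included for a benchmark run.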
The baseline performance of the ten algorithms achieved an average accuracy of 87.68% (vs. 89.13% from the take No.1). Three algorithms (Linear Regression, AdaBoost, and Stochastic Gradient Boosting) achieved the top accuracy and Kappa scores during the initial modeling round. After a series of tuning trials with these three algorithms, Stochastic Gradient Boosting (SGB) achieved the top result, with an average accuracy of 89.49% (vs. 91.00% from the take No.1) using the training data.
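For readers unfamiliar with the two metrics, the sketch below computes accuracy and Kappa from scratch. It assumes that "Kappa" here means Cohen's kappa, (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The labels are invented for illustration and are not drawn from the dataset or the project results.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e).

    Undefined (division by zero) when chance agreement p_e equals 1.
    """
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Made-up labels: 8 "no" and 2 "yes".
y_true = ["no"] * 8 + ["yes"] * 2

# A model that finds one of the two "yes" cases.
y_model = ["no"] * 8 + ["yes", "no"]

# A trivial model that always predicts the majority class.
y_majority = ["no"] * 10

print(round(accuracy(y_true, y_model), 2), round(cohen_kappa(y_true, y_model), 2))
print(round(accuracy(y_true, y_majority), 2), round(cohen_kappa(y_true, y_majority), 2))
```

In this toy case the always-"no" predictor still reaches 80% accuracy but a Kappa of 0, which is why reporting Kappa alongside accuracy is useful when the classes are imbalanced.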
SGB also achieved an accuracy of 89.21% on the validation dataset (vs. 90.58% from the take No.1). For this project, the Stochastic Gradient Boosting ensemble algorithm yielded consistently top-notch training and validation results, which warrant the additional processing required by the algorithm. Eliminating the “duration” attribute lowered accuracy by only about 1.4 to 1.5 percentage points at each stage, so it did not seem to have a substantial adverse effect on the overall accuracy of the prediction models.
The HTML formatted report can be found here on GitHub.