Binary Classification Model for Bank Marketing Using Python, Take 3

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

Dataset Used: Bank Marketing Dataset

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: http://archive.ics.uci.edu/ml/datasets/bank+marketing

One source of potential performance benchmarks: https://www.kaggle.com/rouseguy/bankbalanced

INTRODUCTION: The Bank Marketing dataset involves predicting whether a bank client will subscribe (yes/no) to a term deposit (target variable). It is a binary (2-class) classification problem. There are over 41,000 observations with 19 input variables and 1 output variable. There are no missing values within the dataset. This dataset is based on the “Bank Marketing” UCI dataset and is enriched with five new social and economic attributes. Apart from those additions, it is almost identical to the original dataset.

CONCLUSION: The take No.3 version of this banking dataset project examines the effect of adding five social and economic attributes to the dataset. You can see the results from the take No.2 here on GitHub.

The baseline performance of the ten algorithms achieved an average accuracy of 88.32% (vs. 87.68% from the take No.2 version). Three algorithms (Logistic Regression, AdaBoost, and Stochastic Gradient Boosting) achieved the top accuracy and Kappa scores during the initial modeling round. After a series of tuning trials with these three algorithms, Stochastic Gradient Boosting achieved the top accuracy/Kappa result using the training data. It produced an average accuracy of 90.06% (vs. 89.49% from the take No.2 version) using the training data.

Stochastic Gradient Boosting also processed the validation dataset with an accuracy of 90.25%, which was sufficiently close to the training result. For this project, the Stochastic Gradient Boosting ensemble algorithm yielded consistently top-notch training and validation results, which warrant the additional processing required by the algorithm. The addition of the social and economic attributes did not seem to have a substantial effect on the overall accuracy of the prediction models.
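The workflow described above — spot-checking several classifiers with cross-validation and comparing both accuracy and Kappa — can be sketched in scikit-learn. This is an illustrative sketch, not the project's actual script; the synthetic data stands in for the real Bank Marketing file, and the three models shown are the top performers named above:

```python
# Sketch: spot-check classifiers with 10-fold cross-validation,
# scoring both accuracy and Cohen's kappa (synthetic stand-in data).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=500, n_features=19, random_state=7)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "AB": AdaBoostClassifier(random_state=7),
    "GBM": GradientBoostingClassifier(random_state=7),
}
scoring = {"acc": "accuracy", "kappa": make_scorer(cohen_kappa_score)}

for name, model in models.items():
    cv = cross_validate(model, X, y, cv=10, scoring=scoring)
    print(f"{name}: accuracy={cv['test_acc'].mean():.4f} "
          f"kappa={cv['test_kappa'].mean():.4f}")
```

The models with the best mean accuracy and Kappa from a round like this would then move on to hyperparameter tuning, as described above.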

The HTML formatted report can be found here on GitHub.

Binary Classification Model for Bank Marketing Using R, Take 2

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

Dataset Used: Bank Marketing Data Set

Data Set ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: http://archive.ics.uci.edu/ml/datasets/bank+marketing

One source of potential performance benchmarks: https://www.kaggle.com/rouseguy/bankbalanced

INTRODUCTION: The Bank Marketing dataset involves predicting whether a bank client will subscribe (yes/no) to a term deposit (target variable). It is a binary (2-class) classification problem. There are over 45,000 observations with 16 input variables and 1 output variable. There are no missing values within the dataset.

CONCLUSION: The take No.2 version of this banking dataset project examines the effect of removing one attribute from the dataset. You can see the results from the take No.1 here on GitHub.

The attribute removed was “duration”. According to the dataset documentation, this attribute highly affects the output target (e.g., if duration=0, then y=“no”). However, the duration is not known before a call is performed, and after the call ends, the target variable is naturally known. Thus, this input should be included only for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
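The project itself is in R, but the idea of discarding the benchmark-only attribute before modeling can be sketched in Python with pandas. The tiny frame and its column names below are placeholders, not the real bank CSV:

```python
# Sketch: exclude the "duration" attribute, which is only known
# after a call ends, so the model sees only pre-call information.
import pandas as pd

df = pd.DataFrame({
    "age": [30, 41, 55],
    "duration": [0, 120, 300],   # known only after the call ends
    "y": ["no", "yes", "yes"],
})

X = df.drop(columns=["duration", "y"])
y = df["y"]
print(list(X.columns))  # → ['age']
```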

The baseline performance of the seven algorithms achieved an average accuracy of 89.22% (vs. 89.99% from the take No.1). Three algorithms (Bagged CART, Random Forest, and Stochastic Gradient Boosting) achieved the top accuracy and Kappa scores during the initial modeling round. After a series of tuning trials with these three algorithms, Stochastic Gradient Boosting achieved the top accuracy/Kappa result using the training data. It produced an average accuracy of 89.46% (vs. 90.63% from the take No.1) using the training data.

Stochastic Gradient Boosting also processed the validation dataset with an accuracy of 89.18%, which was sufficiently close to the training result. For this project, the Stochastic Gradient Boosting ensemble algorithm yielded consistently top-notch training and validation results, which warrant the additional processing required by the algorithm. The elimination of the “duration” attribute did not seem to have a substantial adverse effect on the overall accuracy of the prediction models.

The HTML formatted report can be found here on GitHub.

Simple Classification Model for Diabetes Prediction Using R

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

For more information on this case study project, please consult Dr. Brownlee’s blog post at https://machinelearningmastery.com/standard-machine-learning-datasets/.

Dataset Used: Pima Indians Diabetes Database

Data Set ML Model: Classification with numerical attributes

Dataset Reference: https://www.kaggle.com/uciml/pima-indians-diabetes-database

For more information on performance benchmarks, please consult: https://www.kaggle.com/uciml/pima-indians-diabetes-database

INTRODUCTION: The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details. It is a binary (2-class) classification problem. There are 768 observations with 8 input variables and 1 output variable. Missing values are believed to be encoded with zero values.
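One common way to handle the zero-encoded missing values noted above is to recode zeros as NaN and impute them — an illustrative preprocessing sketch, not necessarily the exact steps used in this project. The column names follow the Kaggle copy of the dataset, and the toy frame stands in for the real CSV:

```python
# Sketch: treat physiologically implausible zeros as missing values
# and impute them with the column median.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "BloodPressure": [72, 66, 0],
    "BMI": [33.6, 26.6, 23.3],
})

cols = ["Glucose", "BloodPressure"]
df[cols] = df[cols].replace(0, np.nan)   # zero readings -> missing
df[cols] = df[cols].fillna(df[cols].median())
print(df)
```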

CONCLUSION: The baseline performance of predicting the class variable achieved an average accuracy of 75.85%. The top accuracy result, achieved via Logistic Regression, was 77.73% after a series of tuning trials. The ensemble algorithms, in this case, did not outperform the non-ensemble algorithms by enough to justify the additional processing required.

The HTML formatted report can be found here on GitHub.

Simple Classification Model for Glass Type Using R

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

For more information on this case study project, please consult Dr. Brownlee’s blog post at https://machinelearningmastery.com/standard-machine-learning-datasets/.

Dataset Used: Glass Identification Data Set

Data Set ML Model: Classification with real number attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Glass+Identification

For more information on this case study project and performance benchmarks, please consult: https://www.kaggle.com/uciml/glass

The Glass Identification dataset involves predicting one of six types of glass, defined by their oxide content (i.e., Na, Fe, K, and so forth). Criminological investigation motivated the study of glass-type classification: glass left at the scene of a crime can be used as evidence, if it is correctly identified!

CONCLUSION: The baseline performance of predicting the class variable achieved an average accuracy of 71.45%. The top accuracy result, achieved via Random Forest, was 80.11% after a series of tuning trials. The ensemble algorithm, in this case, outperformed the non-ensemble algorithms by enough to justify the additional processing and tuning.
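The project itself uses R, where caret typically tunes a Random Forest over mtry (the number of features tried at each split). A rough Python analogue of that tuning idea, purely as a sketch on synthetic six-class data rather than the real glass measurements:

```python
# Sketch: grid-search max_features (the scikit-learn analogue of
# caret's mtry) for a Random Forest, with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=9, n_informative=6,
                           n_classes=6, random_state=7)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=7),
    param_grid={"max_features": [1, 2, 3, 4]},
    cv=5, scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, f"{grid.best_score_:.4f}")
```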

The HTML formatted report can be found here on GitHub.

Simple Regression Model for Predicting Abalone Age Using R

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

For more information on this case study project, please consult Dr. Brownlee’s blog post at https://machinelearningmastery.com/standard-machine-learning-datasets/.

Dataset Used: Abalone Data Set

Data Set ML Model: Regression with Categorical, Integer, Real attributes

Dataset Reference: http://archive.ics.uci.edu/ml/datasets/Abalone

The Abalone Dataset involves predicting the age of abalone given objective measures of individuals. Although it was presented as a multi-class classification problem, this exercise will frame it using regression. The baseline performance of predicting the mean value is an RMSE of approximately 3.2 rings.
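The mean-prediction baseline mentioned above can be illustrated with a short sketch; the synthetic ring counts below stand in for the real Abalone measurements, so the exact RMSE differs from the 3.2 quoted:

```python
# Sketch: baseline RMSE when predicting the mean ring count everywhere.
# For a mean prediction, the RMSE equals the standard deviation of the target.
import numpy as np

rng = np.random.default_rng(7)
rings = rng.normal(loc=10, scale=3.2, size=1000)  # stand-in target

baseline = rings.mean()                     # predict the mean everywhere
rmse = np.sqrt(np.mean((rings - baseline) ** 2))
print(f"baseline RMSE: {rmse:.2f}")
```

Any model worth keeping should beat this number; it is the floor that the tuned algorithms below are compared against.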

CONCLUSION: The baseline performance of the modeling algorithms achieved an average RMSE of approximately 2.28 rings. The top RMSE result, achieved via SVM, was 2.13 rings after a series of tuning trials. The ensemble algorithms did not outperform SVM by enough to justify the additional processing and tuning necessary.

The purpose of this project is to analyze a dataset using various machine learning algorithms and to document the steps using a template. The project aims to touch on the following areas:

  • Document a regression predictive modeling problem end-to-end.
  • Explore data transformation options for improving model performance.
  • Explore algorithm tuning techniques for improving model performance.
  • Explore using and tuning ensemble methods for improving model performance.

The HTML formatted report can be found here on GitHub.

Ensemble Classification Model for the Sonar Dataset with R

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

For more information on this case study project, please consult Dr. Brownlee’s blog post at https://machinelearningmastery.com/standard-machine-learning-datasets/.

Dataset Used: Connectionist Bench (Sonar, Mines vs. Rocks) Data Set

ML Model: Classification, numeric inputs

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+%28Sonar%2C+Mines+vs.+Rocks%29

The Sonar Dataset involves the prediction of whether or not an object is a mine or a rock given the strength of sonar returns at different angles. It is a binary (2-class) classification problem.

CONCLUSION: The baseline performance of predicting the most prevalent class achieved an accuracy of approximately 76.0%. The top accuracy result, achieved via SVM, was approximately 85.06% after a series of tuning trials. The Random Forest ensemble algorithm, also after tuning, yielded an accuracy of 85.09%. The improvement of Random Forest over SVM was too slight to justify the additional processing and tuning required by the ensemble algorithm.

The purpose of this project is to analyze a dataset using various machine learning algorithms and to document the steps using a template. The project aims to touch on the following areas:

  • Document a classification predictive modeling problem end-to-end.
  • Explore data transformation options for improving model performance.
  • Explore algorithm tuning techniques for improving model performance.
  • Explore using and tuning ensemble methods for improving model performance.

The HTML formatted report can be found here on GitHub.