Binary Classification Model for Diabetes 130-US Hospitals Using Python and XGBoost

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: This project aims to construct a predictive model using various machine learning algorithms and document the end-to-end steps using a template. The Diabetes 130-US Hospitals dataset is a binary classification situation where we attempt to predict one of the two possible outcomes.

INTRODUCTION: The dataset is the Diabetes 130-US Hospitals for Years 1999-2008 dataset, donated to the University of California, Irvine (UCI) Machine Learning Repository. It represents ten years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks and includes over 50 features representing patient and hospital outcomes.

ANALYSIS: The performance of the preliminary XGBoost model achieved an accuracy benchmark of 63.72%. After a series of tuning trials, the refined XGBoost model processed the training dataset with a final accuracy score of 64.94%. When we processed the test dataset with the final model, the model achieved an accuracy score of 65.39%.
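
For readers who want to see the mechanics behind a benchmark like this, here is a minimal sketch using XGBoost’s scikit-learn wrapper and 10-fold cross-validation. The file name, target binarization, and one-hot encoding below are assumptions for illustration, not the project’s exact preprocessing.

```python
# A minimal sketch of producing a baseline accuracy benchmark with
# XGBoost's scikit-learn wrapper. The file name, target binarization,
# and one-hot encoding are assumptions, not this project's exact steps.
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

df = pd.read_csv("diabetic_data.csv")                # UCI download, assumed name
y = (df["readmitted"] != "NO").astype(int)           # binarize: readmitted or not
X = pd.get_dummies(df.drop(columns=["readmitted"]))  # one-hot encode categoricals

model = XGBClassifier(n_estimators=100, eval_metric="logloss")
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"Baseline accuracy: {scores.mean():.4f}")
```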

CONCLUSION: In this iteration, the XGBoost model appeared to be a suitable algorithm for modeling this dataset.

Dataset Used: Diabetes 130-US Hospitals for years 1999-2008 Dataset

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://archive-beta.ics.uci.edu/ml/datasets/296

One potential source of performance benchmarks: http://www.hindawi.com/journals/bmri/2014/781670/

The HTML formatted report can be found here on GitHub.

Binary Classification Model for Kaggle Tabular Playground Series 2021 Apr Using Python and XGBoost

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: This project aims to construct a predictive model using various machine learning algorithms and document the end-to-end steps using a template. The Kaggle Tabular Playground Apr 2021 dataset is a binary classification situation where we attempt to predict one of the two possible outcomes.

INTRODUCTION: Kaggle wants to provide an approachable environment for people who are relatively early in their data science journey. Since January 2021, they have hosted playground-style competitions on Kaggle with fun but less complex, tabular datasets. The dataset used for this competition is synthetic but based on the real Titanic dataset and generated using a CTGAN. The statistical properties of this dataset are very similar to the original Titanic dataset, but there is no shortcut to cheat by using public labels for predictions.

ANALYSIS: The performance of the preliminary XGBoost model achieved an accuracy benchmark of 0.7707. After a series of tuning trials, the refined XGBoost model processed the training dataset with a final accuracy score of 0.7725. When we applied the final model to Kaggle’s test dataset, the model achieved an accuracy score of 0.7832.

CONCLUSION: In this iteration, the XGBoost model appeared to be a suitable algorithm for modeling this dataset.

Dataset Used: Kaggle Tabular Playground Series 2021 Apr Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://www.kaggle.com/c/tabular-playground-series-apr-2021

One potential source of performance benchmarks: https://www.kaggle.com/c/tabular-playground-series-apr-2021/leaderboard

The HTML formatted report can be found here on GitHub.

Binary Classification Model for Kaggle Tabular Playground Series 2021 Mar Using Python and XGBoost

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: This project aims to construct a predictive model using various machine learning algorithms and document the end-to-end steps using a template. The Kaggle Tabular Playground Mar 2021 dataset is a binary classification situation where we attempt to predict one of the two possible outcomes.

INTRODUCTION: Kaggle wants to provide an approachable environment for people who are relatively early in their data science journey. Since January 2021, they have hosted playground-style competitions on Kaggle with fun but less complex, tabular datasets. The dataset may be synthetic but is based on a real dataset and generated using a CTGAN. The original dataset tries to predict the amount of an insurance claim. Although the features are anonymized, they have properties relating to real-world features.

ANALYSIS: The performance of the preliminary XGBoost model achieved a ROC benchmark of 0.8813. After a series of tuning trials, the refined XGBoost model processed the training dataset with a final ROC score of 0.8936. When we applied the final model to Kaggle’s test dataset, the model achieved a ROC score of 0.8946.
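
The ROC score here refers to the area under the ROC curve (AUC). As a rough illustration of how such a score is computed from class probabilities on a held-out split, here is a minimal sketch; the synthetic data merely stands in for the competition’s prepared features.

```python
# Sketch: scoring an XGBoost classifier by ROC AUC on a validation split.
# Synthetic data is a stand-in for the competition's prepared features.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=7)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7)

model = XGBClassifier(n_estimators=200, eval_metric="auc")
model.fit(X_train, y_train)

# ROC AUC is computed from the positive-class probability, not hard labels
proba = model.predict_proba(X_val)[:, 1]
print(f"Validation ROC AUC: {roc_auc_score(y_val, proba):.4f}")
```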

CONCLUSION: In this iteration, the XGBoost model appeared to be a suitable algorithm for modeling this dataset.

Dataset Used: Kaggle Tabular Playground Series 2021 Mar Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://www.kaggle.com/c/tabular-playground-series-mar-2021

One potential source of performance benchmarks: https://www.kaggle.com/c/tabular-playground-series-mar-2021/leaderboard

The HTML formatted report can be found here on GitHub.

Regression Model for Kaggle Tabular Playground Series 2021 Jan Using Python and XGBoost

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: This project aims to construct a predictive model using various machine learning algorithms and document the end-to-end steps using a template. The Kaggle Tabular Playground Series 2021 Jan dataset is a regression situation where we are trying to predict the value of a continuous variable.

INTRODUCTION: Kaggle wants to provide an approachable environment for people who are relatively early in their data science journey. Since January 2021, they have been hosting playground-style competitions on Kaggle with fun but less complex, tabular datasets. These competitions are great for people looking for something between the Titanic Getting Started competition and a Featured competition.

ANALYSIS: The performance of the preliminary XGBoost model achieved an RMSE benchmark of 0.5068. After a series of tuning trials, the refined XGBoost model processed the training dataset with a final RMSE score of 0.4883. When we applied the final model to Kaggle’s test dataset, the model achieved an RMSE score of 0.6996.
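
For reference, here is a minimal sketch of scoring an XGBoost regressor by RMSE on a validation split; the synthetic data again stands in for the competition’s prepared features.

```python
# Sketch: scoring an XGBoost regressor with RMSE, the competition metric.
# Synthetic data stands in for the competition's prepared features.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=1000, n_features=14, noise=0.5, random_state=7)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=7)

model = XGBRegressor(n_estimators=200, learning_rate=0.1)
model.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
print(f"Validation RMSE: {rmse:.4f}")
```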

CONCLUSION: In this iteration, the XGBoost model appeared to be a suitable algorithm for modeling this dataset.

Dataset Used: Kaggle Tabular Playground Series 2021 Jan Data Set

Dataset ML Model: Regression with numerical attributes

Dataset Reference: https://www.kaggle.com/c/tabular-playground-series-jan-2021

One potential source of performance benchmarks: https://www.kaggle.com/c/tabular-playground-series-jan-2021/leaderboard

The HTML formatted report can be found here on GitHub.

Binary Classification Model for Company Bankruptcy Prediction Using XGBoost Take 2

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: This project aims to construct a predictive model using various machine learning algorithms and document the end-to-end steps using a template. The Company Bankruptcy Prediction dataset is a binary classification situation where we attempt to predict one of the two possible outcomes.

INTRODUCTION: The research team collected the data from the Taiwan Economic Journal from 1999 to 2009. Company bankruptcy was defined based on the business regulations of the Taiwan Stock Exchange. Because failing to catch companies in a shaky financial situation is a costly business proposition, we will balance the precision and recall ratios by optimizing the F1 score.
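
As a rough illustration of optimizing for F1 on an imbalanced problem like this one, here is a minimal sketch; scale_pos_weight is one common XGBoost lever for rare positives, and the synthetic data and settings are illustrative only.

```python
# Sketch: using F1 as the selection metric on an imbalanced problem.
# scale_pos_weight ~ (negatives / positives) is a common starting point;
# the synthetic data and settings are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, weights=[0.97], random_state=7)
pos_weight = (y == 0).sum() / (y == 1).sum()

model = XGBClassifier(n_estimators=200, scale_pos_weight=pos_weight)
scores = cross_val_score(model, X, y, scoring="f1",
                         cv=StratifiedKFold(5, shuffle=True, random_state=7))
print(f"Mean F1: {scores.mean():.4f}")
```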

The data analysis first appeared in the research paper: Liang, D., Lu, C.-C., Tsai, C.-F., and Shih, G.-A. (2016). Financial Ratios and Corporate Governance Indicators in Bankruptcy Prediction: A Comprehensive Study. European Journal of Operational Research, vol. 252, no. 2, pp. 561-572.

In iteration Take1, we constructed and tuned several classic machine learning models using the Scikit-Learn library. We also observed the best results that we could obtain from the models.

This Take2 iteration will construct and tune an XGBoost model. We will also observe the best results that we can obtain from the model.

ANALYSIS: In iteration Take1, the machine learning algorithms’ average performance achieved an F1 score of 94.37%. Two algorithms (Extra Trees and Random Forest) produced the top F1 metrics after the first round of modeling. After a series of tuning trials, the Extra Trees model turned in an F1 score of 97.39% using the training dataset. When we applied the Extra Trees model to the previously unseen test dataset, we obtained an F1 score of 55.55%.

In this Take2 iteration, the XGBoost algorithm achieved an F1 score of 96.48% using the training dataset. After a series of tuning trials, the XGBoost model turned in an F1 score of 98.38%. When we applied the XGBoost model to the previously unseen test dataset, we obtained an F1 score of 58.18%.
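
The tuning trials mentioned in these iterations typically amount to a search over XGBoost’s main hyperparameters. Here is a hedged sketch of one common approach using GridSearchCV; the grid values and data are illustrative, not the exact search performed here.

```python
# Sketch: a small hyperparameter grid search of the kind behind the
# "tuning trials" above. The grid values and data are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=7)

param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1, 0.3],
    "n_estimators": [100, 300],
}
search = GridSearchCV(XGBClassifier(), param_grid, scoring="f1", cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, f"best F1: {search.best_score_:.4f}")
```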

CONCLUSION: In this iteration, the XGBoost model appeared to be a suitable algorithm for modeling this dataset. We should consider using the algorithm for further modeling.

Dataset Used: Company Bankruptcy Prediction Data Set

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Taiwanese+Bankruptcy+Prediction

One potential source of performance benchmarks: https://www.kaggle.com/fedesoriano/company-bankruptcy-prediction

The HTML formatted report can be found here on GitHub.

Multi-Class Classification Model for Sign Language MNIST Using Python and XGBoost

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: This project aims to construct a predictive model using various machine learning algorithms and document the end-to-end steps using a template. The Sign Language MNIST dataset is a multi-class classification situation where we attempt to predict one of several (more than two) possible outcomes.

INTRODUCTION: The original MNIST image dataset of handwritten digits is a popular benchmark for image-based machine learning methods. The Sign Language MNIST dataset follows the same CSV format, with labels and pixel values in single rows, to encourage the community to develop more drop-in replacements. The American Sign Language letter database of hand gestures represents a multi-class problem with 24 classes of letters (excluding J and Z, which require motion).

The dataset format is patterned to match the classic MNIST closely. Each training and test case represents a label (0-25) as a one-to-one map for each alphabetic letter A-Z (with no cases for 9=J or 25=Z because of their gesture motions). The training data (27,455 cases) and test data (7,172 cases) are approximately half the size of the standard MNIST but otherwise similar, with a header row of label, pixel1, pixel2 … pixel784, where each row represents a single 28×28 pixel image with grayscale values between 0 and 255. The original hand gesture image data represented multiple users repeating the gesture against different backgrounds.
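
Given that row-per-image layout, here is a minimal sketch of loading the CSV and recovering a single 28×28 image; the file name follows the Kaggle download and is an assumption.

```python
# Sketch: reading the MNIST-style CSV and recovering a 28x28 image.
# The file name follows the Kaggle download and may need adjusting.
import pandas as pd

train = pd.read_csv("sign_mnist_train.csv")
y_train = train["label"].to_numpy()                 # labels 0-25, none for 9 (J) or 25 (Z)
X_train = train.drop(columns=["label"]).to_numpy()  # pixel1..pixel784, values 0-255

first_image = X_train[0].reshape(28, 28)            # one flattened row -> 28x28 grid
print(first_image.shape, y_train[0])
```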

ANALYSIS: The performance of the preliminary XGBoost model achieved an accuracy benchmark of 95.41%. After a series of tuning trials, the best XGBoost model processed the training dataset with an accuracy score of 99.68%. When we applied the final model to the previously unseen test dataset, we obtained an accuracy score of 78.93%, which pointed to a high variance error.

CONCLUSION: In this iteration, the XGBoost model did not appear to be suitable for modeling this dataset. We should consider experimenting with other algorithms on this dataset.

Dataset Used: Sign Language MNIST Data Set

Dataset ML Model: Multi-Class classification with numerical attributes

Dataset Reference: https://www.kaggle.com/datamunge/sign-language-mnist

One potential source of performance benchmarks: https://www.kaggle.com/datamunge/sign-language-mnist

The HTML formatted report can be found here on GitHub.

XGBoost Machine Learning Templates v2 for Python

As I work on practicing and solving machine learning (ML) problems, I repeatedly find myself duplicating a set of steps and activities.

Thanks to Dr. Jason Brownlee’s suggestions on creating a machine learning template, I have pulled together a set of project templates that I use to experiment with modeling ML problems using Python and XGBoost.

Version 2 of the XGBoost templates contains minor adjustments and corrections to the previous version of the templates. The updated templates also include:

  • Scikit-learn’s ColumnTransformer, imputing, and pipeline utilities for feature scaling and transformation tasks (a minimal sketch follows)
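
As a rough illustration of that addition, here is a minimal sketch of a ColumnTransformer that imputes and scales numerical columns, imputes and one-hot encodes categorical columns, and wraps an XGBoost classifier in a single pipeline; the column names are placeholders. Keeping the transforms inside the pipeline ensures they are fitted only on the training folds during cross-validation, which avoids data leakage.

```python
# Sketch: ColumnTransformer + imputing + pipeline around XGBoost.
# Column lists are placeholders for a real dataset's schema.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier

numeric_cols = ["age", "amount"]          # placeholder numerical features
categorical_cols = ["gender", "region"]   # placeholder categorical features

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]),
     categorical_cols),
])

model = Pipeline([("prep", preprocess), ("xgb", XGBClassifier())])
# model.fit(X_train, y_train)  # fit as usual once the data is loaded
```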

You will find the Python templates on the Machine Learning Project Templates page.

Binary Classification Model for Credit Card Default Using Python and XGBoost

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Credit Card Default dataset is a binary classification situation where we attempt to predict one of the two possible outcomes.

INTRODUCTION: This dataset contains information on default payments, demographic factors, credit data, payment history, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.

ANALYSIS: The baseline performance of the XGBoost algorithm achieved an accuracy benchmark of 81.44%. After a series of tuning trials, the XGBoost model processed the training dataset with an accuracy score of 82.20%. When we applied the XGBoost algorithm to the previously unseen test dataset, we obtained an accuracy score of 81.81%.

CONCLUSION: In this iteration, the XGBoost model appeared to be a suitable algorithm for modeling this dataset. We should consider using the algorithm for further modeling.

Dataset Used: Default of Credit Card Clients Dataset

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

The HTML formatted report can be found here on GitHub.