Multi-Class Classification Model for Sign Language MNIST Using Python and Scikit-Learn

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: This project aims to construct a predictive model using various machine learning algorithms and document the end-to-end steps using a template. The Sign Language MNIST dataset is a multi-class classification situation where we attempt to predict one of several (more than two) possible outcomes.

INTRODUCTION: The original MNIST image dataset of handwritten digits is a popular benchmark for image-based machine learning methods. The Sign Language MNIST dataset follows the same CSV format, with labels and pixel values in single rows, to encourage the community to develop drop-in replacements. The American Sign Language letter database of hand gestures represents a multi-class problem with 24 classes of letters (excluding J and Z, which require motion).

The dataset format is patterned to match closely with the classic MNIST. Each training and test case carries a label (0-25) as a one-to-one map to the alphabetic letters A-Z (with no cases for 9=J or 25=Z because those gestures require motion). The training data (27,455 cases) and test data (7,172 cases) are approximately half the size of the standard MNIST but otherwise similar, with a header row of label, pixel1, pixel2 … pixel784, where each row represents a single 28×28 pixel image with grayscale values between 0 and 255. The original hand gesture image data represented multiple users repeating the gestures against different backgrounds.
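The row format described above (a label column followed by 784 pixel columns) can be sketched as follows. This is a minimal, hedged example: since the real train CSV is not bundled here, a tiny two-row stand-in is built in memory, but the parsing and reshaping logic is the same as for the downloaded file.

```python
import io

import numpy as np
import pandas as pd

# Tiny two-row stand-in for the real train CSV (header: label,pixel1..pixel784).
header = "label," + ",".join(f"pixel{i}" for i in range(1, 785))
rows = ["3," + ",".join("0" for _ in range(784)),
        "7," + ",".join("255" for _ in range(784))]
csv_text = "\n".join([header] + rows)

df = pd.read_csv(io.StringIO(csv_text))
y = df["label"].to_numpy()                                       # class labels (no 9=J, 25=Z)
X = df.drop(columns="label").to_numpy(dtype=np.float32) / 255.0  # scale 0-255 to 0-1
images = X.reshape(-1, 28, 28)                                   # one 28x28 grayscale image per row

print(y.shape, X.shape, images.shape)  # (2,) (2, 784) (2, 28, 28)
```

The same three lines of parsing code apply unchanged to the real `sign_mnist_train.csv`, only with 27,455 rows instead of two.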

ANALYSIS: The average performance of the machine learning algorithms achieved an accuracy benchmark of 96.38%. Two algorithms (Extra Trees and Random Forest) produced the top accuracy metrics after the first round of modeling. After a series of tuning trials, the Extra Trees model turned in an accuracy metric of 99.61%. When configured with the optimized parameters, the Extra Trees model processed the validation dataset with an accuracy score of 99.83%. When we applied the Extra Trees model to the previously unseen test dataset, we obtained an accuracy score of 83.49%, which pointed to a high variance error.

CONCLUSION: In this iteration, the Extra Trees model did not appear to be suitable for modeling this dataset. We should consider experimenting with other algorithms for this dataset.

Dataset Used: Sign Language MNIST Data Set

Dataset ML Model: Multi-Class classification with numerical attributes

Dataset Reference: https://www.kaggle.com/datamunge/sign-language-mnist

One source of potential performance benchmarks: https://www.kaggle.com/datamunge/sign-language-mnist

The HTML formatted report can be found here on GitHub.

Updated Scikit-Learn Machine Learning Templates v15 for Python

As I work on practicing and solving machine learning (ML) problems, I find myself repeating the same set of steps and activities.

Thanks to Dr. Jason Brownlee’s suggestions on creating a machine learning template, I have pulled together a set of project templates that I use to experiment with modeling ML problems using Python and Scikit-Learn.

Version 15 of the Scikit-Learn templates contains minor adjustments and corrections to the previous version. The updated templates include the following:

  • Introduced example code segments for splitting one original dataset into training, validation, and test datasets
  • Introduced example code segments for pre-processing and scaling data with Scikit-Learn’s pipeline
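The two template additions above can be illustrated together in a short sketch. This is an assumed minimal version, not the template itself: a synthetic dataset stands in for real data, and a 70/15/15 split is chosen arbitrarily for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an original dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=7)

# Split once into train vs. holdout, then split the holdout into validation and test.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.30, random_state=7, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.50, random_state=7, stratify=y_hold)

# A Pipeline fits the scaler on training data only, so scaling
# statistics never leak from the validation or test sets.
model = Pipeline([("scaler", StandardScaler()),
                  ("clf", LogisticRegression())])
model.fit(X_train, y_train)
print(round(model.score(X_val, y_val), 3))
```

Bundling the scaler and estimator in one Pipeline object also means the identical preprocessing is replayed automatically at prediction time.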

You will find the Python templates on the Machine Learning Project Templates page.

Binary Classification Model for BNP Paribas Cardif Claims Management Using Scikit-Learn Take 2

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The BNP Paribas Cardif Claims Management dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: As a global specialist in personal insurance, BNP Paribas Cardif sponsored a Kaggle competition to help them identify the categories of claims. In a world shaped by the emergence of new practices and behaviors generated by the digital economy, BNP Paribas Cardif would like to streamline its claims management practice. In this Kaggle challenge, the company challenged the participants to predict the category of a claim based on features available early in the process. Better predictions can help BNP Paribas Cardif accelerate its claims process and therefore provide a better service to its customers.

In iteration Take1, we constructed and tuned several machine learning models using the Scikit-learn library. Furthermore, we applied the best-performing machine learning model to Kaggle’s test dataset and submitted a list of predictions for evaluation.

In this Take2 iteration, we will construct and tune an XGBoost model. Furthermore, we will apply the XGBoost model to Kaggle’s test dataset and submit a list of predictions for evaluation.

ANALYSIS: From iteration Take1, the baseline performance of the machine learning algorithms achieved an average log loss of 0.6422. Two algorithms (Logistic Regression and Random Forest) achieved the top log loss metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in a better overall result. Random Forest achieved a log loss metric of 0.4722. When configured with the optimized parameters, the Random Forest model processed the validation dataset with a log loss of 0.4706, which was consistent with the model training phase. When we applied the Random Forest model to Kaggle’s test dataset, we obtained a log loss score of 0.4635.

From this Take2 iteration, the baseline performance of the XGBoost model achieved a log loss of 0.4706. After a series of tuning trials, the XGBoost model reached a log loss metric of 0.4650. When configured with the optimized parameters, the XGBoost model processed the validation dataset with a log loss of 0.4674, which was consistent with the model training phase. When we applied the XGBoost model to Kaggle’s test dataset, we obtained a log loss score of 0.4634.

CONCLUSION: For this iteration, the XGBoost model achieved the best overall results using the training and test datasets. For this dataset, we should consider further modeling with the XGBoost algorithm.

Dataset Used: BNP Paribas Cardif Claims Management Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/overview

One potential source of performance benchmark: https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/leaderboard

The HTML formatted report can be found here on GitHub.

Binary Classification Model for BNP Paribas Cardif Claims Management Using Scikit-Learn Take 1

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The BNP Paribas Cardif Claims Management dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: As a global specialist in personal insurance, BNP Paribas Cardif sponsored a Kaggle competition to help them identify the categories of claims. In a world shaped by the emergence of new practices and behaviors generated by the digital economy, BNP Paribas Cardif would like to streamline its claims management practice. In this Kaggle challenge, the company challenged the participants to predict the category of a claim based on features available early in the process. Better predictions can help BNP Paribas Cardif accelerate its claims process and therefore provide a better service to its customers.

In this Take1 iteration, we will construct and tune several machine learning models using the Scikit-learn library. Furthermore, we will apply the best-performing machine learning model to Kaggle’s test dataset and submit a list of predictions for evaluation.
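The spot-checking step described above, comparing several algorithms on the log loss metric via cross-validation, can be sketched as follows. This is a hedged illustration on synthetic data, not the project's actual code; the model list and fold count are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the competition's training data.
X, y = make_classification(n_samples=600, n_features=10, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
cv = KFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models.items():
    # Scikit-learn maximizes scores, so log loss is exposed as "neg_log_loss";
    # negate the mean to report the conventional (lower-is-better) value.
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_log_loss")
    print(f"{name}: log loss {-scores.mean():.4f} (+/- {scores.std():.4f})")
```

The model with the lowest cross-validated log loss would then be tuned and applied to Kaggle's test dataset.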

ANALYSIS: The baseline performance of the machine learning algorithms achieved an average log loss of 0.6422. Two algorithms (Logistic Regression and Random Forest) achieved the top log loss metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in a better overall result. Random Forest achieved a log loss metric of 0.4722. When configured with the optimized parameters, the Random Forest model processed the validation dataset with a log loss of 0.4706, which was consistent with the model training phase. When we applied the Random Forest model to Kaggle’s test dataset, we obtained a log loss score of 0.4635.

CONCLUSION: For this iteration, the Random Forest model achieved the best overall results using the training and test datasets. For this dataset, we should consider further modeling with the Random Forest algorithm.

Dataset Used: BNP Paribas Cardif Claims Management Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/overview

One potential source of performance benchmark: https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/leaderboard

The HTML formatted report can be found here on GitHub.

Binary Classification Model for Truck APS Failure Using Scikit-Learn Take 4

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Truck APS Failure dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure system (APS), which generates pressurized air that supports functions such as braking and gear changes. The dataset’s positive class consists of component failures for a specific component of the APS system. The negative class consists of trucks with failures for components not related to the APS. The training set contains 60,000 examples in total, of which 59,000 belong to the negative class and 1,000 to the positive class. The test set contains 16,000 examples.

The challenge is to minimize the total cost of a prediction model, defined as the sum of “Cost_1” multiplied by the number of instances with Type 1 failures and “Cost_2” multiplied by the number of instances with Type 2 failures. The “Cost_1” variable refers to the cost resulting from a redundant check by a mechanic at the workshop. Meanwhile, the “Cost_2” variable refers to the cost of not catching a faulty truck. The cost of a Type I error (Cost_1) is 10, while the cost of a Type II error (Cost_2) is 500.
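The cost function above maps directly onto a confusion matrix: false positives are Type I errors (redundant checks) and false negatives are Type II errors (missed faulty trucks). A minimal sketch, using toy labels rather than the real dataset:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

COST_FP = 10    # Cost_1: redundant mechanic check (Type I error)
COST_FN = 500   # Cost_2: missed faulty truck (Type II error)

# Toy ground-truth and predicted labels (1 = APS failure).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
total_cost = COST_FP * fp + COST_FN * fn
print(total_cost)  # 1 false positive + 1 false negative -> 10 + 500 = 510
```

The 50-to-1 cost ratio explains why the project tracks sensitivity/recall: a single missed failure costs as much as fifty unnecessary workshop checks.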

In iteration Take1, we constructed and tuned machine learning models for this dataset using the Scikit-Learn library. We also observed the best sensitivity/recall score that we could obtain using the tuned models with the training and test datasets.

In iteration Take2, we attempted to provide more balance to this imbalanced dataset by using “Synthetic Minority Oversampling TEchnique” or SMOTE for short. We increased the population of the minority class from approximately 0.1% to approximately 33% of the training instances. Furthermore, we also observed the best sensitivity/recall score that we could obtain using the tuned models with the training and test datasets.

In iteration Take3, we constructed and tuned machine learning models for this dataset using the XGBoost library. We also observed the best sensitivity/recall score that we could obtain using the tuned models with the training and test datasets.

In this Take4 iteration, we will attempt to provide more balance to this imbalanced dataset by re-sampling the training data. We will decrease the population of the majority class to be the same size as the minority class of the training instances. Furthermore, we will observe the best sensitivity/recall score that we can obtain using the tuned models with the training and test datasets.

ANALYSIS: From iteration Take1, the performance of the machine learning algorithms achieved an average recall metric of 59.26%. Two algorithms (Extra Trees and Random Forest) produced the top results after the first round of modeling. After a series of tuning trials, the Random Forest model completed the training phase and achieved a score of 68.53%. When configured with the optimized learning parameters, the Random Forest model processed the validation dataset with a score of 66.40%. Furthermore, the optimized model processed the test dataset with a score of 71.73% with a high Type II error rate.

From iteration Take2, the performance of the machine learning algorithms achieved an average recall metric of 98.21%. Two algorithms (Extra Trees and k-Nearest Neighbors) produced the top results after the first round of modeling. After a series of tuning trials, the Random Forest model completed the training phase and achieved a score of 99.72%. When configured with the optimized learning parameters, the Random Forest model processed the validation dataset with a score of 80.40%. Furthermore, the optimized model processed the test dataset with a score of 82.40% with a high Type II error rate.

From iteration Take3, the performance of the XGBoost algorithm achieved a baseline recall metric of 75.86%. After a series of tuning trials, the XGBoost model completed the training phase and achieved a score of 99.72%. When configured with the optimized learning parameters, the XGBoost model processed the validation dataset with a score of 72.80%. Furthermore, the optimized model processed the test dataset with a score of 78.93% with a high Type II error rate.

From this Take4 iteration, the performance of the XGBoost algorithm achieved a baseline recall metric of 96.67%. After a series of tuning trials, the XGBoost model completed the training phase and achieved a score of 96.80%. When configured with the optimized learning parameters, the XGBoost model processed the validation dataset with a score of 97.20%. Furthermore, the optimized model processed the test dataset with a score of 98.66% with a low Type II error rate.

CONCLUSION: For this iteration, the XGBoost model achieved the best overall results using the training and test datasets. For this dataset, we should consider using the XGBoost algorithm for further modeling and testing activities.

Dataset Used: APS Failure at Scania Trucks Data Set

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks

One potential source of performance benchmark: https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks

The HTML formatted report can be found here on GitHub.

Binary Classification Model for Truck APS Failure Using Scikit-Learn Take 3

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Truck APS Failure dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure system (APS), which generates pressurized air that supports functions such as braking and gear changes. The dataset’s positive class consists of component failures for a specific component of the APS system. The negative class consists of trucks with failures for components not related to the APS. The training set contains 60,000 examples in total, of which 59,000 belong to the negative class and 1,000 to the positive class. The test set contains 16,000 examples.

The challenge is to minimize the total cost of a prediction model, defined as the sum of “Cost_1” multiplied by the number of instances with Type 1 failures and “Cost_2” multiplied by the number of instances with Type 2 failures. The “Cost_1” variable refers to the cost resulting from a redundant check by a mechanic at the workshop. Meanwhile, the “Cost_2” variable refers to the cost of not catching a faulty truck. The cost of a Type I error (Cost_1) is 10, while the cost of a Type II error (Cost_2) is 500.

In iteration Take1, we constructed and tuned machine learning models for this dataset using the Scikit-Learn library. We also observed the best sensitivity/recall score that we could obtain using the tuned models with the training and test datasets.

In iteration Take2, we attempted to provide more balance to this imbalanced dataset by using the “Synthetic Minority Oversampling TEchnique,” or SMOTE for short. We up-sampled the minority class from approximately 0.1% to approximately 33% of the training instances. Furthermore, we also observed the best sensitivity/recall score that we could obtain using the tuned models with the training and test datasets.

In this Take3 iteration, we will construct and tune machine learning models for this dataset using the XGBoost library. We will observe the best sensitivity/recall score that we can obtain using the tuned models with the training and test datasets.

ANALYSIS: From iteration Take1, the performance of the machine learning algorithms achieved an average recall metric of 59.26%. Two algorithms (Extra Trees and Random Forest) produced the top results after the first round of modeling. After a series of tuning trials, the Random Forest model completed the training phase and achieved a score of 68.53%. When configured with the optimized learning parameters, the Random Forest model processed the validation dataset with a score of 66.40%. Furthermore, the optimized model processed the test dataset with a score of 71.73% with a high Type II error rate.

From iteration Take2, the performance of the machine learning algorithms achieved an average recall metric of 98.21%. Two algorithms (Extra Trees and k-Nearest Neighbors) produced the top results after the first round of modeling. After a series of tuning trials, the Random Forest model completed the training phase and achieved a score of 99.72%. When configured with the optimized learning parameters, the Random Forest model processed the validation dataset with a score of 80.40%. Furthermore, the optimized model processed the test dataset with a score of 82.40% with a high Type II error rate.

From this Take3 iteration, the performance of the XGBoost algorithm achieved a baseline recall metric of 75.86%. After a series of tuning trials, the XGBoost model completed the training phase and achieved a score of 99.72%. When configured with the optimized learning parameters, the XGBoost model processed the validation dataset with a score of 72.80%. Furthermore, the optimized model processed the test dataset with a score of 78.93% with a high Type II error rate.

CONCLUSION: For this iteration, the XGBoost model achieved the best overall results using the training and test datasets. For this dataset, we should consider using the Extra Trees algorithm for further modeling and testing activities.

Dataset Used: APS Failure at Scania Trucks Data Set

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks

One potential source of performance benchmark: https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks

The HTML formatted report can be found here on GitHub.

Binary Classification Model for Truck APS Failure Using Scikit-Learn Take 2

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Truck APS Failure dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure system (APS), which generates pressurized air that supports functions such as braking and gear changes. The dataset’s positive class consists of component failures for a specific component of the APS system. The negative class consists of trucks with failures for components not related to the APS. The training set contains 60,000 examples in total, of which 59,000 belong to the negative class and 1,000 to the positive class. The test set contains 16,000 examples.

The challenge is to minimize the total cost of a prediction model, defined as the sum of “Cost_1” multiplied by the number of instances with Type 1 failures and “Cost_2” multiplied by the number of instances with Type 2 failures. The “Cost_1” variable refers to the cost resulting from a redundant check by a mechanic at the workshop. Meanwhile, the “Cost_2” variable refers to the cost of not catching a faulty truck. The cost of a Type I error (Cost_1) is 10, while the cost of a Type II error (Cost_2) is 500.

In iteration Take1, we constructed and tuned machine learning models for this dataset using the Scikit-Learn library. We also observed the best sensitivity/recall score that we could obtain using the tuned models with the training and test datasets.

In this Take2 iteration, we will attempt to provide more balance to this imbalanced dataset by using “Synthetic Minority Oversampling TEchnique” or SMOTE for short. We will up-sample the minority class from approximately 0.1% to approximately 33% of the training instances. Furthermore, we will observe the best sensitivity/recall score that we can obtain using the tuned models with the training and test datasets.

ANALYSIS: From iteration Take1, the performance of the machine learning algorithms achieved an average recall metric of 59.26%. Two algorithms (Extra Trees and Random Forest) produced the top results after the first round of modeling. After a series of tuning trials, the Random Forest model completed the training phase and achieved a score of 68.53%. When configured with the optimized learning parameters, the Random Forest model processed the validation dataset with a score of 66.40%. Furthermore, the optimized model processed the test dataset with a score of 71.73% with a high Type II error rate.

From this Take2 iteration, the performance of the machine learning algorithms achieved an average recall metric of 95.79%. Two algorithms (Extra Trees and k-Nearest Neighbors) produced the top results after the first round of modeling. After a series of tuning trials, the Random Forest model completed the training phase and achieved a score of 99.67%. When configured with the optimized learning parameters, the Random Forest model processed the validation dataset with a score of 80.40%. Furthermore, the optimized model processed the test dataset with a score of 82.40% with a high Type II error rate.

CONCLUSION: For this iteration, the Extra Trees model achieved the best overall results using the training and test datasets. For this dataset, we should consider using the Extra Trees algorithm for further modeling and testing activities.

Dataset Used: APS Failure at Scania Trucks Data Set

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks

One potential source of performance benchmark: https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks

The HTML formatted report can be found here on GitHub.

Binary Classification Model for Truck APS Failure Using Scikit-Learn Take 1

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Truck APS Failure dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure system (APS), which generates pressurized air that supports functions such as braking and gear changes. The dataset’s positive class consists of component failures for a specific component of the APS system. The negative class consists of trucks with failures for components not related to the APS. The training set contains 60,000 examples in total, of which 59,000 belong to the negative class and 1,000 to the positive class. The test set contains 16,000 examples.

The challenge is to minimize the total cost of a prediction model, defined as the sum of “Cost_1” multiplied by the number of instances with Type 1 failures and “Cost_2” multiplied by the number of instances with Type 2 failures. The “Cost_1” variable refers to the cost resulting from a redundant check by a mechanic at the workshop. Meanwhile, the “Cost_2” variable refers to the cost of not catching a faulty truck. The cost of a Type I error (Cost_1) is 10, while the cost of a Type II error (Cost_2) is 500.

In this Take1 iteration, we will construct and tune machine learning models for this dataset using the Scikit-Learn library. We will observe the best sensitivity/recall score that we can obtain using the tuned models with the training and test datasets.
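The tuning-for-recall step described above can be sketched with a grid search scored on sensitivity/recall. This is a hedged example on synthetic imbalanced data; the parameter grid and fold count are illustrative assumptions, not the project's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Mildly imbalanced toy data standing in for the APS training set.
X, y = make_classification(n_samples=800, n_features=12, weights=[0.8],
                           random_state=21)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=21, stratify=y)

# Tune tree count and depth while scoring on recall, so the search favors
# models that catch positive cases (here, the costly Type II errors).
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=21), param_grid,
                      scoring="recall", cv=3)
search.fit(X_train, y_train)

test_recall = recall_score(y_test, search.predict(X_test))
print(search.best_params_, round(test_recall, 3))
```

Scoring the search on recall rather than accuracy matters here: with 59,000 negatives to 1,000 positives, a model that predicts "no failure" for every truck would score over 98% accuracy while missing every faulty APS.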

ANALYSIS: From this Take1 iteration, the performance of the machine learning algorithms achieved an average recall metric of 59.26%. Two algorithms (Extra Trees and Random Forest) produced the top results after the first round of modeling. After a series of tuning trials, the Random Forest model completed the training phase and achieved a score of 68.53%. When configured with the optimized learning parameters, the Random Forest model processed the validation dataset with a score of 66.40%. Furthermore, the optimized model processed the test dataset with a score of 71.73% with a high Type II error rate.

CONCLUSION: For this iteration, the Random Forest model achieved the best overall results using the training and test datasets. For this dataset, we should consider using the Random Forest algorithm for further modeling and testing activities.

Dataset Used: APS Failure at Scania Trucks Data Set

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks

One potential source of performance benchmark: https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks

The HTML formatted report can be found here on GitHub.