Binary Classification Model for Santander Customer Satisfaction Using Scikit-Learn Take 3

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Santander Customer Satisfaction dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Santander Bank sponsored a Kaggle competition to help them identify dissatisfied customers early in their relationship. Doing so would allow Santander to take proactive steps to improve a customer’s happiness before it’s too late. In this competition, Santander has provided hundreds of anonymized features to predict if a customer is satisfied or dissatisfied with their banking experience. The exercise evaluates the submissions on the area under the ROC curve (AUC) between the predicted probability and the observed target.
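To make the evaluation metric concrete, here is a minimal sketch of computing the ROC-AUC with scikit-learn; the target and probability values below are hypothetical placeholders, not data from the competition.

```python
# Minimal sketch of the competition metric: ROC-AUC between the
# predicted probability and the observed target. The y_true and
# y_prob values below are hypothetical placeholders.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 0, 1, 0, 0, 1]                   # observed TARGET values
y_prob = [0.1, 0.3, 0.8, 0.2, 0.6, 0.4, 0.1, 0.9]   # predicted probabilities

print(f"ROC-AUC: {roc_auc_score(y_true, y_prob):.4f}")
```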

In iteration Take1, we constructed and tuned several machine learning models using the Scikit-learn library. Furthermore, we applied the best-performing machine learning model to Kaggle’s test dataset and submitted a list of predictions for evaluation.

In iteration Take2, we provided a more balanced dataset using the “Synthetic Minority Oversampling TEchnique,” or SMOTE for short. We increased the minority class through up-sampling from approximately 3.9% to 20% of the training instances. Furthermore, we applied the best-performing machine learning model to Kaggle’s test dataset and observed whether training on the balanced dataset had any positive impact on the prediction results.

In this Take3 iteration, we will construct and tune an XGBoost model. Furthermore, we will apply the XGBoost model to Kaggle’s test dataset and submit a list of predictions for evaluation.
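Below is a minimal sketch of the kind of XGBoost workflow this iteration describes. The file name, split ratio, and hyperparameter values are illustrative assumptions, not the exact settings used in this project.

```python
# Illustrative sketch of the Take3 workflow: train and evaluate an
# XGBoost classifier on the Santander data. File name, split ratio,
# and hyperparameters are assumptions, not the project's exact settings.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

df = pd.read_csv("train.csv")                 # Kaggle training file (assumed path)
X = df.drop(columns=["ID", "TARGET"])
y = df["TARGET"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7)

# Starting-point hyperparameters; these would be refined via tuning trials.
model = XGBClassifier(n_estimators=300, max_depth=4,
                      learning_rate=0.1, random_state=7)
model.fit(X_train, y_train)

probs = model.predict_proba(X_test)[:, 1]
print(f"Validation ROC-AUC: {roc_auc_score(y_test, probs):.4f}")
```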

ANALYSIS: From iteration Take1, the baseline performance of the machine learning algorithms achieved an average AUC of 67.94%. Two algorithms (Random Forest and Gradient Boosting) achieved the top AUC metrics after the first round of modeling. After a series of tuning trials, the Gradient Boosting model turned in a better overall result than Random Forest with a higher AUC. Gradient Boosting achieved an AUC metric of 83.60%, and the same Gradient Boosting model processed the test dataset with an AUC of 83.57%, which was consistent with the training result. Lastly, when we applied the Gradient Boosting model to the test dataset from Kaggle, we obtained a ROC-AUC score of 82.15%.

From iteration Take2, the baseline performance of the machine learning algorithms achieved an average AUC of 87.90%. Two algorithms (Random Forest and Gradient Boosting) achieved the top AUC metrics after the first round of modeling. After a series of tuning trials, the Gradient Boosting model turned in a better overall result than Random Forest with a higher AUC. Gradient Boosting achieved an AUC metric of 96.20%, and the same Gradient Boosting model processed the test dataset with an AUC of 81.93%, which indicated a high variance issue. Lastly, when we applied the Gradient Boosting model to the test dataset from Kaggle, we obtained a ROC-AUC score of 81.17%.

From this Take3 iteration, the baseline performance of the XGBoost model achieved an AUC of 83.97%. After a series of tuning trials, the XGBoost model processed the test dataset with an AUC of 83.98%, which was consistent with the training result. Lastly, when we applied the XGBoost model to the test dataset from Kaggle, we obtained a ROC-AUC score of 82.42%.

CONCLUSION: For this iteration, the XGBoost model achieved the best overall result using the training and test datasets. For this dataset, we should consider XGBoost and other machine learning algorithms for further modeling and testing.

Dataset Used: Santander Customer Satisfaction Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://www.kaggle.com/c/santander-customer-satisfaction/overview

One potential source of performance benchmark: https://www.kaggle.com/c/santander-customer-satisfaction/leaderboard

The HTML formatted report can be found here on GitHub.

Binary Classification Model for Santander Customer Satisfaction Using Scikit-Learn Take 2

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Santander Customer Satisfaction dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Santander Bank sponsored a Kaggle competition to help them identify dissatisfied customers early in their relationship. Doing so would allow Santander to take proactive steps to improve a customer’s happiness before it’s too late. In this competition, Santander has provided hundreds of anonymized features to predict if a customer is satisfied or dissatisfied with their banking experience. The exercise evaluates the submissions on the area under the ROC curve (AUC) between the predicted probability and the observed target.

In iteration Take1, we constructed and tuned several machine learning models using the Scikit-learn library. Furthermore, we applied the best-performing machine learning model to Kaggle’s test dataset and submitted a list of predictions for evaluation.

In this Take2 iteration, we will attempt to provide more balance to this imbalanced dataset using the “Synthetic Minority Oversampling TEchnique,” or SMOTE for short. We will up-sample the minority class from approximately 3.9% to 20% of the training instances. Furthermore, we will apply the best-performing machine learning model to Kaggle’s test dataset and observe whether training on the balanced dataset has any positive impact on the prediction results.
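A minimal sketch of the up-sampling step, using the imbalanced-learn implementation of SMOTE, is shown below. A sampling_strategy of 0.25 (the minority-to-majority ratio after resampling) makes the minority class roughly 20% of all training instances; the synthetic dataset and parameter values are assumptions for illustration.

```python
# Minimal sketch of the SMOTE up-sampling step using imbalanced-learn.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Stand-in for the Santander training split: a synthetic imbalanced
# dataset with a ~3.9% minority class (illustrative assumption).
X_train, y_train = make_classification(
    n_samples=10000, n_features=20, weights=[0.961], random_state=7)

# sampling_strategy is the minority-to-majority ratio after resampling;
# 0.25 makes the minority class ~20% of all training instances.
smote = SMOTE(sampling_strategy=0.25, random_state=7)
X_res, y_res = smote.fit_resample(X_train, y_train)

print("Before:", Counter(y_train))   # ~3.9% minority
print("After: ", Counter(y_res))     # ~20% minority
```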

ANALYSIS: From iteration Take1, the baseline performance of the machine learning algorithms achieved an average AUC of 67.94%. Two algorithms (Random Forest and Gradient Boosting) achieved the top AUC metrics after the first round of modeling. After a series of tuning trials, the Gradient Boosting model turned in a better overall result than Random Forest with a higher AUC. Gradient Boosting achieved an AUC metric of 83.60%, and the same Gradient Boosting model processed the test dataset with an AUC of 83.57%, which was consistent with the training result. Lastly, when we applied the Gradient Boosting model to the test dataset from Kaggle, we obtained a ROC-AUC score of 82.15%.

From this Take2 iteration, the baseline performance of the machine learning algorithms achieved an average AUC of 87.90%. Two algorithms (Random Forest and Gradient Boosting) achieved the top AUC metrics after the first round of modeling. After a series of tuning trials, the Gradient Boosting model turned in a better overall result than Random Forest with a higher AUC. Gradient Boosting achieved an AUC metric of 96.20%, and the same Gradient Boosting model processed the test dataset with an AUC of 81.93%, which indicated a high variance issue. Lastly, when we applied the Gradient Boosting model to the test dataset from Kaggle, we obtained a ROC-AUC score of 81.17%.

CONCLUSION: For this iteration, the Gradient Boosting model achieved the best overall result using the training and test datasets. For this dataset, we should consider Gradient Boosting and other machine learning algorithms for further modeling and testing.

Dataset Used: Santander Customer Satisfaction Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://www.kaggle.com/c/santander-customer-satisfaction/overview

One potential source of performance benchmark: https://www.kaggle.com/c/santander-customer-satisfaction/leaderboard

The HTML formatted report can be found here on GitHub.

Seth Godin’s Akimbo: Organized Learning

In his Akimbo podcast, Seth Godin teaches us how to adopt a posture of possibility, change the culture, and choose to make a difference. Here are my takeaways from the episode.

In this podcast, Seth discusses how organized learning has put the human race ahead of all other species and how we should leverage this learning approach to continually improve our culture.

Compared to other species, human beings have advanced much further primarily because we have figured out how to share learning from generation to generation. Many things we have chosen to believe are a function of organized learning.

Organized learning relies on peer-to-peer connection. Technology has also contributed dramatically to organized learning. Technology changes things because it permits far more diversity of thought and connection.

Organized learning takes many forms. When we give some people less access to tools, leverage, or learning, that is a form of organized learning. We are sending a message whenever we set expectations for people based on some pre-determined criteria. When we build something, when we connect, when we interact, we are organizing some learning for the people around us.

Organized learning also relies on the choices we make, and that is good news. It means, if we choose, we can change it. As we’re staring straight into many of the social issues that face us, we need to realize that we cannot fix these problems overnight. However, these problems are not permanent, either. They are not permanent because organized learning can change the situation.

For many years and through organized learning, we have created images and models and pathways forward for people based on things that are utterly unrelated to their skills or what they can contribute.

Now technology has put access to infinite amounts of knowledge in front of us. We must think about how we are going to organize this organized learning into a bundle of interactions that creates opportunities for more people.

It turns out that keeping some people out of the system is not a productive thing to do for society. Keeping some people down does not help others. In an economy that is based on connection, ideas, and possibility, we have discovered that, when we keep someone away from the chance to contribute, we do not get the benefit of their contribution.

These days, our culture is made up of more than just a few voices. Some of those voices are racist, angry voices that want to put other people down. But many of those voices are different. They are the voice of possibility, the voice of organizing a movement, and the voice of we-can-make-things-better. We have an opportunity to reorganize the organized learning around us.

Each of us not only lives in culture but also makes culture every day. Each of us has the chance to model behaviors that make things better in ways we never even imagined. Culture is formed from the grassroots level, by each of us in how we act, in what we say, in how we say it, and how we respond or choose to react.

Pure reaction does not get us very far, but an active response of openness or a hand to help somebody get to the next level can do wonders. These responses turn out to pay off for all of us.

Organized learning leads to a piece of culture. We should strive to learn from our situations and changes, and to do it in a more organized fashion for the better. That is what marketing tries to achieve. We have a chance to lay out a path, which can make things better by making better things. None of us can change all of the culture, but each of us can change the circle around us, a circle of people who want to act a certain way.

A Small Business Is Not Just a Smaller Version of a Big Business

(From a writer I respect, Seth Godin)

Fewer meetings, fewer resources, but also fewer constraints.

A small business's greatest advantage is that the owner can interact with customers face to face, and vice versa.

The way forward for a small business begins with what likely inspired you in the first place: not policies, groupthink, and leverage, but figuring out what people need and helping them get it right away.

Being a small business has never been easy, and it is even harder now. But resilience and flexibility can go hand in hand.

The first rule remains: figure out what people need first, and bring it to them.

Binary Classification Model for Santander Customer Satisfaction Using Scikit-Learn Take 1

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Santander Customer Satisfaction dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Santander Bank sponsored a Kaggle competition to help them identify dissatisfied customers early in their relationship. Doing so would allow Santander to take proactive steps to improve a customer’s happiness before it’s too late. In this competition, Santander has provided hundreds of anonymized features to predict if a customer is satisfied or dissatisfied with their banking experience. The exercise evaluates the submissions on the area under the ROC curve (AUC) between the predicted probability and the observed target.

In this Take1 iteration, we will construct and tune several machine learning models using the Scikit-learn library. Furthermore, we will apply the best-performing machine learning model to Kaggle’s test dataset and submit a list of predictions for evaluation.
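The sketch below illustrates the kind of algorithm spot-check this iteration performs: several scikit-learn models compared with cross-validated ROC-AUC. The candidate list, synthetic stand-in data, and cross-validation settings are assumptions, not the project's exact setup.

```python
# Illustrative spot-check of several scikit-learn algorithms using
# cross-validated ROC-AUC, as in this Take1 iteration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced stand-in for the Santander training data.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.961], random_state=7)

models = {
    "LR":  LogisticRegression(max_iter=1000),
    "RF":  RandomForestClassifier(n_estimators=100, random_state=7),
    "GBM": GradientBoostingClassifier(random_state=7),
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)

for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring="roc_auc", cv=cv)
    print(f"{name}: AUC {scores.mean():.4f} (+/- {scores.std():.4f})")
```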

ANALYSIS: In this Take1 iteration, the baseline performance of the machine learning algorithms achieved an average AUC of 67.94%. Two algorithms (Random Forest and Gradient Boosting) achieved the top AUC metrics after the first round of modeling. After a series of tuning trials, the Gradient Boosting model turned in a better overall result than Random Forest with a higher AUC. Gradient Boosting achieved an AUC metric of 83.60%. When configured with the optimized parameters, the Gradient Boosting model processed the test dataset with an AUC of 83.57%, which was consistent with the training result. However, when we applied the Gradient Boosting model to the test dataset from Kaggle, we obtained a ROC-AUC score of 82.15%.

CONCLUSION: For this iteration, the Gradient Boosting model achieved the best overall result using the training and test datasets. For this dataset, we should consider Gradient Boosting and other machine learning algorithms for further modeling and testing.

Dataset Used: Santander Customer Satisfaction Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://www.kaggle.com/c/santander-customer-satisfaction/overview

One potential source of performance benchmark: https://www.kaggle.com/c/santander-customer-satisfaction/leaderboard

The HTML formatted report can be found here on GitHub.

Algorithmic Trading Model for Exponential Moving Average Crossover Grid Search Batch Mode Using Colab

NOTE: This script is for learning purposes only and does not constitute a recommendation for buying or selling any stock mentioned in this script.

SUMMARY: The purpose of this project is to construct and test an algorithmic trading model and document the end-to-end steps using a template.

INTRODUCTION: This algorithmic trading model examines a series of exponential moving average (MA) models via a grid search methodology. When the fast moving-average curve crosses above the slow moving-average curve, the strategy goes long (buys) on the stock. When the opposite occurs, the strategy exits the position.

The grid search methodology will search through all combinations between the two MA curves. The faster MA curve can range from 5 days to 20 days, while the slower MA can range from 10 days to 50 days. Both curves use a 5-day increment.
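A minimal sketch of the crossover logic and the grid search over the two EMA windows appears below, using pandas on a synthetic price series; in the project itself the data comes from Quandl, and the backtest details here are illustrative assumptions.

```python
# Sketch of the EMA crossover grid search: go long when the fast EMA
# crosses above the slow EMA, exit when it crosses back below.
# Window ranges follow the post: fast 5-20, slow 10-50, 5-day steps,
# keeping only pairs where fast < slow. Prices are a synthetic stand-in.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
close = pd.Series(100 + rng.normal(0, 1, 500).cumsum())  # stand-in prices

def backtest_ema(close, fast, slow):
    fast_ma = close.ewm(span=fast, adjust=False).mean()
    slow_ma = close.ewm(span=slow, adjust=False).mean()
    # Hold a long position (1) while fast > slow, otherwise flat (0);
    # shift by one day so trades happen after the signal appears.
    position = (fast_ma > slow_ma).astype(int).shift(1).fillna(0)
    returns = close.pct_change().fillna(0)
    return (1 + position * returns).prod() - 1  # total strategy return

results = {(f, s): backtest_ema(close, f, s)
           for f in range(5, 25, 5)
           for s in range(10, 55, 5) if f < s}
best = max(results, key=results.get)
print(f"Best (fast, slow): {best}, return {results[best]:.2%}")
```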

ANALYSIS: This is the Google Colab version of the iPython notebook posted on June 16, 2020. The script saves all output for each stock to a text file on a Google Drive path. The Colab script contains an example of processing 100 different stocks in one batch.

CONCLUSION: Please refer to the individual output file for each stock.

Dataset ML Model: Time series analysis with numerical attributes

Dataset Used: Quandl

The HTML formatted report can be found here on GitHub.

Algorithmic Trading Model for Simple Moving Average Crossover Grid Search Batch Mode Using Colab

NOTE: This script is for learning purposes only and does not constitute a recommendation for buying or selling any stock mentioned in this script.

SUMMARY: The purpose of this project is to construct and test an algorithmic trading model and document the end-to-end steps using a template.

INTRODUCTION: This algorithmic trading model examines a series of simple moving average (MA) models via a grid search methodology. When the fast moving-average curve crosses above the slow moving-average curve, the strategy goes long (buys) on the stock. When the opposite occurs, the strategy exits the position.

The grid search methodology will search through all combinations between the two MA curves. The faster MA curve can range from 5 days to 20 days, while the slower MA can range from 10 days to 50 days. Both curves use a 5-day increment.
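The simple-MA variant differs from the exponential version only in how the two curves are computed: a rolling mean instead of an exponentially weighted one. A minimal sketch of the signal computation for one grid combination is shown below; the price series and window choice are illustrative assumptions.

```python
# Sketch of the simple moving average crossover signal. The only change
# from the EMA version is the rolling mean; grid ranges are the same
# (fast 5-20, slow 10-50, 5-day steps). Prices are a synthetic stand-in.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
close = pd.Series(100 + rng.normal(0, 1, 500).cumsum())  # stand-in prices

fast, slow = 10, 40                                  # one grid combination
fast_ma = close.rolling(window=fast).mean()
slow_ma = close.rolling(window=slow).mean()
long_signal = (fast_ma > slow_ma).astype(int)        # 1 = hold long, 0 = flat
trades = long_signal.diff()                          # +1 = buy, -1 = sell
print(trades.value_counts())
```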

ANALYSIS: This is the Google Colab version of the iPython notebook posted on June 16, 2020. The script saves all output for each stock to a text file on a Google Drive path. The Colab script contains an example of processing 100 different stocks in one batch.

CONCLUSION: Please refer to the individual output file for each stock.

Dataset ML Model: Time series analysis with numerical attributes

Dataset Used: Quandl

The HTML formatted report can be found here on GitHub.

Multi-Class Model for Human Activity Recognition Using TensorFlow Take 6

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Human Activity Recognition Using Smartphones dataset is a multi-class classification situation where we are trying to predict one of several (more than two) possible outcomes.

INTRODUCTION: Researchers collected the datasets from experiments that consist of a group of 30 volunteers, with each person performing six activities while wearing a smartphone on the waist. With its embedded accelerometer and gyroscope, the research captured measurements for the activities of WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, and LAYING. The dataset has been randomly partitioned into two sets, where 70% of the volunteers were selected for generating the training data and 30% for the test data.

In previous iterations, the script focused on evaluating various classic machine learning algorithms and identifying the algorithm that produces the best accuracy metric. The previous iterations established a baseline performance in terms of accuracy and processing time.

In iteration Take1, we constructed and tuned an XGBoost machine learning model for this dataset. We also observed the best accuracy result that we could obtain using the XGBoost model with the training and test datasets.

In iteration Take2, we constructed several Multilayer Perceptron (MLP) models with one hidden layer. These simple MLP models would serve as a benchmark as we build more complex MLP models in future iterations.

In iteration Take3, we constructed several Multilayer Perceptron (MLP) models with two hidden layers. We also experimented with the dropout layers as a regularization technique for improving our models.

In iteration Take4, we constructed several Multilayer Perceptron (MLP) models with three hidden layers. We also experimented with the dropout layers (25% or 0.25) as a regularization technique for improving our models.

In iteration Take5, we constructed several Multilayer Perceptron (MLP) models with four hidden layers. We also experimented with the dropout layers (25% or 0.25) as a regularization technique for improving our models.

In this Take6 iteration, we will construct MLP models with four hidden layers of 2048/1024/512/256 nodes. We will also fine-tune the model with different dropout rates at each layer.
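A minimal sketch of the Take6 architecture appears below: four hidden layers of 2048/1024/512/256 nodes with a tunable dropout rate after each, built with the Keras API in TensorFlow. The optimizer, loss, and other training settings are assumptions for illustration; the best-performing dropout configuration from this iteration is used as the default.

```python
# Sketch of the Take6 MLP: four hidden layers of 2048/1024/512/256 nodes
# with a tunable dropout rate after each, for the 561-feature, 6-class
# HAR dataset. Optimizer and training settings are assumptions.
from tensorflow.keras import layers, models

def build_mlp(dropout_rates=(0.75, 0.75, 0.25, 0.25)):
    model = models.Sequential([layers.Input(shape=(561,))])  # 561 HAR features
    for nodes, rate in zip((2048, 1024, 512, 256), dropout_rates):
        model.add(layers.Dense(nodes, activation="relu"))
        model.add(layers.Dropout(rate))            # regularization, tuned per layer
    model.add(layers.Dense(6, activation="softmax"))  # six activity classes
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_mlp()
model.summary()
# model.fit(X_train, y_train, epochs=100, validation_data=(X_test, y_test))
```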

ANALYSIS: From iteration Take1, the XGBoost model achieved an accuracy metric of 99.45% in training. When configured with the optimized parameters, the XGBoost model processed the test dataset with an accuracy of 94.94%, which indicated a high variance issue. We will need to explore regularization techniques or other modeling approaches before deploying the model for production use.

From iteration Take2, the one-layer MLP models achieved an accuracy metric of between 98.8% and 99.3% after 50 epochs in training. Those same models processed the test datasets with an accuracy metric of between 93.0% and 95.9%.

From iteration Take3, the two-layer MLP models achieved an accuracy metric of between 96.2% and 98.5% after 50 epochs in training. Those same models processed the test datasets with an accuracy metric of between 93.6% and 96.2%.

From iteration Take5, the four-layer MLP models achieved an accuracy metric of between 91.9% and 98.3% after 100 epochs in training. Those same models processed the test datasets with an accuracy metric of between 85.0% and 95.8%.

For this Take6 iteration, the best four-layer MLP model with dropout layers of (0.75, 0.75, 0.25, 0.25) achieved an accuracy metric of 93.11% after 100 epochs in training. The same model processed the test datasets with an accuracy metric of 93.00%.

CONCLUSION: For this iteration, the four-layer MLP models produced mixed results with smaller but still noticeable variance. However, the model with 2048/1024/512/256 nodes with dropout layers of (0.75, 0.75, 0.25, 0.25) produced a balance of high accuracy and low variance. For this dataset, we can explore other modeling approaches to reduce variance before deploying the model for production use.

Dataset Used: Human Activity Recognition Using Smartphones

Dataset ML Model: Multi-class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

The HTML formatted report can be found here on GitHub.