Binary Classification Model for Santander Customer Satisfaction Using TensorFlow Take 3

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Santander Customer Satisfaction dataset presents a binary classification problem, where we are trying to predict one of two possible outcomes.

INTRODUCTION: Santander Bank sponsored a Kaggle competition to help them identify dissatisfied customers early in their relationship. Doing so would allow Santander to take proactive steps to improve a customer’s happiness before it’s too late. In this competition, Santander has provided hundreds of anonymized features to predict whether a customer is satisfied or dissatisfied with their banking experience. The competition evaluates submissions on the area under the ROC curve (AUC) between the predicted probability and the observed target.
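To make the evaluation metric concrete, here is a minimal sketch of how ROC-AUC can be computed from predicted probabilities and observed 0/1 targets. It uses the rank-sum (Mann-Whitney U) formulation in plain Python. The function name `roc_auc` and the sample values are illustrative assumptions, not part of the project code or of Kaggle's scoring implementation.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC via the rank-sum (Mann-Whitney U) formulation.

    y_true  : list of 0/1 labels (1 = positive class)
    y_score : list of predicted probabilities or scores
    Tied scores receive their average rank.
    """
    pairs = sorted(zip(y_score, y_true))
    n = len(pairs)
    n_pos = sum(y_true)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when only one class is present")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        # Ranks i+1 .. j (1-based) share the average rank (i + 1 + j) / 2.
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j

    # U statistic of the positives, normalized by the number of pos/neg pairs.
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Illustrative values only (not taken from the Santander data).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the scores reported in this write-up (roughly 0.80 to 0.82) sit between those two extremes.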

In iteration Take1, we constructed several Multilayer Perceptron (MLP) models with two hidden layers. We also observed the best result that we could obtain using the two-layer model. Lastly, we applied the MLP model to Kaggle’s test dataset and submitted a list of predictions to Kaggle for evaluation.

In iteration Take2, we constructed several Multilayer Perceptron (MLP) models with three hidden layers. We also observed the best result that we could obtain using the three-layer model. Lastly, we applied the MLP model to Kaggle’s test dataset and submitted a list of predictions to Kaggle for evaluation.

In this Take3 iteration, we will construct several Multilayer Perceptron (MLP) models with four hidden layers. We will also observe the best result that we can obtain using the four-layer model. Lastly, we will apply the MLP model to Kaggle’s test dataset and submit a list of predictions to Kaggle for evaluation.
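As a rough illustration of what a four-hidden-layer MLP for binary classification might look like in TensorFlow, the sketch below builds one with the Keras API. The helper name `build_mlp`, the ReLU activations, the Adam optimizer, and the `n_features` argument are assumptions made for this example; the write-up does not state which settings the project actually used. The default layer widths mirror the 224/160/128/64 configuration mentioned in the analysis.

```python
import tensorflow as tf


def build_mlp(n_features, hidden_units=(224, 160, 128, 64)):
    """Build a fully connected binary classifier (hypothetical helper).

    n_features   : number of input columns after preprocessing (assumed)
    hidden_units : width of each hidden layer, one entry per layer
    """
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(n_features,)))
    for units in hidden_units:
        # ReLU is an assumed activation, not confirmed by the source.
        model.add(tf.keras.layers.Dense(units, activation="relu"))
    # A single sigmoid unit outputs the probability of the positive class.
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))
    model.compile(
        optimizer="adam",  # assumed optimizer
        loss="binary_crossentropy",
        metrics=[tf.keras.metrics.AUC(name="auc")],
    )
    return model


# Example: a model for 20 input features (placeholder value).
model = build_mlp(n_features=20)
model.summary()
```

Trying a different layout only requires passing another tuple, for example `build_mlp(20, hidden_units=(64, 32))` for a two-layer variant like the one described for Take1.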

ANALYSIS: From iteration Take1, all two-layer models achieved a ROC-AUC performance of between 79.9% and 81.1% after 25 epochs using the validation dataset. The 64/32-node model appeared to have the highest ROC-AUC of 81.142% with low variance. Lastly, when we applied the two-layer neural network model to the test dataset from Kaggle, we obtained a ROC-AUC score of 80.460%.

From iteration Take2, all three-layer models achieved a ROC-AUC performance of between 79.3% and 81.5% after 25 epochs using the validation dataset. The 224/160/96-node model appeared to have the highest ROC-AUC of 81.56% with low variance. Lastly, when we applied the three-layer neural network model to the test dataset from Kaggle, we obtained a ROC-AUC score of 81.193%.

From this Take3 iteration, all four-layer models achieved a ROC-AUC performance of between 78.5% and 81.3% after 25 epochs using the validation dataset. The 224/160/128/64-node model appeared to have the highest ROC-AUC of 81.51% with low variance. Lastly, when we applied the four-layer neural network model to the test dataset from Kaggle, we obtained a ROC-AUC score of 81.665%.

CONCLUSION: For this iteration, the four-layer model with 224/160/128/64 nodes appeared to yield the best result. For this dataset, we should consider experimenting with more MLP models using different configurations.

Dataset Used: Santander Customer Satisfaction Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://www.kaggle.com/c/santander-customer-satisfaction/overview

One potential source of performance benchmarks: https://www.kaggle.com/c/santander-customer-satisfaction/leaderboard

The HTML-formatted report can be found here on GitHub.