Binary Classification Model for Customer Transaction Prediction Using Python (Balanced Boosting)

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Santander Bank Customer Transaction Prediction competition is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Santander Bank’s data science team wants to identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The bank is continually challenging its machine learning algorithms to make sure they can more accurately identify new ways to solve its most common challenges such as: Will a customer buy this product? Can a customer pay this loan?

For this iteration, we will examine the effectiveness of the Balanced Boosting classifier (from the imbalanced-learn package) with inner balancing samplers to mitigate the effect of imbalanced data for this problem. Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.
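To make the evaluation metric concrete, here is a minimal sketch of how the area under the ROC curve can be computed from predicted probabilities and observed targets. It uses the rank-sum (Mann-Whitney) identity in plain Python. The function name roc_auc and the toy data are illustrative assumptions, not the competition's scoring code.

```python
def roc_auc(y_true, y_score):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity."""
    pairs = sorted(zip(y_score, y_true))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        # Tied scores share the average of their 1-based ranks.
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, t in pairs if t == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    rank_sum_pos = sum(r for r, (_, t) in zip(ranks, pairs) if t == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Toy data (hypothetical): two negatives and two positives.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc_example = roc_auc(y_true, y_score)
print(auc_example)  # 0.75
```

The score is the fraction of (positive, negative) pairs in which the positive case receives the higher predicted probability, with ties counted as half. A value of 0.5 therefore corresponds to a ranking that is no better than chance.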

ANALYSIS: The baseline performance achieved an average ROC-AUC score of 0.6844. After a series of tuning trials, the top result from the training data was a ROC-AUC score of 0.6860. By using the optimized parameters, the algorithm processed the test dataset with a ROC-AUC score of 0.5681.

CONCLUSION: To be determined after comparing the results from other machine learning algorithms.

Dataset Used: Santander Customer Transaction Prediction

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://www.kaggle.com/c/santander-customer-transaction-prediction/data

One potential source of performance benchmark: https://www.kaggle.com/c/santander-customer-transaction-prediction/overview

The HTML formatted report can be found here on GitHub.

Binary Classification Model for Customer Transaction Prediction Using Python (Balanced Random Forest)

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Santander Bank Customer Transaction Prediction competition is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Santander Bank’s data science team wants to identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The bank is continually challenging its machine learning algorithms to make sure they can more accurately identify new ways to solve its most common challenges such as: Will a customer buy this product? Can a customer pay this loan?

For this iteration, we will examine the effectiveness of the Balanced Random Forest classifier (from the imbalanced-learn package) with inner balancing samplers to mitigate the effect of imbalanced data for this problem. Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.
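The idea of an inner balancing sampler can be sketched as follows: before each tree is fitted, the training rows are resampled so that every class contributes the same number of rows. This is a simplified illustration with NumPy and not the imbalanced-learn implementation, whose sampling details may differ. The helper name balanced_bootstrap_indices and the 90/10 class split are assumptions made for the example.

```python
import numpy as np


def balanced_bootstrap_indices(y, rng):
    """Draw row indices so that every class is represented equally.

    Each class is sampled with replacement down to the size of the
    smallest class (a simplifying assumption for this sketch).
    """
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_per_class = counts.min()
    picked = [
        rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=True)
        for c in classes
    ]
    return np.concatenate(picked)


rng = np.random.default_rng(0)
# Hypothetical imbalanced target: 90 negatives, 10 positives.
y = np.array([0] * 90 + [1] * 10)
idx = balanced_bootstrap_indices(y, rng)
class_counts = np.bincount(y[idx])
print(class_counts)  # [10 10]
```

In a balanced forest, one such index set would be drawn per tree, so each tree sees a different balanced subset of the majority class.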

ANALYSIS: The baseline performance achieved an average ROC-AUC score of 0.8224. After a series of tuning trials, the top result from the training data was a ROC-AUC score of 0.8660. By using the optimized parameters, the algorithm processed the test dataset with a ROC-AUC score of 0.7761.

CONCLUSION: To be determined after comparing the results from other machine learning algorithms.

Dataset Used: Santander Customer Transaction Prediction

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://www.kaggle.com/c/santander-customer-transaction-prediction/data

One potential source of performance benchmark: https://www.kaggle.com/c/santander-customer-transaction-prediction/overview

The HTML formatted report can be found here on GitHub.

Kaggle Competition: Banco Santander Customer Transaction Prediction Update 2

If you are new to Python machine learning like me, you might find the current Kaggle competition “Santander Customer Transaction Prediction” interesting.

The competition is essentially a binary classification problem with a reasonably large dataset (200 attributes and 200,000 rows of training data). I have not participated in a Kaggle competition before and will use this one to get some learning under my belt.

I plan to run the training data through a list of machine learning algorithms (see below) and iterate them through three stages. This blog post will serve as the meta post that summarizes the progress.

The current plan with the milestones is as follows:

Stage 1: Gather the Baseline Performance.

  • LogisticRegression: completed and posted on Monday 25 February 2019
  • DecisionTreeClassifier: completed and posted on Wednesday 27 February 2019
  • KNeighborsClassifier: completed and posted on Friday 1 March 2019
  • BaggingClassifier: completed and posted on Sunday 3 March 2019
  • RandomForestClassifier: completed and posted on Monday 4 March 2019
  • ExtraTreesClassifier: completed and posted on Wednesday 6 March 2019
  • GradientBoostingClassifier: completed and posted on Friday 8 March 2019

Stage 2: Feature Selection using the Attribute Importance Ranking technique

  • BaggingClassifier: completed and posted on Wednesday 13 March 2019
  • RandomForestClassifier: completed and posted on Friday 15 March 2019
  • ExtraTreesClassifier: completed and posted on Sunday 17 March 2019
  • GradientBoostingClassifier: completed and posted on Monday 18 March 2019

Stage 3: Over-Sampling (SMOTE) and Balancing Ensembles techniques

  • LogisticRegression: completed and posted on Wednesday 20 March 2019
  • ExtraTreesClassifier: completed and posted on Friday 22 March 2019
  • RandomForestClassifier: planned for Monday 25 March 2019
  • GradientBoostingClassifier: planned for Wednesday 27 March 2019
  • Balanced Bagging: planned for Friday 29 March 2019
  • Balanced Boosting: planned for Sunday 31 March 2019
  • Balanced Random Forest: planned for Monday 1 April 2019
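The Stage 1 step of gathering baseline performance can be sketched as a loop that cross-validates each candidate algorithm with ROC-AUC scoring. This is only an outline under stated assumptions: the data is synthetic (generated with make_classification), only two of the listed algorithms are shown, and the fold count and random seeds are arbitrary choices rather than the settings used in the actual scripts.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the competition data (assumption).
X, y = make_classification(
    n_samples=600, n_features=20, weights=[0.9, 0.1], random_state=1
)

# Two of the Stage 1 algorithms with mostly default settings.
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=1),
}

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

baseline = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring="roc_auc", cv=cv)
    baseline[name] = scores.mean()
    print(f"{name}: mean ROC-AUC = {scores.mean():.4f} (std {scores.std():.4f})")
```

Adding another algorithm to the comparison only requires a new entry in the models dictionary.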

I post all Python scripts here on GitHub. The final submission deadline is 10 April 2019.

Feel free to take a look at the scripts and experiment. Who knows, you might have something you can turn in by the time April comes around. Happy learning and good luck!

Binary Classification Model for Customer Transaction Prediction Using Python (Extra Trees with SMOTE)

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Santander Bank Customer Transaction Prediction competition is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Santander Bank’s data science team wants to identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The bank is continually challenging its machine learning algorithms to make sure they can more accurately identify new ways to solve its most common challenges such as: Will a customer buy this product? Can a customer pay this loan?

For this iteration, we will examine the effectiveness of the Extra Trees algorithm with the synthetic over-sampling technique (SMOTE) to mitigate the effect of imbalanced data for this problem. Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.
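The core idea of SMOTE can be sketched in a few lines: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. The function below is a simplified NumPy illustration and not the imbalanced-learn implementation. The name smote_sketch, the toy points, and the choice of k are assumptions for the example.

```python
import numpy as np


def smote_sketch(X_min, n_new, k, rng):
    """Create n_new synthetic rows by interpolating between minority rows.

    X_min holds only minority-class rows; k must be smaller than len(X_min).
    """
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    # Pairwise squared Euclidean distances between minority rows.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(d2, axis=1)[:, :k]
    # Pick a base row and one of its k nearest neighbours for each new row.
    base = rng.integers(0, n, size=n_new)
    nb = neighbours[base, rng.integers(0, k, size=n_new)]
    # Interpolate at a random position between the base row and the neighbour.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[nb] - X_min[base])


rng = np.random.default_rng(42)
# Hypothetical minority-class rows with two attributes.
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.7]])
synthetic = smote_sketch(X_min, n_new=10, k=3, rng=rng)
print(synthetic.shape)  # (10, 2)
```

Because every synthetic row is an interpolation between two existing rows, the new rows stay inside the region already covered by the minority class.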

ANALYSIS: The baseline performance achieved an average ROC-AUC score of 0.9769. After a series of tuning trials, the top result from the training data was a ROC-AUC score of 0.9986. By using the optimized parameters, the algorithm processed the test dataset with a ROC-AUC score of 0.5036.

CONCLUSION: To be determined after comparing the results from other machine learning algorithms.

Dataset Used: Santander Customer Transaction Prediction

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://www.kaggle.com/c/santander-customer-transaction-prediction/data

One potential source of performance benchmark: https://www.kaggle.com/c/santander-customer-transaction-prediction/overview

The HTML formatted report can be found here on GitHub.

Binary Classification Model for Customer Transaction Prediction Using Python (Logistic Regression with SMOTE)

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Santander Bank Customer Transaction Prediction competition is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Santander Bank’s data science team wants to identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The bank is continually challenging its machine learning algorithms to make sure they can more accurately identify new ways to solve its most common challenges such as: Will a customer buy this product? Can a customer pay this loan?

For this iteration, we will examine the effectiveness of the Logistic Regression algorithm with the synthetic over-sampling technique (SMOTE) to mitigate the effect of imbalanced data for this problem. Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.

ANALYSIS: The baseline performance achieved an average ROC-AUC score of 0.8765. After a series of tuning trials, the top result from the training data was a ROC-AUC score of 0.8788. By using the optimized parameters, the algorithm processed the test dataset with a ROC-AUC score of 0.7776.
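The sequence of tuning on the training data and then scoring a held-out test set can be sketched as below. This is an outline under stated assumptions rather than the project's script: the data is synthetic, the parameter grid for C is hypothetical, and the over-sampling step is left out to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the competition data (assumption).
X, y = make_classification(
    n_samples=800, n_features=20, weights=[0.9, 0.1], random_state=3
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=3
)

# Tuning trials: cross-validated search over a hypothetical grid for C.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)
print("best cross-validated ROC-AUC:", round(search.best_score_, 4))

# Final check: score the refitted best model on rows it has never seen.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print("test ROC-AUC:", round(test_auc, 4))
```

Comparing the cross-validated score with the test score is what reveals how well the tuned model generalizes.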

CONCLUSION: To be determined after comparing the results from other machine learning algorithms.

Dataset Used: Santander Customer Transaction Prediction

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://www.kaggle.com/c/santander-customer-transaction-prediction/data

One potential source of performance benchmark: https://www.kaggle.com/c/santander-customer-transaction-prediction/overview

The HTML formatted report can be found here on GitHub.

Binary Classification Model for Customer Transaction Prediction Using Python (Gradient Boosting with Attribute Importance Ranking)

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Santander Bank Customer Transaction Prediction competition is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Santander Bank’s data science team wants to identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The bank is continually challenging its machine learning algorithms to make sure they can more accurately identify new ways to solve its most common challenges such as: Will a customer buy this product? Can a customer pay this loan?

For this iteration, we will examine the effectiveness of the Gradient Boosting algorithm with a reduced set of features (derived from using the Attribute Importance Ranking technique with the GradientBoostingClassifier algorithm) for this problem. Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.
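One way to derive a reduced feature set from attribute importance ranking is sketched below: fit a GradientBoostingClassifier, rank the attributes by its feature_importances_ values, and keep the highest-ranked columns. The synthetic data, the number of estimators, and the cut-off top_k = 8 are assumptions for illustration. The actual cut-off used in this project is not stated here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the training data (assumption).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, n_redundant=0, random_state=7
)

# Fit the model whose importances will drive the ranking.
model = GradientBoostingClassifier(n_estimators=50, random_state=7)
model.fit(X, y)

# Rank attributes from most to least important.
importances = model.feature_importances_
order = np.argsort(importances)[::-1]

# Keep the top_k attributes (hypothetical cut-off) in their original order.
top_k = 8
selected = np.sort(order[:top_k])
X_reduced = X[:, selected]
print("selected columns:", selected)
print("reduced shape:", X_reduced.shape)
```

The same selected column indices would then be applied to the test data before fitting and scoring the final model.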

ANALYSIS: The baseline performance achieved an average ROC-AUC score of 0.8322. After a series of tuning trials, the top result from the training data was a ROC-AUC score of 0.8619. By using the optimized parameters, the algorithm processed the test dataset with a ROC-AUC score of 0.5798.

CONCLUSION: To be determined after comparing the results from other machine learning algorithms.

Dataset Used: Santander Customer Transaction Prediction

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://www.kaggle.com/c/santander-customer-transaction-prediction/data

One potential source of performance benchmark: https://www.kaggle.com/c/santander-customer-transaction-prediction/overview

The HTML formatted report can be found here on GitHub.

Binary Classification Model for Customer Transaction Prediction Using Python (Random Forest with Attribute Importance Ranking)

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Santander Bank Customer Transaction Prediction competition is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Santander Bank’s data science team wants to identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The bank is continually challenging its machine learning algorithms to make sure they can more accurately identify new ways to solve its most common challenges such as: Will a customer buy this product? Can a customer pay this loan?

For this iteration, we will examine the effectiveness of the Random Forest algorithm with a reduced set of features (derived from using the Attribute Importance Ranking technique with the GradientBoostingClassifier algorithm) for this problem. Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.

ANALYSIS: The baseline performance achieved an average ROC-AUC score of 0.7208. After a series of tuning trials, the top result from the training data was a ROC-AUC score of 0.8330. By using the optimized parameters, the algorithm processed the test dataset with a ROC-AUC score of 0.5013.

CONCLUSION: To be determined after comparing the results from other machine learning algorithms.

Dataset Used: Santander Customer Transaction Prediction

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://www.kaggle.com/c/santander-customer-transaction-prediction/data

One potential source of performance benchmark: https://www.kaggle.com/c/santander-customer-transaction-prediction/overview

The HTML formatted report can be found here on GitHub.

Binary Classification Model for Customer Transaction Prediction Using Python (Extra Trees with Attribute Importance Ranking)

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Santander Bank Customer Transaction Prediction competition is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Santander Bank’s data science team wants to identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The bank is continually challenging its machine learning algorithms to make sure they can more accurately identify new ways to solve its most common challenges such as: Will a customer buy this product? Can a customer pay this loan?

For this iteration, we will examine the effectiveness of the Extra Trees algorithm with a reduced set of features (derived from using the Attribute Importance Ranking technique with the GradientBoostingClassifier algorithm) for this problem. Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.

ANALYSIS: The baseline performance achieved an average ROC-AUC score of 0.6658. After a series of tuning trials, the top result from the training data was a ROC-AUC score of 0.8441. By using the optimized parameters, the algorithm processed the test dataset with a ROC-AUC score of 0.5000.

CONCLUSION: To be determined after comparing the results from other machine learning algorithms.

Dataset Used: Santander Customer Transaction Prediction

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://www.kaggle.com/c/santander-customer-transaction-prediction/data

One potential source of performance benchmark: https://www.kaggle.com/c/santander-customer-transaction-prediction/overview

The HTML formatted report can be found here on GitHub.