Binary Classification Model for Customer Transaction Prediction Using Python (Gradient Boosting with SMOTE)

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Santander Bank Customer Transaction Prediction competition is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Santander Bank’s data science team wants to identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The bank is continually challenging its machine learning algorithms to make sure they can more accurately identify new ways to solve its most common challenges such as: Will a customer buy this product? Can a customer pay this loan?

For this iteration, we will examine the effectiveness of the Gradient Boosting algorithm combined with the Synthetic Minority Over-sampling Technique (SMOTE) to mitigate the effect of imbalanced data for this problem. Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.
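To make the over-sampling idea concrete, here is a minimal sketch of SMOTE-style interpolation using only the Python standard library. It is not the script used for this project (which relies on a library implementation); the function name `smote_sketch`, the toy rows, and the values of `k` and `n_new` are all assumptions for illustration. Each synthetic row is placed at a random point on the line segment between a minority row and one of its nearest minority neighbours.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by SMOTE-style interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority rows by Euclidean distance to the base row.
        others = [row for j, row in enumerate(minority) if j != i]
        others.sort(key=lambda row: math.dist(base, row))
        neighbour = rng.choice(others[:k])
        # Place the new row at a random point on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class rows with two numeric attributes.
minority_rows = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_rows = smote_sketch(minority_rows, n_new=6, k=2)
print(len(new_rows))
```

Because every synthetic row is an interpolation between two existing minority rows, the new rows stay inside the region the minority class already occupies. In practice the over-sampling would normally be applied to the training folds only, so that the validation data keeps its original class balance.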

ANALYSIS: The baseline performance achieved an average ROC-AUC score of 0.9092. After a series of tuning trials, the top result from the training data was a ROC-AUC score of 0.9405. By using the optimized parameters, the algorithm processed the test dataset with a ROC-AUC score of 0.6289.
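For readers who want to see what the ROC-AUC numbers above measure, the following is a small sketch that computes the metric from scratch. It uses the pairwise interpretation of the score: the probability that a randomly chosen positive example receives a higher predicted score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are assumptions for illustration, and the double loop is only practical for small inputs.

```python
def roc_auc(y_true, y_score):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the metric is a common choice for imbalanced problems where plain accuracy can be misleading.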

CONCLUSION: To be determined after comparing the results from other machine learning algorithms.

Dataset Used: Santander Customer Transaction Prediction

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://www.kaggle.com/c/santander-customer-transaction-prediction/data

One potential source of performance benchmark: https://www.kaggle.com/c/santander-customer-transaction-prediction/overview

The HTML formatted report can be found here on GitHub.

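Time and Money

(From Seth Godin, a writer I like and respect.)

“I can’t afford it.”

“I don’t have the time.”

These almost always mean, “It’s not a priority for me.”

When we care about something, the amount we are able to do is often surprising. One way to choose to care is to identify your priorities clearly, which means spelling them out in your own words.

So we can say to ourselves, “I might like to do this, but it’s not a priority.”

Remarkable work is usually done by people with atypical priorities.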

Binary Classification Model for Customer Transaction Prediction Using Python (Balanced Bagging)

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Santander Bank Customer Transaction Prediction competition is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Santander Bank’s data science team wants to identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The bank is continually challenging its machine learning algorithms to make sure they can more accurately identify new ways to solve its most common challenges such as: Will a customer buy this product? Can a customer pay this loan?

For this iteration, we will examine the effectiveness of the Balanced Bagging classifier (from the imbalanced-learn package) with inner balancing samplers to mitigate the effect of imbalanced data for this problem. Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.
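The "inner balancing sampler" idea can be sketched without any library: before each member of the bagging ensemble is trained, the majority class is randomly under-sampled so that the bag contains the same number of rows from each class. The sketch below shows only that sampling step. The function name `balanced_bootstrap`, the 90/10 toy class split, and the number of bags are assumptions for illustration, and the code does not call imbalanced-learn itself.

```python
import random
from collections import Counter


def balanced_bootstrap(y, rng):
    """Return row indices for one bag with equal counts of each class."""
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    n_per_class = min(len(rows) for rows in by_class.values())
    bag = []
    for rows in by_class.values():
        # Random under-sampling: keep n_per_class rows of every class,
        # drawn without replacement.
        bag.extend(rng.sample(rows, n_per_class))
    rng.shuffle(bag)
    return bag


# Hypothetical imbalanced target: 90 negatives and 10 positives.
y = [0] * 90 + [1] * 10
rng = random.Random(42)
bags = [balanced_bootstrap(y, rng) for _ in range(5)]
for bag in bags:
    print(Counter(y[i] for i in bag))
```

Each bag sees all of the minority rows but a different random slice of the majority rows, so the ensemble as a whole still uses much of the majority data while every individual learner trains on a balanced sample.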

ANALYSIS: The baseline performance achieved an average ROC-AUC score of 0.7144. After a series of tuning trials, the top result from the training data was a ROC-AUC score of 0.7799. By using the optimized parameters, the algorithm processed the test dataset with a ROC-AUC score of 0.6659.

CONCLUSION: To be determined after comparing the results from other machine learning algorithms.

Dataset Used: Santander Customer Transaction Prediction

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://www.kaggle.com/c/santander-customer-transaction-prediction/data

One potential source of performance benchmark: https://www.kaggle.com/c/santander-customer-transaction-prediction/overview

The HTML formatted report can be found here on GitHub.

Drucker on Principles of Innovation, Part 2

In his book, The Essential Drucker: The Best of Sixty Years of Peter Drucker’s Essential Writings on Management, Peter Drucker analyzed the ways that management practices and principles affect the performance of organizations, individuals, and society. The book covers the basic principles of management and gives professionals the tools to perform the tasks that the environment of tomorrow will require of them.

These are my takeaways from reading the book.

Drucker believed that innovation is a practice, something that we can learn to do by applying hard, organized, purposeful work. He discussed the five “dos” of innovation.

  1. Purposeful innovation begins with a systematic analysis of the opportunities.
  2. Innovation is both conceptual and perceptual.
  3. Effective innovation must be simple and focused.
  4. Effective innovations start small, not grandiose.
  5. A successful innovation aims at the leadership of something.

Drucker also outlined three things that we should not do when building a practice of innovation.

The first “do not” is trying to be clever. Human beings are the ultimate user of the innovations, and incompetence or carelessness can easily derail an innovation. Anything too clever, whether in design or execution, is almost certain to fail.

The second “do not” is trying to diversify or to do too many things at once. When we design innovations that stray from a core, the innovations are likely to become diffuse.

Innovation needs the concentrated energy of a unified effort behind it. It also requires that the people who put together the innovation collaborate well with each other. This collaboration requires a unity or a common core. The core does not have to be technology or knowledge. Drucker believed that market knowledge supplies a better core of unity in any enterprise than technology does.

Lastly, Drucker suggested not trying to innovate for the future but to innovate for the present. Innovation may have long-range impact, but focusing solely on the distant future might incur the opportunity costs of shorter-term benefits.

Drucker explained the “not-waiting-for-the-future” suggestion using a pharmaceutical example. Although many years of research and development work are common in pharmaceutical research, no pharmaceutical company would start a research project for something that does not have potential, immediate application for needs that already exist.

Unless there is an immediate application in the present, innovation is like the drawings in Leonardo da Vinci’s notebook: a “brilliant idea,” but that is all.

Binary Classification Model for Customer Transaction Prediction Using Python (Balanced Boosting)

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Santander Bank Customer Transaction Prediction competition is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Santander Bank’s data science team wants to identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The bank is continually challenging its machine learning algorithms to make sure they can more accurately identify new ways to solve its most common challenges such as: Will a customer buy this product? Can a customer pay this loan?

For this iteration, we will examine the effectiveness of the Balanced Boosting classifier (from the imbalanced-learn package) with inner balancing samplers to mitigate the effect of imbalanced data for this problem. Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.

ANALYSIS: The baseline performance achieved an average ROC-AUC score of 0.6844. After a series of tuning trials, the top result from the training data was a ROC-AUC score of 0.6860. By using the optimized parameters, the algorithm processed the test dataset with a ROC-AUC score of 0.5681.

CONCLUSION: To be determined after comparing the results from other machine learning algorithms.

Dataset Used: Santander Customer Transaction Prediction

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://www.kaggle.com/c/santander-customer-transaction-prediction/data

One potential source of performance benchmark: https://www.kaggle.com/c/santander-customer-transaction-prediction/overview

The HTML formatted report can be found here on GitHub.

False Merit, Appearing Real

In his podcast, Akimbo, Seth Godin teaches us how to adopt a posture of possibility, change the culture, and choose to make a difference. Here are my takeaways from the episode.

Our understanding of science and engineering gives us highly reliable ways to evaluate the merit of a natural material for a given application. But we face many challenges when we try to evaluate the merit of people for a given situation. Many questions arise: How do we decide what merit is? What are the rules? What are we measuring?

Studying organized sports can give us many insights into the evaluation of merit. Merit in baseball used to be solely about a player’s physical attributes and batting average. The lesson from Billy Beane, as chronicled by Michael Lewis in his book Moneyball, showed that other factors should be considered. By seeing what the true “merits” were, as opposed to merely extrapolating future performance from past performance, Billy Beane was able to beat the system.

The college admission system is another area where the discussion of merit is always ongoing. In the search for “well-rounded” candidates, colleges use various mechanisms such as grades, SAT scores, and participation in sports to evaluate the merit of a student. As soon as an institution that can award merit chooses an aspect of performance that can be gamed, people will begin to cheat.

It is important to keep in mind that merit for one system does not dictate the value we can bring to the culture as a human being. Merit often begins in our head – we have been taught or brainwashed into believing we should value certain measurements much more than some others.

In our culture, we have mistaken a certain measurement for the true merit on which to base a selection. The SAT score does not measure our aptitude for succeeding in an academic environment. It is more a measure of our economic circumstances and of how much our parents spent on preparing us for the test.

We should be mindful that one form of merit does not apply to all aspects of our lives. Scoring well on an SAT exam does not say anything about how successfully we will handle many other aspects in our school, work, and family. Our culture no longer has just one door where everyone must fit through to achieve merit. Our culture is being radically shifted in so many directions, and there are many doors that an individual can explore.

Perhaps our mindset about which merit we should focus on needs to change. One test of true merit is whether it comes from contributing, from a contribution that we can be proud of.

What is much more important about merit is the story. We tell ourselves the story about the possibility and about our ability to contribute to the story. We tell ourselves about grit, about resilience, and about speaking up when we need to speak up. These days, we live in many circles, and there is not just one path. There are many ways to earn merit, so we should ask ourselves where we can contribute and how we can pick ourselves.

If we are going to seek merit from a system that awards merit based on corruption, we will inherently become corrupted. Instead, we can earn merit by picking ourselves and by giving ourselves the challenge of showing up, turning on a light, solving interesting problems, and leading others.

Binary Classification Model for Customer Transaction Prediction Using Python (Balanced Random Forest)

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Santander Bank Customer Transaction Prediction competition is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Santander Bank’s data science team wants to identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The bank is continually challenging its machine learning algorithms to make sure they can more accurately identify new ways to solve its most common challenges such as: Will a customer buy this product? Can a customer pay this loan?

For this iteration, we will examine the effectiveness of the Balanced Random Forest classifier (from the imbalanced-learn package) with inner balancing samplers to mitigate the effect of imbalanced data for this problem. Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.

ANALYSIS: The baseline performance achieved an average ROC-AUC score of 0.8224. After a series of tuning trials, the top result from the training data was a ROC-AUC score of 0.8660. By using the optimized parameters, the algorithm processed the test dataset with a ROC-AUC score of 0.7761.

CONCLUSION: To be determined after comparing the results from other machine learning algorithms.

Dataset Used: Santander Customer Transaction Prediction

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://www.kaggle.com/c/santander-customer-transaction-prediction/data

One potential source of performance benchmark: https://www.kaggle.com/c/santander-customer-transaction-prediction/overview

The HTML formatted report can be found here on GitHub.

Kaggle Competition: Banco Santander Customer Transaction Prediction Update 2

If you are new to Python machine learning like me, you might find the current Kaggle competition “Santander Customer Transaction Prediction” interesting.

The competition is essentially a binary classification problem with a decently large dataset (200 attributes and 200,000 rows of training data). I have not participated in a Kaggle competition before and will use this one to get some learning under my belt.

I plan to run the training data through a list of machine learning algorithms (see below) and iterate them through three stages. This blog post will serve as the meta post that summarizes the progress.

The current plan with the milestones is as follows:

Stage 1: Gather the Baseline Performance.

  • LogisticRegression: completed and posted on Monday 25 February 2019
  • DecisionTreeClassifier: completed and posted on Wednesday 27 February 2019
  • KNeighborsClassifier: completed and posted on Friday 1 March 2019
  • BaggingClassifier: completed and posted on Sunday 3 March 2019
  • RandomForestClassifier: completed and posted on Monday 4 March 2019
  • ExtraTreesClassifier: completed and posted on Wednesday 6 March 2019
  • GradientBoostingClassifier: completed and posted on Friday 8 March 2019

Stage 2: Feature Selection using the Attribute Importance Ranking technique

  • BaggingClassifier: completed and posted on Wednesday 13 March 2019
  • RandomForestClassifier: completed and posted on Friday 15 March 2019
  • ExtraTreesClassifier: completed and posted on Sunday 17 March 2019
  • GradientBoostingClassifier: completed and posted on Monday 18 March 2019
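As a rough illustration of the Stage 2 idea, the sketch below ranks attributes by an importance score and keeps the top-ranked ones until they account for a chosen share of the total importance. The function name `select_by_importance`, the 0.9 cut-off, and the importance values are all assumptions for illustration; in the actual scripts the scores would come from a fitted tree-based ensemble, and the cut-off used there may differ.

```python
def select_by_importance(importances, keep_fraction=0.9):
    """Keep the highest-ranked features until they cover keep_fraction
    of the total importance."""
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    selected, covered = [], 0.0
    for name, score in ranked:
        if covered >= keep_fraction * total:
            break
        selected.append(name)
        covered += score
    return selected


# Hypothetical importance scores for four attributes.
scores = {"var_0": 0.05, "var_1": 0.40, "var_2": 0.30, "var_3": 0.25}
print(select_by_importance(scores, keep_fraction=0.9))
# ['var_1', 'var_2', 'var_3']
```

The model is then retrained on the selected attributes only, which trades a small amount of information for shorter training time on a dataset of this size.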

Stage 3: Over-Sampling (SMOTE) and Balancing Ensembles techniques

  • LogisticRegression: completed and posted on Wednesday 20 March 2019
  • ExtraTreesClassifier: completed and posted on Friday 22 March 2019
  • RandomForestClassifier: planned for Monday 25 March 2019
  • GradientBoostingClassifier: planned for Wednesday 27 March 2019
  • Balanced Bagging: planned for Friday 29 March 2019
  • Balanced Boosting: planned for Sunday 31 March 2019
  • Balanced Random Forest: planned for Monday 1 April 2019

I post all Python scripts here on GitHub. The final submission deadline is 10 April 2019.

Feel free to take a look at the scripts and experiment. Who knows, you might have something you can turn in by the time April comes around. Happy learning and good luck!