If, like me, you are new to machine learning in Python, you might find the current Kaggle competition “Santander Customer Transaction Prediction” interesting.
The competition is essentially a binary classification problem with a decently large dataset (200 attributes and 200,000 rows of training data). I have not participated in a Kaggle competition before and will use this one to get some learning under my belt.
I plan to run the training data through a list of machine learning algorithms (see below) and iterate them through three stages. This blog post will serve as the meta post that summarizes the progress.
The current plan with the milestones is as follows:
Stage 1: Gather the Baseline Performance.
- LogisticRegression: completed and posted on Monday 25 February 2019
- DecisionTreeClassifier: completed and posted on Wednesday 27 February 2019
- KNeighborsClassifier: completed and posted on Friday 1 March 2019
- BaggingClassifier: completed and posted on Sunday 3 March 2019
- RandomForestClassifier: completed and posted on Monday 4 March 2019
- ExtraTreesClassifier: completed and posted on Wednesday 6 March 2019
- GradientBoostingClassifier: completed and posted on Friday 8 March 2019
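To give a feel for what a Stage 1 baseline pass over the algorithms listed above might look like, here is a minimal sketch. It is not the actual script from the posts: the synthetic data from make_classification stands in for the competition files, and the ROC AUC metric, the 3-fold cross-validation, the class imbalance, and the small n_estimators values are all assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: small synthetic, imbalanced data in place of the real training set.
X, y = make_classification(
    n_samples=500, n_features=20, weights=[0.9, 0.1], random_state=7
)

# The seven Stage 1 algorithms, mostly with default settings.
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=7),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "BaggingClassifier": BaggingClassifier(random_state=7),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=25, random_state=7),
    "ExtraTreesClassifier": ExtraTreesClassifier(n_estimators=25, random_state=7),
    "GradientBoostingClassifier": GradientBoostingClassifier(n_estimators=25, random_state=7),
}

# Score every model the same way so the baselines are comparable.
baseline_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=3, scoring="roc_auc")
    baseline_scores[name] = scores.mean()
    print(f"{name}: mean ROC AUC = {scores.mean():.3f}")
```

Keeping the models in one dictionary and scoring them in one loop makes it easy to rerun the same comparison in the later stages.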
Stage 2: Feature Selection using the Attribute Importance Ranking technique
- LogisticRegression: planned for Monday 11 March 2019
- BaggingClassifier: planned for Wednesday 13 March 2019
- RandomForestClassifier: planned for Friday 15 March 2019
- ExtraTreesClassifier: planned for Sunday 17 March 2019
- GradientBoostingClassifier: planned for Monday 18 March 2019
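As a rough illustration of the Stage 2 idea, the sketch below ranks attributes by importance and keeps only the top-ranked ones before refitting a model. Several things here are assumptions for the sake of the example: the synthetic data, the use of ExtraTreesClassifier as the ranking model, the hypothetical cut-off top_k, and ROC AUC as the score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumption: synthetic data where only a few attributes carry signal.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=7
)

# Fit a tree ensemble and rank attributes from most to least important.
ranker = ExtraTreesClassifier(n_estimators=50, random_state=7).fit(X, y)
ranking = np.argsort(ranker.feature_importances_)[::-1]

# Hypothetical cut-off: keep the 8 highest-ranked attributes.
top_k = 8
selected = ranking[:top_k]
X_reduced = X[:, selected]

# Re-evaluate a model on the reduced attribute set.
score = cross_val_score(
    LogisticRegression(max_iter=1000), X_reduced, y, cv=3, scoring="roc_auc"
).mean()
print(f"Kept attributes {sorted(selected.tolist())}, mean ROC AUC = {score:.3f}")
```

Note that ranking on the full data before cross-validating is a shortcut. A stricter setup would compute the ranking inside each training fold so the held-out rows never influence the selection.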
Stage 3: Over-Sampling and Balancing Ensembles techniques (TBD)
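Since Stage 3 is still to be decided, the following is only a sketch of one simple over-sampling approach, random over-sampling of the minority class with NumPy. The toy data, the 90/10 class split, and the choice of this particular technique are all assumptions, not the method the posts will necessarily use.

```python
import numpy as np

rng = np.random.default_rng(7)

# Assumption: toy imbalanced data with 90 negative and 10 positive rows.
y = np.array([0] * 90 + [1] * 10)
X = rng.normal(size=(100, 5))

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Draw minority rows with replacement until both classes are the same size.
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

# Combine and shuffle so the duplicated rows are not grouped together.
balanced_idx = np.concatenate([majority_idx, minority_idx, extra_idx])
rng.shuffle(balanced_idx)
X_balanced, y_balanced = X[balanced_idx], y[balanced_idx]

print("Class counts after over-sampling:", np.bincount(y_balanced))
```

Over-sampling like this should be applied to the training portion only. Duplicating rows before splitting would let copies of the same row appear in both the training and validation data.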
I will post all Python scripts in a folder on GitHub. The final submission deadline is 10 April 2019.
Feel free to take a look at the scripts and experiment. Who knows, you might have something you can turn in by the time April comes around. Happy learning and good luck!