Binary Classification Model for Caravan Insurance Marketing Using Python Take 2

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Insurance Company Benchmark dataset is a classic binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: This dataset, which was used in the CoIL 2000 Challenge, contains information on customers of an insurance company. The data consist of 86 variables and include product usage data and socio-demographic data derived from zip codes.

The data were supplied by the Dutch data mining company Sentient Machine Research and are based on a real-world business problem. The training set contains over 5000 descriptions of customers, including whether they hold a caravan insurance policy. A test dataset contains another 4000 customers whose information will be used to test the effectiveness of the machine learning models.

The insurance organization collected the data to answer the following question: Can we predict who would be interested in buying a caravan insurance policy and give an explanation why?

In iteration Take1, the algorithms achieved high accuracy but were strongly biased because the dataset is imbalanced. For this iteration, we will examine the feasibility of using SMOTE (Synthetic Minority Over-sampling Technique) to balance the dataset.
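To make the balancing idea concrete, here is a minimal from-scratch sketch of the core SMOTE step: each synthetic point is placed at a random position on the line segment between a minority-class sample and one of its nearest minority-class neighbors. This is not the code used in the project (which more likely relied on a library such as imbalanced-learn); the function name `smote_sketch`, the toy points, and the class counts are all assumptions made for illustration.

```python
import random
from math import dist


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    randomly chosen minority sample and one of its k nearest minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbors of `base`, excluding itself.
        neighbors = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbors)]
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical toy data: 4 minority samples against 20 majority samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority_count = 20

new_points = smote_sketch(minority, majority_count - len(minority), k=2)
balanced_minority = minority + new_points
print(len(balanced_minority), majority_count)
```

Because the synthetic points are interpolated, this plain form of SMOTE is most meaningful for numeric attributes; for a dataset that mixes numerical and categorical attributes, a categorical-aware variant may be a better fit.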

ANALYSIS: From the previous Take1 iteration, the baseline performance of the ten algorithms achieved an average F1_Micro score of 0.9260. Two algorithms, Logistic Regression and Support Vector Machine, achieved the top F1_Micro scores after the first round of modeling. After a series of tuning trials, Support Vector Machine turned in the top result using the training data. It achieved an F1_Micro score of 0.9402. After using the optimized tuning parameters, the Support Vector Machine algorithm processed the validation dataset with an F1_Micro score of 0.9405, which was slightly better than using the training data.
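A short sketch may help explain why a high F1_Micro score can coexist with a biased model. When every sample receives exactly one label, micro-averaged F1 pools true positives, false positives, and false negatives over both classes and works out to plain accuracy. The label counts below (94 negatives, 6 positives) are hypothetical and chosen only to resemble a heavily imbalanced dataset; `micro_f1` is an illustrative helper, not a library function.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, and FN over all classes, then compute F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label and truth != label:
                fp += 1
            elif pred != label and truth == label:
                fn += 1
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical imbalanced labels: 94 non-buyers (0) and 6 buyers (1).
y_true = [0] * 94 + [1] * 6
# A degenerate "model" that always predicts the majority class.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minority_recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / y_true.count(1)

print(f"micro F1 = {micro_f1(y_true, y_pred):.4f}")
print(f"accuracy = {accuracy:.4f}")
print(f"minority recall = {minority_recall:.4f}")
```

In this toy case the always-majority predictor scores well on F1_Micro while identifying none of the minority class, so a per-class metric such as minority recall is a useful companion when reading the scores above.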

From the current iteration, the baseline performance of the eight algorithms achieved an average F1_Micro score of 0.9326. Two algorithms, Random Forest and Extra Trees, achieved the top F1_Micro scores after the first round of modeling. After a series of tuning trials, Random Forest turned in the top result using the training data. It achieved an F1_Micro score of 0.9595. After using the optimized tuning parameters, the Random Forest algorithm processed the validation dataset with an F1_Micro score of 0.9165, noticeably worse than the training result of 0.9595, which may indicate overfitting.
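The gap between training and validation scores in the two iterations can be compared with a few lines of arithmetic. The scores are the ones reported above; the dictionary layout and key names are only an illustrative choice.

```python
# F1_Micro scores reported for the tuned model in each iteration.
scores = {
    "Take1 Support Vector Machine": {"train": 0.9402, "validation": 0.9405},
    "Take2 Random Forest": {"train": 0.9595, "validation": 0.9165},
}

# A negative gap means the model did worse on the validation data than on the training data.
gaps = {
    name: round(s["validation"] - s["train"], 4)
    for name, s in scores.items()
}

for name, gap in gaps.items():
    print(f"{name}: validation - train = {gap:+.4f}")
```

A gap close to zero, as in Take1, suggests the tuned model generalized about as well as it trained, while the larger negative gap in Take2 is the pattern usually associated with overfitting.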

CONCLUSION: For this iteration, the SMOTE technique reduced the class imbalance in the dataset but did not improve the final performance metric. Overall, the Random Forest algorithm achieved the leading F1_Micro score on the training dataset, but the model failed to perform adequately on the validation dataset. For this dataset, Random Forest should still be considered for further modeling and testing before it is made available for production use.

Dataset Used: Insurance Company Benchmark (COIL 2000) Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Insurance+Company+Benchmark+(COIL+2000)

One potential source of performance benchmark: https://www.kaggle.com/uciml/caravan-insurance-challenge

The HTML-formatted report is available on GitHub.