Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Insurance Company Benchmark dataset is a classic binary classification situation where we are trying to predict one of the two possible outcomes.
INTRODUCTION: This data set was used in the CoIL 2000 Challenge that contains information on customers of an insurance company. The data consist of 86 variables and include product usage data and socio-demographic data derived from zip codes.
The data was supplied by the Dutch data mining company Sentient Machine Research and is based on a real-world business problem. The training set contains over 5000 descriptions of customers, including the information of whether they have a caravan insurance policy. A test dataset contains another 4000 customers whose information will be used to test the effectiveness of the machine learning models.
The insurance organization collected the data to answer the following question: Can we predict who would be interested in buying a caravan insurance policy and give an explanation why?
ANALYSIS: The baseline performance of the ten algorithms achieved an average F1_Micro score of 0.9260. Two algorithms, Logistic Regression and Support Vector Machine, achieved the top F1_Micro scores after the first round of modeling. After a series of tuning trials, Support Vector Machine turned in the top result using the training data. It achieved an F1_Micro score of 0.9402. After using the optimized tuning parameters, the Support Vector Machine algorithm processed the validation dataset with an F1_Micro score of 0.9405, which was slightly better than using the training data.
CONCLUSION: For this iteration, the Support Vector Machine algorithm achieved the leading F1_Micro scores using the training and validation datasets. For this dataset, Support Vector Machine should be considered for further modeling or production use.
Dataset Used: Insurance Company Benchmark (COIL 2000) Data Set
Dataset ML Model: Binary classification with numerical and categorical attributes
One potential source of performance benchmark: https://www.kaggle.com/uciml/caravan-insurance-challenge
The HTML formatted report can be found here on GitHub.