Binary Classification Model for MiniBooNE Particle Identification Using R Take 4

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The MiniBooNE Particle Identification dataset is a classic binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: This dataset is taken from the MiniBooNE experiment and is used to distinguish electron neutrinos (signal) from muon neutrinos (background). The data file is organized as follows: the first line contains the number of signal events followed by the number of background events. The records for the signal events come first, followed by the background events. Each line after the first holds the 50 particle ID variables for one event.
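Below is a minimal sketch of how a data file with that layout might be loaded in R. The file name MiniBooNE_PID.txt and the x1..x50 column names are assumptions for illustration, not taken from the original script.

    # First line holds two integers: the signal count and the background count.
    counts <- scan("MiniBooNE_PID.txt", what = integer(), nlines = 1)

    # Remaining lines hold the 50 particle ID variables, signal events first.
    entireDataset <- read.table("MiniBooNE_PID.txt", skip = 1,
                                col.names = paste0("x", 1:50))

    # Label each row using the counts from the first line.
    entireDataset$targetVar <- factor(c(rep("signal", counts[1]),
                                        rep("background", counts[2])))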

From the previous iteration Take1, the script focused on evaluating various machine learning algorithms and identifying the model that produced the best overall metrics. Iteration Take1 established the baseline performance for accuracy and processing time.
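As an illustration of that baseline step, a spot check of candidate algorithms could look like the sketch below, using the caret package. The cross-validation setup, the seed, and the two methods shown are assumptions; Take1 evaluated eight algorithms in total.

    library(caret)

    # Repeatable 10-fold cross-validation for every candidate algorithm.
    control <- trainControl(method = "cv", number = 10)

    set.seed(888)
    fitCART <- train(targetVar ~ ., data = entireDataset, method = "treebag",
                     metric = "Accuracy", trControl = control)

    set.seed(888)
    fitRF <- train(targetVar ~ ., data = entireDataset, method = "rf",
                   metric = "Accuracy", trControl = control)

    # Compare the resampled accuracy of the candidate models.
    results <- resamples(list(BaggedCART = fitCART, RandomForest = fitRF))
    summary(results)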

From the previous iteration Take2, we examined the feature selection technique of eliminating collinear features. By eliminating the collinear features, we hoped to decrease the processing time and maintain a similar level of accuracy compared to iteration Take1.
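A minimal sketch of that collinearity screen, assuming caret's findCorrelation function and a 0.75 correlation cutoff (the cutoff value is an assumption):

    library(caret)

    # Pairwise correlation matrix over the 50 numeric predictors.
    corMatrix <- cor(entireDataset[, 1:50])

    # Indices of columns whose pairwise correlation exceeds the cutoff.
    highlyCorrelated <- findCorrelation(corMatrix, cutoff = 0.75)

    # Drop the collinear columns before modeling.
    reducedDataset <- entireDataset[, -highlyCorrelated]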

From the previous iteration Take3, we examined the feature selection technique of attribute importance ranking. By taking only the most important attributes, we hoped to decrease the processing time and maintain a similar level of accuracy compared to iterations Take1 and Take2.
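That ranking step might be sketched with caret's varImp, applied here to the fitRF model from the baseline sketch above (the choice of model used for the ranking is an assumption):

    library(caret)

    # Rank the 50 attributes by their importance to the fitted model.
    importance <- varImp(fitRF, scale = TRUE)
    print(importance)
    plot(importance)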

In the current iteration Take4, we will explore the Recursive Feature Elimination (RFE) technique, which recursively removes attributes and builds a model on those attributes that remain.
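A minimal sketch of RFE with caret follows; the random-forest ranking functions (rfFuncs), the fold count, and the candidate subset sizes are assumptions:

    library(caret)

    # Cross-validated RFE driven by random-forest importance rankings.
    rfeCtrl <- rfeControl(functions = rfFuncs, method = "cv", number = 10)

    set.seed(888)
    rfeResults <- rfe(entireDataset[, 1:50], entireDataset$targetVar,
                      sizes = c(10, 20, 30, 40), rfeControl = rfeCtrl)

    # The subset of attributes retained for the Take4 models.
    print(predictors(rfeResults))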

ANALYSIS: From the previous iteration Take1, the baseline performance of the eight algorithms achieved an average accuracy of 90.82%. Two algorithms (Bagged CART and Random Forest) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Random Forest turned in the top result on the training data, with an average accuracy of 93.74%. With the optimized tuning parameters, the Random Forest algorithm processed the testing dataset with an accuracy of 93.91%, which was even better than its accuracy on the training data.

From the previous iteration Take2, the baseline performance of the machine learning algorithms achieved an average accuracy of 90.04%. Two algorithms (Stochastic Gradient Boosting and Random Forest) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top overall result and achieved an accuracy metric of 93.47%. Using the optimized parameters, the Stochastic Gradient Boosting algorithm processed the testing dataset with an accuracy of 93.57%, which was even better than its accuracy on the training data.

From the previous iteration Take3, the baseline performance of the machine learning algorithms achieved an average accuracy of 90.49%. Two algorithms (Stochastic Gradient Boosting and Random Forest) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result and achieved an accuracy metric of 93.52%. Using the optimized parameters, the Random Forest algorithm processed the testing dataset with an accuracy of 93.74%, which was even better than its accuracy on the training data.

In the current iteration Take4, the baseline performance of the machine learning algorithms achieved an average accuracy of 90.79%. Two algorithms (Stochastic Gradient Boosting and Random Forest) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result and achieved an accuracy metric of 93.??%. Using the optimized parameters, the Random Forest algorithm processed the testing dataset with an accuracy of 93.??%, which was even better than its accuracy on the training data.

From the model-building perspective, the number of attributes decreased by 10, from 50 down to 40, in iteration Take4. The processing time went from 17 hours 18 minutes in iteration Take1 up to 30 hours 58 minutes in Take4, an increase of roughly 79% over Take1. It was an even more significant increase in comparison to Take2 (12 hours 17 minutes) and Take3 (6 hours 48 minutes).

CONCLUSION: For this iteration, the Random Forest algorithm achieved the best overall results, albeit with an increased processing time, after applying Recursive Feature Elimination. For this dataset, the Stochastic Gradient Boosting and Random Forest algorithms should be considered for further modeling or production use.

Dataset Used: MiniBooNE particle identification Data Set

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/MiniBooNE+particle+identification

The HTML-formatted report can be found here on GitHub.