Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Human Activities with Smartphone Dataset is a multi-class classification situation where we are trying to predict one of the six possible outcomes.

INTRODUCTION: Researchers collected the dataset from experiments in which 30 volunteers each performed six activities while wearing a smartphone on the waist. Using the phone's embedded accelerometer and gyroscope, the researchers captured measurements for the activities of WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, and LAYING. The dataset has been randomly partitioned into two sets, where 70% of the volunteers were selected for generating the training data and 30% for the test data.
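As an illustration of what a volunteer-level 70/30 partition looks like, here is a minimal sketch. The dataset already ships pre-partitioned, so this is not how the files were produced; the volunteer numbering (1 to 30) and the seed are assumptions chosen only for the example.

```python
import random

# Assumption: volunteers are identified by the integers 1..30.
volunteer_ids = list(range(1, 31))

# Hypothetical seed, used only to make the illustration repeatable.
rng = random.Random(42)
shuffled = volunteer_ids[:]
rng.shuffle(shuffled)

# Assign whole volunteers (not individual rows) to each partition,
# so no person contributes records to both sets.
n_train = round(0.7 * len(shuffled))
train_ids = set(shuffled[:n_train])
test_ids = set(shuffled[n_train:])

print(len(train_ids), "training volunteers;", len(test_ids), "test volunteers")
```

Splitting by volunteer rather than by row keeps one person's movement patterns from appearing in both the training and test data.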

In iteration Take1, the script focused on evaluating various machine learning algorithms and identifying the algorithm that produces the best accuracy metric. Iteration Take1 established a baseline performance in terms of accuracy and processing time.

In iteration Take2, we examined the feasibility of using dimensionality reduction techniques to reduce the processing time while still maintaining an adequate level of prediction accuracy. The technique we explored was to eliminate collinear attributes based on a correlation threshold of 85%.
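The collinearity filter can be sketched in a few lines of Python. This is a simplified greedy version, not necessarily the procedure used in Take2: it keeps a column only if its absolute Pearson correlation with every already-kept column is at or below 0.85. The column names and values are made up for the example.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def drop_collinear(columns, threshold=0.85):
    """Greedy filter: return the names of the columns that are kept."""
    kept = []
    for name, values in columns.items():
        # Keep the column only if it is not too correlated with any kept column.
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy data standing in for the sensor attributes.
columns = {
    "acc_x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "acc_x_scaled": [2.1, 3.9, 6.2, 8.0, 9.9],  # nearly a multiple of acc_x
    "gyro_z": [3.0, -1.0, 2.0, -2.0, 1.0],
}
kept_columns = drop_collinear(columns)
print(kept_columns)
```

The result of a greedy filter depends on column order, so other implementations may drop a different member of each correlated pair.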

In iteration Take3, we explored the dimensionality reduction technique of ranking the importance of the attributes with a gradient boosting tree method. Afterward, we eliminated the features that were not needed to reach a cumulative importance of 0.99.
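The cumulative-importance cutoff can be illustrated with a short sketch. It assumes the importance scores have already been produced by some tree-based model; the feature names and scores below are hypothetical. Features are ranked by importance and kept until the running normalized total reaches 0.99.

```python
def select_by_cumulative_importance(importances, cutoff=0.99):
    """Keep the top-ranked features needed to reach the cumulative cutoff."""
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    kept, running = [], 0.0
    for name, score in ranked:
        if running >= cutoff:
            break  # the remaining features are not needed to reach the cutoff
        kept.append(name)
        running += score / total  # normalize so the scores sum to 1
    return kept

# Hypothetical importance scores, as a gradient boosting model might report.
importances = {"f1": 0.60, "f2": 0.30, "f3": 0.095, "f4": 0.004, "f5": 0.001}
selected = select_by_cumulative_importance(importances)
print(selected)
```

In this toy case the two lowest-ranked features together contribute only 0.005 of the total importance, so they are dropped.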

For this iteration (Take4), we will explore the Recursive Feature Elimination (or RFE) technique by recursively removing attributes and building a model on those attributes that remain. To keep the training time manageable, we will limit the number of attributes to 50.
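For readers who want to see the mechanics of RFE, here is a minimal sketch using scikit-learn. It is not the script used in this project: the synthetic data, the logistic regression ranking model, and the smaller feature counts (40 reduced to 10, rather than 561 reduced to 50) are all assumptions chosen so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real dataset (assumption: far fewer rows and columns).
X, y = make_classification(
    n_samples=300, n_features=40, n_informative=8,
    n_classes=3, n_clusters_per_class=1, random_state=7,
)

# Hypothetical ranking model; any estimator exposing coefficients or
# feature importances could be used in its place.
estimator = LogisticRegression(max_iter=1000)

# Refit the model repeatedly, dropping the 5 weakest attributes per round,
# until only 10 attributes remain.
selector = RFE(estimator, n_features_to_select=10, step=5)
selector.fit(X, y)

X_reduced = selector.transform(X)
kept_indices = [i for i, keep in enumerate(selector.support_) if keep]
print(X_reduced.shape, kept_indices)
```

A larger `step` removes more attributes per round and shortens the run, at the cost of a coarser ranking.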

CONCLUSION: From the previous iteration Take1, the baseline performance of the ten algorithms achieved an average accuracy of 84.68%. Three algorithms (Linear Discriminant Analysis, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Linear Discriminant Analysis turned in the top result using the training data. It achieved an average accuracy of 95.43%. Using the optimized tuning parameter available, the Linear Discriminant Analysis algorithm processed the validation dataset with an accuracy of 96.23%, which was even better than the accuracy from the training data.

From the previous iteration Take2, the baseline performance of the ten algorithms achieved an average accuracy of 83.54%. Three algorithms (Linear Discriminant Analysis, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Support Vector Machine turned in the top result using the training data. It achieved an average accuracy of 93.34%. Using the optimized tuning parameter available, the Support Vector Machine algorithm processed the validation dataset with an accuracy of 93.82%, which was slightly better than the accuracy from the training data.

From the previous iteration Take3, the baseline performance of the ten algorithms achieved an average accuracy of 85.49%. Three algorithms (Linear Discriminant Analysis, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Linear Discriminant Analysis turned in the top result using the training data. It achieved an average accuracy of 95.52%. Using the optimized tuning parameter available, the Linear Discriminant Analysis algorithm processed the validation dataset with an accuracy of 96.06%, which was slightly better than the accuracy from the training data.

From the current iteration, the baseline performance of the ten algorithms achieved an average accuracy of 86.76%. Three algorithms (Linear Discriminant Analysis, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Support Vector Machine turned in the top result using the training data. It achieved an average accuracy of 95.83%. Using the optimized tuning parameter available, the Support Vector Machine algorithm processed the validation dataset with an accuracy of 94.19%, which was slightly below the accuracy from the training data.

From the model-building activities, the number of attributes went from 561 down to 50 after eliminating 511 variables that fell below the required importance. The processing time went from 8 hours 16 minutes in iteration Take1 down to 1 hour 16 minutes in iteration Take4. That was a modest further reduction compared to Take2, which had reduced the processing time to 2 hours 7 minutes. It was also a noticeable reduction compared to Take3, which had reduced the processing time only to 8 hours 9 minutes.
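To put the processing times on a common scale, the following sketch converts the figures reported above to minutes and computes each iteration's reduction relative to Take1. The times and attribute counts come from this write-up; the percentage comparison itself is an added illustration.

```python
# Processing times reported above, converted to minutes.
minutes = {
    "Take1": 8 * 60 + 16,
    "Take2": 2 * 60 + 7,
    "Take3": 8 * 60 + 9,
    "Take4": 1 * 60 + 16,
}

# Percentage reduction of each iteration relative to the Take1 baseline.
baseline = minutes["Take1"]
reduction_pct = {
    name: round(100 * (baseline - value) / baseline, 1)
    for name, value in minutes.items()
}

# Number of attributes removed in this iteration.
attributes_removed = 561 - 50

for name, pct in reduction_pct.items():
    print(f"{name}: {minutes[name]} min, {pct}% below Take1")
print("attributes removed:", attributes_removed)
```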

In conclusion, the importance ranking technique should have benefited the tree methods the most, but the Linear Discriminant Analysis algorithm held its own for this modeling iteration. Furthermore, by reducing the collinearity, the modeling took a much shorter time to process yet still retained decent accuracy. For this dataset, the Linear Discriminant Analysis and Support Vector Machine algorithms should be considered for further modeling or production use.

Dataset Used: Human Activity Recognition Using Smartphone Data Set

Dataset ML Model: Multi-class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

One potential source of performance benchmarks: https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones

The HTML formatted report can be found here on GitHub.