Updated Machine Learning Templates for R

As I work on practicing and solving machine learning (ML) problems, I find myself repeating the same set of steps and activities.

Thanks to Dr. Jason Brownlee’s suggestions on creating a machine learning template, I have pulled together a set of project templates that can be used to support regression ML problems using R.

Version 5 of the templates contains several minor adjustments and corrections to address discrepancies in the previous versions of the template.

The new templates also standardize the dataframes used in the scripts as follows:

originalDataset: This dataframe contains the original data imported from the data source.

xy_train: Training dataframe that has the attributes and the target/class variable.

x_train: Training dataframe that has the attributes only.

y_train: Training dataframe that has the target/class variable only.

xy_test: Test dataframe that has the attributes and the target/class variable.

x_test: Test dataframe that has the attributes only.

y_test: Test dataframe that has the target/class variable only.
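To make the convention concrete, here is a minimal sketch in Python with pandas (the R templates follow the same naming; the toy data and 70/30 split point below are hypothetical):

```python
import pandas as pd

# Hypothetical toy data standing in for the imported source file.
originalDataset = pd.DataFrame({
    "attr1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "attr2": [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
    "target": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# A simple 70/30 split, mirroring the template's dataframe names.
split = int(len(originalDataset) * 0.7)
xy_train = originalDataset.iloc[:split]        # attributes + target
xy_test = originalDataset.iloc[split:]         # attributes + target

x_train = xy_train.drop(columns=["target"])    # attributes only
y_train = xy_train["target"]                   # target only
x_test = xy_test.drop(columns=["target"])
y_test = xy_test["target"]
```

In the actual templates, the split would normally be randomized and stratified; the sequential slice here just keeps the example short.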

You can find the R templates on the Machine Learning Project Templates page.

Honest Signals

In his podcast, Akimbo [https://www.akimbo.me/], Seth Godin teaches us how to adopt a posture of possibility, change the culture, and choose to make a difference. Here are my takeaways from the episode.

Through the process of evolution, animals have developed various traits that serve as signals to others. Some signals are useful for fending off predators, and some are useful for mating and reproduction. We humans, through the conditioning of culture, also send many signals to other humans.

Cal Newport has written about how ineffective open offices are. For many types of work that require highly concentrated effort, open offices do not make sense. So why build one? Newport argued that it is a signaling strategy to investors and people we are going to hire next. The signal is that we are so smart, so productive, and making a ruckus, we can afford to cram people into this bullpen, so do you want to join us? Even though it is less productive, it appeals to a certain kind of employee and a certain kind of investor.

The first consideration for a signaling strategy is whether the signals are honest or dishonest. We have a choice when we send signals to the world, and sending dishonest signals is risky. If we are bluffing and it fails to work, we lose credibility, which is very difficult to undo these days. When we go into the marketplace, we must decide whether we have built up enough of a resource, skill, capital, or reputation to send honest signals. Will we choose to invest or to bluff?

The second consideration is that many people, particularly people without a lot of experience, don’t know the signals they are supposed to send. As a result, they have talent that does not get found. If you have talent, passion, and skill, signals matter. Figure out what the signals are and over-invest in them. When we over-invest in our signals, they are more likely to work.

If we are looking for a resource, the obvious way to get a bigger return on investment is to ignore the expensive signals our competitors look for and to find other signals. This is the theory behind Michael Lewis’s Moneyball. Billy Beane came up with a new statistical way to find talent, and he was able to get players who were way cheaper. He was looking for signals that correlated with their real skills. The insightful thing to do is to look at which signals we think matter and get to those who are signaling the ones that do matter.

Thanks to the Internet, it is easier than ever to figure out which signals the masses are looking at and to fake them. We harm ourselves all the time in the world of social media because people are intentionally or artificially boosting the signals. Once we realize that a signal has been corrupted, we have a choice. We can embrace the fact that other people are still looking at the old, corrupted signal. Or we can walk away and invent new, more honest signals that we want to live with. At times, we may need to do both. As diversity creates more value, we are going to have to discard many of the old signals and embrace new ones, ones that are more relevant and useful going forward.

Multi-Class Classification Model for Human Activity Recognition with Smartphone Using R Take 4

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Human Activities with Smartphone Dataset is a multi-class classification situation where we are trying to predict one of the six possible outcomes.

INTRODUCTION: Researchers collected the datasets from experiments involving a group of 30 volunteers, each of whom performed six activities while wearing a smartphone on the waist. Using the smartphone’s embedded accelerometer and gyroscope, the researchers captured measurements for the activities of WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, and LAYING. The dataset has been randomly partitioned into two sets, where 70% of the volunteers were selected to generate the training data and 30% the test data.

In iteration Take1, the script focused on evaluating various machine learning algorithms and identifying the algorithm that produces the best accuracy metric. Iteration Take1 established a baseline performance regarding accuracy and processing time.

In iteration Take2, we examined the feasibility of using dimensionality reduction techniques to reduce the processing time while still maintaining an adequate level of prediction accuracy. The first technique we explored was to eliminate collinear attributes based on a threshold of 85%.

In iteration Take3, we explored the dimensionality reduction technique of ranking the importance of the attributes with a gradient boosting tree method. Afterward, we eliminated the features that do not contribute to a cumulative importance of 0.99.

For this iteration, we will explore the Recursive Feature Elimination (or RFE) technique by recursively removing attributes and building a model on those attributes that remain. To keep the training time manageable, we will limit the number of attributes to 50.
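The core loop of RFE can be sketched in plain Python. The ranking below scores each attribute by its absolute correlation with the target, which is only a hypothetical stand-in for the model-based ranking that an actual RFE implementation (such as caret’s rfe() function) performs:

```python
def rfe(x_rows, y, n_keep):
    """Recursively drop the weakest-ranked attribute until n_keep remain."""
    def corr(a, b):
        # Pearson correlation between two equal-length numeric lists.
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
        sa = sum((ai - ma) ** 2 for ai in a) ** 0.5
        sb = sum((bi - mb) ** 2 for bi in b) ** 0.5
        return cov / (sa * sb) if sa and sb else 0.0

    keep = list(range(len(x_rows[0])))
    while len(keep) > n_keep:
        scores = {j: abs(corr([row[j] for row in x_rows], y)) for j in keep}
        keep.remove(min(scores, key=scores.get))  # eliminate the weakest
    return keep

# Toy data: attribute 0 is a copy of the target; the other two are not.
y = [0, 1] * 10
x_rows = [[y[i], i, i * i] for i in range(20)]
print(rfe(x_rows, y, n_keep=1))  # keeps only the informative attribute: [0]
```

In practice the ranking is refit after every elimination with the actual learner, which is why RFE is so much more expensive than a one-shot filter.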

CONCLUSION: From the previous iteration Take1, the baseline performance of the ten algorithms achieved an average accuracy of 91.67%. Three algorithms (Linear Discriminant Analysis, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 98.84%. Stochastic Gradient Boosting also processed the validation dataset with an accuracy of 95.49%, which was slightly below the accuracy from the training data.

From the previous iteration Take2, the baseline performance of the ten algorithms achieved an average accuracy of 90.83%. Three algorithms (Random Forest, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 98.07%. Using the optimized tuning parameter available, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 93.96%, which was slightly worse than the accuracy from the training data and possibly due to over-fitting.

From the previous iteration Take3, the baseline performance of the ten algorithms achieved an average accuracy of 91.59%. The Random Forest and Stochastic Gradient Boosting algorithms achieved the top two accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 98.74%. Using the optimized tuning parameter available, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 93.42%. The accuracy on the validation dataset was slightly worse than the training data and possibly due to over-fitting.

From the current iteration, the baseline performance of the ten algorithms achieved an average accuracy of 90.62%. Three algorithms (Bagged CART, Random Forest, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Random Forest turned in the top result using the training data. It achieved an average accuracy of 97.75%. Using the optimized tuning parameter available, the Random Forest algorithm processed the validation dataset with an accuracy of 87.21%. The accuracy on the validation dataset was noticeably worse than the training data and possibly due to over-fitting.

From the model-building activities, the number of attributes went from 561 down to 41 after eliminating 520 variables. The processing time went from 8 hours 16 minutes in iteration Take1 down to 2 hours and 25 minutes in iteration Take4. That was a noticeable reduction in comparison to Take2, which reduced the processing time down to 7 hours 15 minutes. It also was a noticeable reduction in comparison to Take3, which reduced the processing time down to 5 hours 22 minutes.

In conclusion, the Recursive Feature Elimination technique helped by cutting down the number of attributes and reducing the training time. Furthermore, the modeling took a much shorter time to process yet still retained decent accuracy. For this dataset, the Random Forest algorithm with the RFE technique should be considered for further modeling or production use.

Dataset Used: Human Activity Recognition Using Smartphone Data Set

Dataset ML Model: Multi-class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

One potential source of performance benchmarks: https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones

The HTML formatted report can be found here on GitHub.

Multi-Class Classification Model for Human Activity Recognition with Smartphone Using Python Take 4

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Human Activities with Smartphone Dataset is a multi-class classification situation where we are trying to predict one of the six possible outcomes.

INTRODUCTION: Researchers collected the datasets from experiments involving a group of 30 volunteers, each of whom performed six activities while wearing a smartphone on the waist. Using the smartphone’s embedded accelerometer and gyroscope, the researchers captured measurements for the activities of WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, and LAYING. The dataset has been randomly partitioned into two sets, where 70% of the volunteers were selected to generate the training data and 30% the test data.

In iteration Take1, the script focused on evaluating various machine learning algorithms and identifying the algorithm that produces the best accuracy metric. Iteration Take1 established a baseline performance in terms of accuracy and processing time.

In iteration Take2, we examined the feasibility of using dimensionality reduction techniques to reduce the processing time while still maintaining an adequate level of prediction accuracy. The first technique we explored was to eliminate collinear attributes based on a threshold of 85%.

In iteration Take3, we explored the dimensionality reduction technique of ranking the importance of the attributes with a gradient boosting tree method. Afterward, we eliminated the features that do not contribute to cumulative importance of 0.99.

For this iteration, we will explore the Recursive Feature Elimination (or RFE) technique by recursively removing attributes and building a model on those attributes that remain. To keep the training time manageable, we will limit the number of attributes to 50.

CONCLUSION: From the previous iteration Take1, the baseline performance of the ten algorithms achieved an average accuracy of 84.68%. Three algorithms (Linear Discriminant Analysis, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Linear Discriminant Analysis turned in the top result using the training data. It achieved an average accuracy of 95.43%. Using the optimized tuning parameter available, the Linear Discriminant Analysis algorithm processed the validation dataset with an accuracy of 96.23%, which was even better than the accuracy from the training data.

From the previous iteration Take2, the baseline performance of the ten algorithms achieved an average accuracy of 83.54%. Three algorithms (Linear Discriminant Analysis, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Support Vector Machine turned in the top result using the training data. It achieved an average accuracy of 93.34%. Using the optimized tuning parameter available, the Support Vector Machine algorithm processed the validation dataset with an accuracy of 93.82%, which was slightly better than the accuracy from the training data.

From the previous iteration Take3, the baseline performance of the ten algorithms achieved an average accuracy of 85.49%. Three algorithms (Linear Discriminant Analysis, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Linear Discriminant Analysis turned in the top result using the training data. It achieved an average accuracy of 95.52%. Using the optimized tuning parameter available, the Linear Discriminant Analysis algorithm processed the validation dataset with an accuracy of 96.06%, which was slightly better than the accuracy from the training data.

From the current iteration, the baseline performance of the ten algorithms achieved an average accuracy of 86.76%. Three algorithms (Linear Discriminant Analysis, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Support Vector Machine turned in the top result using the training data. It achieved an average accuracy of 95.83%. Using the optimized tuning parameter available, the Support Vector Machine algorithm processed the validation dataset with an accuracy of 94.19%, which was slightly below the accuracy from the training data.

From the model-building activities, the number of attributes went from 561 down to 50 after eliminating 511 variables. The processing time went from 8 hours 16 minutes in iteration Take1 down to 1 hour and 16 minutes in iteration Take4. That was a minor reduction in comparison to Take2, which reduced the processing time down to 2 hours 7 minutes. It also was a noticeable reduction in comparison to Take3, which reduced the processing time down to 8 hours and 9 minutes.

In conclusion, the RFE technique should have benefited the tree methods the most, but the Support Vector Machine algorithm held its own for this modeling iteration. Furthermore, by reducing the number of attributes, the modeling took a much shorter time to process yet still retained decent accuracy. For this dataset, the Linear Discriminant Analysis and Support Vector Machine algorithms should be considered for further modeling or production use.

Dataset Used: Human Activity Recognition Using Smartphone Data Set

Dataset ML Model: Multi-class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

One potential source of performance benchmarks: https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones

The HTML formatted report can be found here on GitHub.

Daily

(From a writer I like and respect, Seth Godin)

Can you build an asset for yourself every day?

Every single day?

What creates another piece of intellectual property that belongs to you?

What turns the assets you own into something more valuable?

What have you really learned?

Each day can accumulate into many days. Taking the long view also makes it easy to let ourselves procrastinate for one more day.

But the long run is built by persisting in the short run, day after day.

Multi-Class Classification Model for Human Activity Recognition with Smartphone Using R Take 3

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Human Activities with Smartphone Dataset is a multi-class classification situation where we are trying to predict one of the six possible outcomes.

INTRODUCTION: Researchers collected the datasets from experiments involving a group of 30 volunteers, each of whom performed six activities while wearing a smartphone on the waist. Using the smartphone’s embedded accelerometer and gyroscope, the researchers captured measurements for the activities of WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, and LAYING. The dataset has been randomly partitioned into two sets, where 70% of the volunteers were selected to generate the training data and 30% the test data.

In iteration Take1, the script focused on evaluating various machine learning algorithms and identifying the algorithm that produces the best accuracy metric. Iteration Take1 established a baseline performance regarding accuracy and processing time.

In iteration Take2, we examined the feasibility of using dimensionality reduction techniques to reduce the processing time while still maintaining an adequate level of prediction accuracy. The first technique we explored was to eliminate collinear attributes based on a threshold of 85%.

For this iteration, we will explore the dimensionality reduction technique of ranking the importance of the attributes with a gradient boosting tree method. Next, we will eliminate the features that do not contribute to a cumulative importance of 0.99.
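The cumulative-importance cutoff itself is straightforward. Here is a short Python sketch; the importance scores would come from the gradient boosting model, and the attribute names and numbers below are hypothetical:

```python
def select_by_cumulative_importance(importances, threshold=0.99):
    """Rank attributes by importance, descending, and keep them until their
    cumulative share of the total importance reaches the threshold."""
    total = sum(importances.values())
    ranked = sorted(importances, key=importances.get, reverse=True)
    kept, running = [], 0.0
    for name in ranked:
        kept.append(name)
        running += importances[name] / total
        if running >= threshold:
            break
    return kept

# Hypothetical importance scores for four attributes.
scores = {"tBodyAcc": 0.60, "tGravityAcc": 0.30,
          "tBodyGyro": 0.095, "fBodyAcc": 0.005}
print(select_by_cumulative_importance(scores))  # the 0.005 attribute is dropped
```

With a 0.99 threshold, only attributes in the long, nearly flat tail of the importance ranking get eliminated, which is why the accuracy loss tends to be small.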

CONCLUSION: From the previous iteration Take1, the baseline performance of the ten algorithms achieved an average accuracy of 91.67%. Three algorithms (Linear Discriminant Analysis, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 98.84%. Stochastic Gradient Boosting also processed the validation dataset with an accuracy of 95.49%, which was slightly below the accuracy from the training data.

From the previous iteration Take2, the baseline performance of the ten algorithms achieved an average accuracy of 90.83%. Three algorithms (Random Forest, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 98.07%. Using the optimized tuning parameter available, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 93.96%. The accuracy on the validation dataset was slightly worse than the training data and possibly due to over-fitting.

From the current iteration, the baseline performance of the ten algorithms achieved an average accuracy of 91.59%. The Random Forest and Stochastic Gradient Boosting algorithms achieved the top two accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 98.74%. Using the optimized tuning parameter available, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 93.42%. The accuracy on the validation dataset was slightly worse than the training data and possibly due to over-fitting.

From the model-building activities, the number of attributes went from 561 down to 79 after eliminating 482 variables that fell below the required importance. The processing time went from 8 hours 16 minutes in iteration Take1 down to 5 hours and 22 minutes in iteration Take3. That was also a noticeable reduction in comparison to Take2, which reduced the processing time down to 7 hours 15 minutes.

In conclusion, the importance ranking technique should have benefited the tree methods the most, and it did. Furthermore, by reducing the number of attributes, the modeling took a much shorter time to process yet still retained decent accuracy. For this dataset, the Stochastic Gradient Boosting algorithm should be considered for further modeling or production use.

Dataset Used: Human Activity Recognition Using Smartphone Data Set

Dataset ML Model: Multi-class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

One potential source of performance benchmarks: https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones

The HTML formatted report can be found here on GitHub.

Drucker on Executive Realities

In his book, The Essential Drucker: The Best of Sixty Years of Peter Drucker’s Essential Writings on Management, Peter Drucker analyzed the ways that management practices and principles affect the performance of organizations, individuals, and society. The book covers the basic principles of management and gives professionals the tools to perform the tasks that the environment of tomorrow will require of them.

These are my takeaways from reading the book.

Drucker had described the knowledge worker in a modern organization as an “executive.” The realities of the knowledge workers’ situation both demand effectiveness from the executives and make that effectiveness very difficult to achieve.

Many activities inside an organization have to do with effort and cost. To put it bluntly, the less (effort) an organization must do to produce results, the better it does its job. To produce the same result with less effort, the people (knowledge workers or executives) in an organization need to learn to be more effective.

There are four major realities over which the executives essentially have no control. These realities are part of the executives’ day and work. The executives have no choice but to “cooperate with the inevitable.” At the same time, every one of these four realities also exerts pressure on the executives. If the executives do not handle these realities effectively, they can expect very low performance and poor results.

Reality No.1: The executive’s time tends to belong to everybody else. We can also describe the executive as a captive of the organization. Everybody can move in on his time, and everybody does.

Reality No.2: Executives are forced to keep on “operating” unless they take positive action to change the reality in which they live and work. In most organizations, the executive’s work is mostly reactive. They often do not have control over the events or activities they must respond to and address.

Reality No.3: The executive is constantly being pushed toward effectiveness because he is effective only when other people make use of what he contributes. The knowledge worker’s output, by itself, usually does not produce the required results until someone else acts on it. This makes the executive’s time a resource for other people, depending on the results the organization seeks.

Reality No.4: The executive exists within an organization. Specifically, there are no results within the organization. All the results are on the outside. The only business results, for instance, are produced by a customer who converts the costs and efforts of the business into revenues and profits through his willingness to exchange his purchasing power for the products or services of the business.

Executives inside an organization cannot change these four realities. They are necessary conditions of the executives’ existence, but, if left unchecked, these four forces can also push the executives toward average or low performance. That means the knowledge workers in their organizations must make special efforts to learn to be effective, or ineffectiveness and poor results will surely set in.

Multi-Class Classification Model for Human Activity Recognition with Smartphone Using R Take 2

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Human Activities with Smartphone Dataset is a multi-class classification situation where we are trying to predict one of the six possible outcomes.

INTRODUCTION: Researchers collected the datasets from experiments involving a group of 30 volunteers, each of whom performed six activities while wearing a smartphone on the waist. Using the smartphone’s embedded accelerometer and gyroscope, the researchers captured measurements for the activities of WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, and LAYING. The dataset has been randomly partitioned into two sets, where 70% of the volunteers were selected to generate the training data and 30% the test data.

In iteration Take1, the script focused on evaluating various machine learning algorithms and identifying the algorithm that produces the best accuracy metric. Iteration Take1 established a baseline performance regarding accuracy and processing time. For this iteration, we will examine the feasibility of using dimensionality reduction techniques to reduce the processing time while still maintaining an adequate level of prediction accuracy. The first technique we will explore is to eliminate collinear attributes based on a threshold of 85%.
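As an illustration, a greedy pairwise version of this filter can be written in a few lines of Python. Note that caret’s findCorrelation() is more careful (it considers mean absolute correlations when deciding which member of a pair to drop), and the column names below are hypothetical:

```python
def drop_collinear(columns, threshold=0.85):
    """columns: dict of attribute name -> list of values. Drop the second
    member of every pair whose absolute Pearson correlation exceeds the
    threshold; return the names of the surviving attributes."""
    def corr(a, b):
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        sa = sum((x - ma) ** 2 for x in a) ** 0.5
        sb = sum((y - mb) ** 2 for y in b) ** 0.5
        return cov / (sa * sb) if sa and sb else 0.0

    names = list(columns)
    dropped = set()
    for i, a in enumerate(names):
        if a in dropped:
            continue
        for b in names[i + 1:]:
            if b not in dropped and abs(corr(columns[a], columns[b])) > threshold:
                dropped.add(b)  # keep the first attribute of the pair
    return [n for n in names if n not in dropped]

# Hypothetical columns: the second is perfectly collinear with the first.
cols = {"meanAccX": [1, 2, 3, 4, 5],
        "meanAccX2": [2, 4, 6, 8, 10],
        "meanGyroZ": [5, 1, 4, 2, 3]}
print(drop_collinear(cols))  # the redundant copy is dropped
```

Because the filter only looks at pairwise correlations among the attributes, it runs once before modeling and needs no target labels, which is what makes it the cheapest of the three reduction techniques tried in this series.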

CONCLUSION: From the previous iteration Take1, the baseline performance of the ten algorithms achieved an average accuracy of 91.67%. Three algorithms (Linear Discriminant Analysis, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 98.84%. Stochastic Gradient Boosting also processed the validation dataset with an accuracy of 95.49%, which was slightly below the accuracy from the training data.

From the current iteration, the baseline performance of the ten algorithms achieved an average accuracy of 90.83%. Three algorithms (Random Forest, Support Vector Machine, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 98.07%. Using the optimized tuning parameter available, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 93.96%, which was slightly worse than the accuracy from the training data and possibly due to over-fitting.

From the model-building activities, the number of attributes went from 561 down to 192 after eliminating 369 variables that are at least 85% collinear. The processing time went from 21 hours 43 minutes in iteration Take1 down to 7 hours and 15 minutes in iteration Take2. That was a reduction in model training and processing time of 66%.

In conclusion, the reduced set of attributes still achieved an acceptable level of accuracy. By reducing the collinearity, the modeling took a much shorter time to process yet still retained decent accuracy. For this dataset, the Stochastic Gradient Boosting algorithm should be considered for further modeling or production use.

Dataset Used: Human Activity Recognition Using Smartphone Data Set

Dataset ML Model: Multi-class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

One potential source of performance benchmarks: https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones

The HTML formatted report can be found here on GitHub.