Updated Machine Learning Templates v11 for Python

As I work on practicing and solving machine learning (ML) problems, I find myself repeating the same set of steps and activities.

Thanks to Dr. Jason Brownlee’s suggestions on creating a machine learning template, I have pulled together a set of project templates that can be used to support regression ML problems using Python.

Version 11 of the templates contains minor adjustments and corrections to the previous version. The new templates also add or update sample code to support:

  • Re-organization of the training and test dataset naming scheme (X_train, y_train, X_test, y_test, etc.), so the names are more consistent with common industry practice (see the sketch after this list).
  • Re-organization of the feature selection and data transformation sections, so the overall flow of the script is more logical and consistent.
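
As a rough sketch of what these conventions look like in practice, here is a minimal scikit-learn regression example; the dataset, feature-selection settings, and model below are illustrative placeholders, not the templates themselves.

# Minimal sketch of the revised naming and flow (illustrative only; the actual
# templates are on the Machine Learning Project Templates page).
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge

# Training and test datasets follow the common X_train/y_train/X_test/y_test naming
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

# Feature selection and data transformation are grouped into one consistent pipeline
model = Pipeline([
    ("select", SelectKBest(score_func=f_regression, k=6)),
    ("scale", StandardScaler()),
    ("ridge", Ridge()),
])
model.fit(X_train, y_train)
print("R^2 on the test set:", model.score(X_test, y_test))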

You can find the Python templates on the Machine Learning Project Templates page.

Possibility and Enrollment

In his podcast, Akimbo, Seth Godin teaches us how to adopt a posture of possibility, change the culture, and choose to make a difference. Here are my takeaways from the episode.

In this episode, Seth discusses possibility and enrollment: what they mean and how they relate.

As we build our culture and teach our children, the notion of possibility is critically important. Possibility means that when we invest the effort, something might come of it. Possibility is also about openness and fairness.

The richness we enjoy today in literature, music, and knowledge exists not because it came naturally but because someone made it possible. We need to teach people to realize that better things are possible.

Enrollment is a critical element that is mostly missing from many human endeavors, such as education, marketing, and politics. Enrollment means that something we do is voluntary. When we do something that could turn into a possibility, it inevitably gets difficult. Enrollment is what pushes us through that difficulty.

Enrollment is hard because we need to commit. We need to bring grit to the table. Grit means showing up and facing possible rejection. Grit means doing it wrong, maybe over and over, before doing it right. Grit also means continually trying to discover the best way forward.

We can think of how possibility and enrollment interact with the work we do in one of four ways.

(High Possibility and High Enrollment) Entrepreneurship is one such example. We can imagine the possibility of coming out of the dip on the other side. We are also eager to put forth the effort required to get through the dip.

(Low Possibility and Low Enrollment) This usually results in some work that is too exotic or obscure to be valuable. Also, there is probably not a good system set up to facilitate those who want to take on this type of endeavor.

(Low Possibility and High Enrollment) Sports and acting are two such examples. Many people may want to be an NBA star, but very few succeed. This type of work usually requires some innate talent and a tremendous amount of preparation and practice.

(High Possibility and Low Enrollment) Nursing and many other professions fall into this category. The profession can be rewarding once we have mastered it, but relatively few people are willing to invest the effort.

Learning and education work best when enrollment is high, but today’s education focuses mostly on compliance and outdated measurements. We have been teaching our children to get good grades or just to survive school, but neither leads to meaningful learning.

Leadership work is also a function of possibility and enrollment. Management work requires neither. The people who work for a manager do not necessarily see the possibility, nor are they enrolled in the journey. They are just there to do their work and get paid.

A leader needs to show people the possibilities that come from the change we seek to make and set up a system of enrollment for those who want to be on that same journey. When people engage with us to go somewhere that we are not sure will work, we need a system of enrollment to help them. The cultural system we create for enrollment can reinforce the sense of enrollment: people like us are marching along and lining up toward this cause.

We can serve up the possibilities on a platter without enrollment, and people will probably take them. However, when the going gets difficult, people will probably bail out. The grit required for a worthwhile journey is expensive, and not many people are willing to expend the energy on something that might not work.

When we design a system for change, we need to build in demand creation, the features, the benefits, and so on. We need those elements to communicate the possibilities that can come from the change. Just as importantly, we need to build a systemic cultural approach that creates enrollment: a system that makes it clear that “people like us extend ourselves through things like this,” and that shows us our status in the hierarchy and our position among the people we care about.

Creating possibilities and forming enrollment are hard work. Once we get on track, enrollment begets more enrollment and possibility begets more possibility. When we turn on the lights for ourselves, we do it for other people as well.

Multi-Class Classification Model for Forest Cover Type Using R Take 1

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Forest Cover Type dataset is a multi-class classification situation where we are trying to predict one of the seven possible outcomes.

INTRODUCTION: This experiment tries to predict forest cover type from cartographic variables only. This study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so that existing forest cover types are more a result of ecological processes rather than forest management practices.

The actual forest cover type for a given observation (30 x 30 meter cell) was determined from the US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from the US Geological Survey (USGS) and USFS data. Data is in raw form (not scaled) and contains binary (0 or 1) columns of data for qualitative independent variables (wilderness areas and soil types).

ANALYSIS: The baseline performance of the machine learning algorithms achieved an average accuracy of 78.04%. Two algorithms (Random Forest and Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result and achieved an accuracy metric of 85.48%. By using the optimized parameters, the Random Forest algorithm processed the testing dataset with an accuracy of 86.07%, which was even better than the predictions from the training data.

CONCLUSION: For this iteration, the Random Forest algorithm achieved the best overall results using the training and testing datasets. For this dataset, Random Forest should be considered for further modeling.

Dataset Used: Covertype Data Set

Dataset ML Model: Multi-Class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Covertype

One source of potential performance benchmarks: https://www.kaggle.com/c/forest-cover-type-prediction/overview

The HTML formatted report can be found here on GitHub.

Web Scraping of useR! 2019 Conference Using Python

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The web scraping Python code leverages the BeautifulSoup module.

INTRODUCTION: The useR! Conference is an annual meeting that focuses on supporting the R community and ecosystem. This web scraping script automatically traverses the entire web page, collects all links to the PDF and PPTX documents, and downloads the documents as part of the scraping process. The Python script ran in the Google Colaboratory (Colab) environment and can be adapted to run in any Python environment without the Colab-specific configuration.

Starting URLs: https://user2019.r-project.org/talk_schedule/
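
The following is a minimal sketch of the approach, assuming the schedule page links to the PDF and PPTX files directly; the actual script on GitHub may organize the traversal and download steps differently.

# Minimal sketch: collect and download PDF/PPTX links from the talk schedule page.
# Assumes the documents are linked directly from the page; the real script may differ.
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

START_URL = "https://user2019.r-project.org/talk_schedule/"

response = requests.get(START_URL)
soup = BeautifulSoup(response.text, "html.parser")

# Gather every link that points to a PDF or PPTX document
doc_links = [
    urljoin(START_URL, a["href"])
    for a in soup.find_all("a", href=True)
    if a["href"].lower().endswith((".pdf", ".pptx"))
]

# Download each document into a local folder
os.makedirs("downloads", exist_ok=True)
for link in doc_links:
    filename = os.path.join("downloads", link.rsplit("/", 1)[-1])
    with open(filename, "wb") as f:
        f.write(requests.get(link).content)
    print("Saved", filename)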

The source code and HTML output can be found here on GitHub.

The Solo Marathon

(From a writer I respect, Seth Godin)

The marathons people usually run, the popular kind, are organized and staged by a team.

They have a start time.

A finish line.

A way to qualify.

A fixed route.

A group of people who sign up.

And a date announced a year in advance.

In most cases, these marathons are full of excitement, energy, and peer pressure.

But there is another kind of marathon, a race anyone can enter. On any given day, put on your running shoes, run out the door, and come back twenty-six miles later. These are quite rare.

It is worth noting that much of what we do in creating projects, starting businesses, or building careers closely resembles this second kind of marathon.

No wonder the solo marathon is so hard to sustain.

Multi-Class Classification Model for Forest Cover Type Using Python Take 1

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Forest Cover Type dataset is a multi-class classification situation where we are trying to predict one of the seven possible outcomes.

INTRODUCTION: This experiment tries to predict forest cover type from cartographic variables only. This study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so that existing forest cover types are more a result of ecological processes rather than forest management practices.

The actual forest cover type for a given observation (30 x 30 meter cell) was determined from the US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from the US Geological Survey (USGS) and USFS data. Data is in raw form (not scaled) and contains binary (0 or 1) columns of data for qualitative independent variables (wilderness areas and soil types).

ANALYSIS: The baseline performance of the machine learning algorithms achieved an average accuracy of 78.04%. Two algorithms (Bagged Decision Trees and Extra Trees) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Extra Trees turned in the top overall result and achieved an accuracy metric of 85.80%. By using the optimized parameters, the Extra Trees algorithm processed the testing dataset with an accuracy of 86.50%, which was even better than the predictions from the training data.
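
As a rough sketch of how such a baseline and tuning comparison might be spot-checked with scikit-learn: the data file name, fold count, and parameter grid below are illustrative placeholders, not the settings used in the actual project.

# Rough sketch of spot-checking and tuning an Extra Trees classifier with cross-validation.
# The data loading, folds, and parameter grid are illustrative placeholders.
import pandas as pd
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from sklearn.ensemble import ExtraTreesClassifier

# Covertype data: cartographic attributes plus the cover-type label in the last column
df = pd.read_csv("covtype.csv")
X, y = df.iloc[:, :-1], df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

# Baseline accuracy via k-fold cross-validation on the training data
baseline = cross_val_score(ExtraTreesClassifier(random_state=7), X_train, y_train,
                           cv=5, scoring="accuracy")
print("Baseline CV accuracy: %.4f" % baseline.mean())

# Simple tuning trial over the number of trees
grid = GridSearchCV(ExtraTreesClassifier(random_state=7),
                    {"n_estimators": [100, 300, 500]},
                    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print("Tuned CV accuracy: %.4f" % grid.best_score_)
print("Test accuracy: %.4f" % grid.score(X_test, y_test))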

CONCLUSION: For this iteration, the Extra Trees algorithm achieved the best overall results using the training and testing datasets. For this dataset, Extra Trees should be considered for further modeling.

Dataset Used: Cover Type Data Set

Dataset ML Model: Multi-Class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Covertype

One source of potential performance benchmarks: https://www.kaggle.com/c/forest-cover-type-prediction/overview

The HTML formatted report can be found here on GitHub.

Drucker on Being the Change Leader, Part 5

In his book, Management Challenges for the 21st Century, Peter Drucker analyzed and discussed the new paradigms of management.

Although much of the discussion revolves around the perspective of the organization, these are my takeaways on how we can apply his teachings to our journey as knowledge workers.

Drucker asserted that “One cannot manage change. One can only be ahead of it.”

While many of us seek the comfort of stability and status quo, the world rarely cares about what we want. In a period of upheavals with rapid change being the norm, the only ones who survive are the Change Leaders.

A change leader sees change as an opportunity. A change leader looks for change, learns how to find the right changes, and works to make them effective both outside and inside the organization. Change leaders need to be aware of these four elements.

Policies to make the future

Systematic methods to look for and to anticipate change

The right way to introduce change, both within and outside the organization

Policies to balance change and continuity

In the end, Drucker asserted that we face long years of profound changes in demographics, in politics, in society, in philosophy and, above all, in worldview. Changes in belief are difficult to theorize about while the change is underway. Only when such a period is over, perhaps decades later, do theories begin to emerge that explain what has happened.

At the same time, it is futile to try to ignore the changes and to pretend that tomorrow will be like yesterday. This is the position that existing institutions tend to adopt in such a period of change. When an organization suffers from such a delusion, it becomes a visible target for a disruptor or challenger to take its place in the market.

The only thing we can confidently predict is that many of today’s leaders in all areas are unlikely to still be around in thirty years, and certainly not in their present form. Trying to anticipate what the changes will be is equally difficult. These changes are not predictable.

This leads us to the only change management policy likely to succeed: trying to make the future. Even with the constraints we face in our environment, Drucker believed the future is still malleable. We can still create the future we seek.

This brings us to Drucker’s final point about being the change leader. Trying to make the future can be highly risky. However, it is less risky than simply doing nothing or pretending that the changes will not affect us.

Web Scraping of useR! 2019 Conference Using R

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The web scraping R code leverages the rvest package.

INTRODUCTION: The useR! Conference is an annual meeting that focuses on supporting the R community and ecosystem. This web scraping script automatically traverses the entire web page, collects all links to the PDF and PPTX documents, and downloads the documents as part of the scraping process.

Starting URLs: https://user2019.r-project.org/talk_schedule/

The source code and HTML output can be found here on GitHub.