But Are You Doing Your Job?

(From a writer I like and respect, Seth Godin)

Here is a hint: your real job may not be what you think it is.

Some doctors may believe their job is to cure diseases.

In fact, that is not what keeps patients coming back to you. Curing is a goal, and it matters, but it is not enough.

The technical tasks matter too, but the job involves more than that.

The doctors who contribute to academia, who are personable and always take a moment to do the emotional labor for their patients, who invest in hiring and training staff, and who put their offices at an easy-to-find crossroads of the medical community will always do better than the doctors who skip this extra work.

The same is true for the web designer who thinks the job is merely typing in good code, or the restaurant owner who focuses only on the food. These basics matter, but today the job involves far more than what a typical job description lists.

Doing your job is not always the same as merely doing your tasks. The “soft stuff” may matter more than you think. Doing the tasks is the ticket you buy for the privilege of doing the rest.

Binary Classification Model for Census Income Using Python Take 2

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Census Income dataset is a classic binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: This data was extracted from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over 50K a year.
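A minimal sketch of that record-selection rule as a pandas filter (the file name is illustrative, and the column names AAGE, AGI, AFNLWGT, and HRSWK follow the census notation quoted above rather than the published Adult file's labels):

```python
import pandas as pd

# Apply the extraction conditions quoted above as a boolean filter.
# The column names follow the census notation and are assumptions here.
raw = pd.read_csv("census_raw.csv")
clean = raw[(raw["AAGE"] > 16) & (raw["AGI"] > 100)
            & (raw["AFNLWGT"] > 1) & (raw["HRSWK"] > 0)]
```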

This dataset has many cells with missing values, so we will examine the models by imputing the missing cells with a default value. This iteration of the project will produce a set of results that we will use to compare with the baseline models from Take 1.
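As a minimal sketch of this imputation step, assuming pandas and the UCI adult.data file, where missing cells are marked with "?" (the choice of "unknown" as the default value is illustrative; the template's actual default may differ):

```python
import pandas as pd

# Read the raw file, treating the dataset's "?" markers as missing values.
df = pd.read_csv("adult.data", header=None, skipinitialspace=True,
                 na_values="?")

# Impute every missing cell with a default value. The gaps sit in
# categorical columns, so a constant placeholder is one reasonable default.
df = df.fillna("unknown")
```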

CONCLUSION: From the previous iteration (Take 1), the baseline performance of the ten algorithms achieved an average accuracy of 81.37%. Four ensemble algorithms (Bagged CART, Random Forest, AdaBoost, and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 86.99%. Using the optimized tuning parameter available, the Stochastic Gradient Boosting algorithm further processed the validation dataset with an accuracy of 87.23%, which was slightly better than the accuracy of the training data.

From this iteration (Take 2), the baseline performance of the ten algorithms achieved an average accuracy of 81.93%. Four ensemble algorithms (Bagged CART, Random Forest, AdaBoost, and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 87.31%. Using the optimized tuning parameter available, the Stochastic Gradient Boosting algorithm further processed the validation dataset with an accuracy of 87.57%, which was slightly better than the accuracy of the training data.
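A minimal sketch of the kind of tuning trial described above, assuming scikit-learn's GradientBoostingClassifier as the Stochastic Gradient Boosting implementation and preprocessed arrays X_train and y_train (the grid values are illustrative, not the project's actual settings):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative tuning grid; the actual trials may sweep other values.
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 7],
}

# subsample < 1.0 is what makes the gradient boosting "stochastic".
gbm = GradientBoostingClassifier(subsample=0.8, random_state=7)
search = GridSearchCV(gbm, param_grid, scoring="accuracy", cv=10, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_score_, search.best_params_)
```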

For this project, imputing the missing values appeared to have contributed to a slight improvement of the overall accuracy of the training model. The Stochastic Gradient Boosting ensemble algorithm continued to yield consistently top-notch training and validation results, which warrant the additional processing required by the algorithm.

Dataset Used: Census Income Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Census+Income

One potential source of performance benchmark: https://www.kaggle.com/uciml/adult-census-income

The HTML-formatted report can be found here on GitHub.

Making Ideas Travel, Part 1

In the podcast series, Seth Godin’s Startup School, Seth Godin gives a group of highly motivated early-stage entrepreneurs a guided tour of some of the questions they will have to dig deep and ask themselves as they build their businesses. Here are my takeaways from various topics discussed in the podcast episodes.

  • Ideas are like riders, and they need a vehicle to get to us. Our ideas can be terrific, but if we cannot connect them to a medium for them to travel, nothing happens. The questions to ask are… Is our idea a good one? Does it hold up strategically? What do we use to connect it and to help it travel?
  • The book “Positioning” says that one way to inject an idea into someone’s head is to find something that is already in that person’s head and hang the new idea right next to it. Many ideas have, over time, built their structure in people’s heads by connecting to somebody else’s idea.
  • When our idea already has competitors in the same space, we have an opportunity to make a guerrilla-marketing jujitsu move that uses their weight to carry our idea into someone else’s head. The approach is to make clever comparisons: while our competitors are big and wonderful, we set ourselves apart with a list of benefits they do not provide.
  • Another approach is never to mention the competitors. That approach can work but will take a lot longer and cost a lot more money. The good news is that we get to build our own thing.
  • When pitching our ideas, we need to be mindful of our audience. We need to do the hard part: spend less time describing what our idea is and more time on what our advantages are and why they matter. We need to weave together a story that shows our audience how our idea addresses the problem and that we have thought through the hard parts of the idea.
  • For the carrier of the idea, the simpler the technology, the better. Email is better than a website for carrying and reading an idea, so work on accumulating a list of people who want to hear from us. The problem with a website is that people must remember to go back to it, and they usually will not. People also expect a website to look pretty and polished, and a website is hard to share. Properly formed HTML emails have none of these problems, and they are easy for people to pass along.
  • In a world full of breaking news, our message will not be another piece of breaking news. Instead, it should be news that is important to someone. We need to shape our messages so that the recipients want to share them. Numbers by themselves usually do not resonate with people; the story behind the numbers is what people care about. People want a story they can build on and brag about to their employees, their co-workers, and their boss: the story of the money spent here and why that money was spent smarter. People also want to hear the details, details about our successes or the promise of success.

Origin Stories

In his podcast, Akimbo, Seth Godin teaches us how to adopt a posture of possibility, change the culture, and choose to make a difference. Here are my takeaways from the episode.

  • Everywhere we look there are origin stories and the heroes in them. The origin story often has two components. The first part is something that happened to our hero, and the second part is a choice our hero made. That choice opens doors and possibilities and sets the hero on a path to his or her destination.
  • These stories have far more impact on our choices and our culture than we usually give them credit for. Our origin story can change how we see ourselves and the path going forward. For many of us, the idea that we can choose a narrative (our origin story) begins to open doors. The story can change how we see our next challenge.
  • The Stanford marshmallow experiment was educational and gave us important insights into this topic. First, it would be wrong to conclude that there is something innate about somebody’s ability to wait for rewards. Our tendency to practice delayed gratification has more to do with the origin story we tell ourselves. If we grow up in an environment where resource scarcity rules and trust is difficult to come by, it should not surprise us when we act on our survival-first instinct.
  • The origin story impacts how an organization sees itself and which direction it takes. Western Union/telephone, AOL/web, and Yahoo/search are some enlightening examples where the organizational origin story was too entrenched for the organization to see clearly. In these three examples, the companies, once considered industry leaders, either failed or refused to see how they needed to change their stories when the technology or industry shifted.
  • We need to examine our origin story from time to time and think critically about how it affects the way we see our environment and challenges. What was the event that got us down this path, and what was the choice we made after it occurred? We also need to realize that our origin story does not have to match our ultimate destiny. We can change our origin story, and we can make a choice for ourselves and for the people we lead.

Binary Classification Model for Census Income Using Python Take 1

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Census Income dataset is a classic binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: This data was extracted from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over 50K a year.

This dataset has many cells with missing values, so we will examine the models by deleting the rows with missing cells. This iteration of the project will produce a set of baseline results that we can use to compare with other data cleaning methods.
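A minimal sketch of this baseline cleaning step, again assuming pandas and the UCI adult.data file with "?" as the missing-value marker:

```python
import pandas as pd

# Treat "?" as missing, then drop every row that contains a missing cell,
# per this iteration's baseline approach.
df = pd.read_csv("adult.data", header=None, skipinitialspace=True,
                 na_values="?")
df = df.dropna()
```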

CONCLUSION: The baseline performance of the ten algorithms achieved an average accuracy of 81.37%. Four ensemble algorithms (Bagged CART, Random Forest, AdaBoost, and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 86.99%. Using the optimized tuning parameter available, the Stochastic Gradient Boosting algorithm further processed the validation dataset with an accuracy of 87.23%, which was slightly better than the accuracy of the training data.
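As a hedged sketch of the spot-checking step (assuming scikit-learn, encoded arrays X_train and y_train, and 10-fold cross-validation; only a few of the ten candidate algorithms are shown):

```python
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# A subset of the candidates; the project compares ten algorithms in total.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "Bagged CART": BaggingClassifier(),
    "RF": RandomForestClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "SGB": GradientBoostingClassifier(subsample=0.8),
}

# Report mean and standard deviation of cross-validated accuracy.
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=10,
                             scoring="accuracy")
    print(f"{name}: {scores.mean():.4f} ({scores.std():.4f})")
```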

For this project, the Stochastic Gradient Boosting ensemble algorithm yielded consistently top-notch training and validation results, which warrant the additional processing required by the algorithm.

Dataset Used: Census Income Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Census+Income

One potential source of performance benchmark: https://www.kaggle.com/uciml/adult-census-income

The HTML-formatted report can be found here on GitHub.

The Customer Development Manifesto

In the book, The Startup Owner’s Manual, authors Steve Blank and Bob Dorf outline 14 rules that make up The Customer Development Manifesto. The manifesto lays out the must-have elements for startups to follow in order to avoid costly mistakes.

I also found a somewhat expanded list on Steve Blank’s website and thought I would keep both lists close by for easy reference.

  1. There Are No Facts Inside Your Building, So Get Outside.
  2. Pair Customer Development with Agile Development
  3. Failure is an Integral Part of the Search
  4. Make Continuous Iterations and Pivots
  5. No Business Plan Survives First Contact with Customers So Use a Business Model Canvas
  6. Design Experiments and Test to Validate Your Hypotheses
  7. Agree on Market Type. It Changes Everything
  8. Startup Metrics Differ from Those in Existing Companies
  9. Fast Decision-Making, Cycle Time, Speed and Tempo
  10. It’s All About Passion
  11. Startup Job Titles Are Very Different from a Large Company’s
  12. Preserve All Cash Until Needed. Then Spend.
  13. Communicate and Share Learning
  14. Customer Development Success Begins with Buy-In

Multi-Class Classification Model for Wine Quality Using R

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Wine Quality dataset is a multi-class classification situation where we are trying to predict one of the three possible outcomes (cheap, average, and good).

INTRODUCTION: The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.). The goal is to model wine quality based on physicochemical tests.

From the previous iteration, we approached the dataset as a regression problem and tried to predict the wine quality (a continuous numeric variable) with the least mean squared error. While regression is one approach for assessing wine quality, quality expressed in pure numbers and fractions is difficult for people to grasp fully.

For this iteration of the project, we will approach this dataset as a multi-class problem and attempt to classify the wine quality into one of the three rating categories: 1-Good (quality 7 or above), 2-Average (quality of 5-6), and 3-Cheap (quality 4 or below).
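The project itself is implemented in R, but purely as an illustration of the binning rule, here is a minimal Python sketch (the file name and semicolon separator match the UCI red-wine file; the label strings are this project's three categories):

```python
import pandas as pd

def rate(quality: int) -> str:
    # Map the 0-10 quality score onto the project's three categories.
    if quality >= 7:
        return "1-Good"
    if quality <= 4:
        return "3-Cheap"
    return "2-Average"

# The UCI wine-quality files are semicolon-separated.
wine = pd.read_csv("winequality-red.csv", sep=";")
wine["rating"] = wine["quality"].apply(rate)
```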

CONCLUSION: The baseline performance of the seven algorithms achieved an average accuracy of 80.08%. Three ensemble algorithms (Bagged Decision Trees, Random Forest, and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Random Forest turned in the top result using the training data. It achieved an average accuracy of 85.03%. With the optimized tuning parameter available, the Random Forest algorithm processed the validation dataset with an accuracy of 83.98%, which was slightly worse than the accuracy of the training data.

For this project, predicting whether a bottle of wine would be good, average, or cheap appears to be more intuitive than simply predicting a numerical quality score. The Random Forest ensemble algorithm yielded consistently top-notch training and validation results, which warrant the additional processing required by the algorithm.

Dataset Used: Wine Quality Data Set

Dataset ML Model: Multi-class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/wine+quality

One potential source of performance benchmarks: https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009

The HTML-formatted report can be found here on GitHub.

Two Simple Secrets to Having Good Ideas

(From a writer I like and respect, Seth Godin)

Secret one:

The big one: have bad ideas too. The more bad ideas, the better. If you work hard at it, you will come up with some bad ideas, but sooner or later some better ideas tend to follow. That is far easier than trying to summon good ideas without the spark of “bad ideas” to set them off.

Secret two:

Even more important: learn to be generous. Generously coming up with good ideas for other people is easier, and more effective, than trying to come up with them only for yourself. When we think on someone else’s behalf, that posture of caring brings insight and fresh perspectives more readily. New lines of thought come from it, and it can also free you from being stuck in your own thinking.