Binary Classification Model for MiniBooNE Particle Identification Using Python Take 1

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The MiniBooNE Particle Identification dataset is a classic binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: This dataset is taken from the MiniBooNE experiment and is used to distinguish electron neutrinos (signal) from muon neutrinos (background). The data file is set up as follows. In the first line is the number of signal events followed by the number of background events. The records with the signal events come first, followed by the background events. Each line, after the first line, has the 50 particle ID variables for one event.
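
The file layout described above can be turned into features and labels with a few lines of Python. This is a minimal sketch, not the code used in the project: it assumes the values are whitespace-separated, and the function name `parse_miniboone` and the 1 = signal / 0 = background encoding are illustrative choices.

```python
from typing import Iterable, List, Tuple


def parse_miniboone(lines: Iterable[str]) -> Tuple[List[List[float]], List[int]]:
    """Parse the layout described above into feature rows and 0/1 labels."""
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        raise ValueError("empty input")
    # First line: number of signal events, then number of background events.
    n_signal, n_background = int(rows[0][0]), int(rows[0][1])
    features = [[float(value) for value in row] for row in rows[1:]]
    if len(features) != n_signal + n_background:
        raise ValueError("event count does not match the header line")
    # Signal records come first, so the labels follow from position alone.
    # Assumed encoding: 1 = signal, 0 = background.
    labels = [1] * n_signal + [0] * n_background
    return features, labels


# Tiny made-up sample with 3 variables per event instead of 50.
sample = """2 1
0.1 0.2 0.3
0.4 0.5 0.6
0.7 0.8 0.9
"""
X, y = parse_miniboone(sample.splitlines())
print(len(X), y)  # 3 [1, 1, 0]
```

Because the labels depend only on row position, the rows would normally be shuffled before splitting into training and testing sets.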

ANALYSIS: The baseline performance of the machine learning algorithms achieved an average accuracy of 90.58%. Two algorithms (Bagged CART and Stochastic Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top overall result and achieved an accuracy metric of 93.95%. With the optimized parameters, the Stochastic Gradient Boosting algorithm processed the testing dataset with an accuracy of 93.85%, which was just slightly below the accuracy on the training data.

CONCLUSION: For this iteration, the Stochastic Gradient Boosting algorithm achieved the best overall results using the training and testing datasets. For this dataset, the Stochastic Gradient Boosting algorithm should be considered for further modeling or production use.

Dataset Used: MiniBooNE particle identification Data Set

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/MiniBooNE+particle+identification

The HTML formatted report can be found here on GitHub.

Web Scraping of NeurIPS Proceedings Using Python and Scrapy

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The web scraping Python code leverages the Scrapy framework.

INTRODUCTION: The Neural Information Processing Systems Conference (NeurIPS) hosts its collections of papers on the website, https://papers.nips.cc/. This web scraping script will automatically traverse through the listing and individual paper pages of the 2017 conference and collect all links to the PDF documents. The script will also download the PDF documents as part of the scraping process.
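
The link-collection step can be illustrated without any network access. The sketch below is not the project's Scrapy spider: it swaps in the standard-library `html.parser` module and works on an HTML string. The class name `PdfLinkCollector`, the helper `collect_pdf_links`, and the sample markup are hypothetical, and the real site's page structure may differ.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collect absolute URLs of <a href="..."> links that end in .pdf."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.pdf_links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".pdf"):
            # Resolve relative links against the page they were found on.
            self.pdf_links.append(urljoin(self.base_url, href))


def collect_pdf_links(html: str, base_url: str) -> List[str]:
    """Return every PDF link found in one page of HTML."""
    parser = PdfLinkCollector(base_url)
    parser.feed(html)
    return parser.pdf_links


# Hypothetical page fragment; the real site's markup may differ.
page = (
    '<ul><li><a href="/paper/1234-example.pdf">[PDF]</a></li>'
    '<li><a href="/paper/1234-example">Abstract</a></li></ul>'
)
print(collect_pdf_links(page, "https://papers.nips.cc/"))
# ['https://papers.nips.cc/paper/1234-example.pdf']
```

In a full crawler, the same function would be applied to each paper page reached from the listing page, and the collected URLs would then be passed to a download step.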

Starting URLs: https://papers.nips.cc/book/advances-in-neural-information-processing-systems-30-2017

The source code and JSON output can be found here on GitHub.

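Illusory Limits

(From a writer I like and respect, Seth Godin)

You will certainly run into real limits. You cannot make yourself invisible, lift three thousand pounds, or perform a heart transplant with a steak knife.

But real limits are easy to tell apart. We rarely fail to notice them.

Illusory limits, or the limits we let other people place on us, are where the real problem lies. Even when a limit is backed by good intentions, by well-meaning attempts to spare us heartbreak or wasted effort, these well-intentioned limits can become a habit, and not a useful one.

Yesterday I received a letter from a teacher at York Community College. He wrote, “Encouraging anyone to become a linchpin is very bad advice, whether for an individual to pursue or for a company to allow. Before you urge others to put this idea into practice, please get the stakes straight.”

On behalf of the eager students in that teacher’s class, I feel frustrated and sad. These are students who pay out of their own pockets, carve time out of work and family, and work hard to move themselves up. The sad part is that they have run into a teacher who does not believe they can make a difference.

There is no doubt that industrialists have long made huge profits from interchangeable workers with low skills and low pay. But that does not mean you need to become one of those interchangeable workers.

And there is also no doubt that most organizations only need to do the work they did yesterday, perhaps faster or cheaper. But that does not mean this has to be the work you seek.

The goal of becoming a linchpin is to make life better by making better things. To search at the edge of success, overlooking no possibility, in order to create and contribute, to learn and to deliver results.

Does this approach always succeed? Not necessarily, and often it does not come close. But it is a path with a chance of success. If you trust someone to teach you how to do something, it helps a great deal when they believe you can succeed, and when they believe you are capable of doing the hard work or solving the hard problem.

The future is defined by the people who want to change the past. We need you to take the lead.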

Binary Classification Model for MiniBooNE Particle Identification Using R Take 1

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The MiniBooNE Particle Identification dataset is a classic binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: This dataset is taken from the MiniBooNE experiment and is used to distinguish electron neutrinos (signal) from muon neutrinos (background). The data file is set up as follows. In the first line is the number of signal events followed by the number of background events. The records with the signal events come first, followed by the background events. Each line, after the first line, has the 50 particle ID variables for one event.

ANALYSIS: The baseline performance of the eight algorithms achieved an average accuracy of 90.82%. Two algorithms (Bagged CART and Random Forest) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Random Forest turned in the top result using the training data. It achieved an average accuracy of 93.74%. With the optimized tuning parameters, the Random Forest algorithm processed the testing dataset with an accuracy of 93.91%, which was even better than the accuracy on the training data.

CONCLUSION: For this iteration, the Random Forest algorithm achieved the best overall results using the training and testing datasets. For this dataset, the Random Forest algorithm should be considered for further modeling or production use.

Dataset Used: MiniBooNE particle identification Data Set

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/MiniBooNE+particle+identification

The HTML formatted report can be found here on GitHub.

Drucker on Managing Our Time

In his book, The Essential Drucker: The Best of Sixty Years of Peter Drucker’s Essential Writings on Management, Peter Drucker analyzed the ways that management practices and principles affect the performance of organizations, individuals, and society. The book covers the basic principles of management and gives professionals the tools to perform the tasks that the environment of tomorrow will require of them.

These are my takeaways from reading the book.

In the chapter “Know Your Time,” Drucker discussed the three-step time management process as the foundation of executive effectiveness.

  • Recording time
  • Managing time
  • Consolidating time

For recording how time is used, Drucker suggested three diagnostic questions that every knowledge worker should ask themselves. Those questions deal with the unproductive and time-consuming activities over which every knowledge worker has some control.

Drucker also believed that managers need to be equally concerned with time-loss that results from poor management and deficient organization. He outlined the four major time-wasters caused by management and organizational deficiency.

  1. Organizational time-wasters result from a lack of system or foresight. Drucker suggested that the symptom to look for is the recurrent “crisis,” the crisis that comes back year after year. A recurrent crisis should always have been foreseen. It can then either be prevented or reduced to a routine that clerks can manage. The recurrent crisis is simply a symptom of carelessness and laziness.
  2. Time-waste results from overstaffing. If the senior leaders in the group spend more than a small fraction of their time (perhaps one-tenth suggested by Drucker) on “problems of human relations,” on feuds and frictions, or jurisdictional disputes and questions of cooperation, it is a clear indication that the workforce may be too large. In those situations, people get into each other’s way and become an impediment to performance.
  3. Another common time-waster is what Drucker called “mal-organization.” One major symptom is an excess of meetings. Drucker asserted that meetings are, by definition, a concession to deficient organization. We have time to either meet or work, but not both at the same time. Above all, meetings must be the exception rather than the rule. Managers should never allow meetings to become the main demand on a knowledge worker’s time. Too many meetings are always indicative of poor job structure or ineffective organizational components.
  4. The last major time-waster is a malfunction in information. In this case, the required information for the work does not flow to where it is needed, resulting in duplicated work or missed opportunities.

Eliminating time-wasting management defects can be fast or slow. Either way, such diligence can yield dividends for the affected groups or an entire organization.

Web Scraping of NeurIPS Proceedings Using R

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The web scraping R code leverages the rvest package.

INTRODUCTION: The Neural Information Processing Systems Conference (NeurIPS) hosts its collections of papers on the website, https://papers.nips.cc/. This web scraping script will automatically traverse through the listing and individual paper pages of the 2016 conference and collect all links to the PDF documents. The script will also download the PDF documents as part of the scraping process.

Starting URLs: https://papers.nips.cc/book/advances-in-neural-information-processing-systems-29-2016

The source code and JSON output can be found here on GitHub.

How to get into a Famous College

In his podcast, Akimbo, Seth Godin teaches us how to adopt a posture of possibility, change the culture, and choose to make a difference. Here are my takeaways from the episode.

In this podcast, Seth helps us think about the college admission process and the approach of positioning ourselves for our desired opportunity. We can leverage a similar tactic for going after a job or a project that we would like to be part of.

We can think of the college admission process as a game. There are rules, explicit or unspoken, for getting into the ranked college that we desire. As it turns out, college ranking is also a game, and schools often manipulate those elements to improve their ranking. The process of picking the students might not be fair, might not be accurate, and might not even be right, but it is a game.

Colleges have only a few metrics to work with when granting admission to students. Many schools focus primarily on academic and testing achievements, but there is another door. The schools reserve that door for people whom they consider to be interesting candidates. Those candidates are people who would end up becoming extraordinary alumni. They would make the classroom and learning environment interesting. Most of all, these are the people who will be “people like us” when the school wants to brag about its student body.

The first half of the approach is to teach our children to focus on non-routine activities. These activities take advantage of the fact that teenagers are smart and have plenty of spare time. They take advantage of the fact that a teenager can be goal-directed and figure out how to start something. The activities also show a teenager who is generous and can figure out how to connect with others, how to organize, and how to make change happen. In the end, these activities demonstrate clearly that the student is the kind of self-directed person who can get something done if she cares about it.

Another way is to work on ideas and figure out the areas where the student wants to do research. Focus on ideas that involve experimentation and a methodical approach. The students do not need to work on ideas that require a lot of money, just those that require effort and intelligence. Doing research is one route that someone who is passionate about making changes can take.

The second half of this approach is to figure out how to let the school know that this is our path. Find someone at the college that we choose, like a professor. Study the professor’s work and decide whether we are truly interested in it. If so, correspond with the professor about that work. Go deep into the work. Try to understand what they are trying to teach and ask interesting questions. Keep the correspondence going, so that it is useful for both parties. If possible, help the professor make connections. After months of this correspondence, it is quite appropriate to show our interest in the school and ask for a referral. The credibility and trust we have built in the working relationship with the professor might propel us to the opportunity we are seeking.

In the end, the journey and effort will be worth it. It is worth it to become the kind of person who organizes something or builds projects that might not work. By taking the initiative, we are not acting like a cog in the system. It is also helpful for kids to become self-directed, generous individuals who can easily prioritize and are able to navigate the adult world without being filled with fear. What we get to do as parents is figure out how to create that environment, where our kids eagerly become capable individuals trying to find an organization where they can make a difference.

The same math holds when we show up at an interview with a resume. Our resume is just like the test score, and the HR department is the admissions office. The HR department is much more concerned with filling slots with the cheapest competent people they can find.

On the other hand, if we can build a body of work that is irresistible, generous, remarkable, and game-changing, people will call us. They will call us because there are some jobs where they need somebody who possesses the skills demonstrated in our work. They will want us to help them by doing the same thing for them.

We have so many degrees of freedom available to us these days, and too often we let the prevailing power structure of the culture dictate what we do next. When we spend a quarter of a million dollars and four of the best years of our lives working our way through an institution, we have a safe space. In that space, we get a chance to act in ways that scare us but can create positive value for the people around us. And when our students leave that institution, they are ready to walk into the world not as a cog, but as a leader.

Binary Classification Model for Caravan Insurance Marketing Using R Take 2

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Insurance Company Benchmark dataset is a classic binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: This dataset, which was used in the CoIL 2000 Challenge, contains information on customers of an insurance company. The data consist of 86 variables and include product usage data and socio-demographic data derived from zip codes.

The data was supplied by the Dutch data mining company Sentient Machine Research and is based on a real-world business problem. The training set contains over 5000 descriptions of customers, including the information of whether they have a caravan insurance policy. A test dataset contains another 4000 customers whose information will be used to test the effectiveness of the machine learning models.

The insurance organization collected the data to answer the following question: Can we predict who would be interested in buying a caravan insurance policy and give an explanation why?

In iteration Take1, we had algorithms with high accuracy but with strong biases due to the imbalance of our dataset. For this iteration, we will examine the feasibility of using the SMOTE technique to balance the dataset.
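
The core idea behind SMOTE can be shown in a few lines: each synthetic minority sample is placed at a random point on the line segment between an existing minority sample and one of its nearest minority neighbors. The sketch below is a simplified pure-Python illustration under that description, not the R implementation used in this project. The function name `smote`, its parameters, and the sample points are hypothetical, and it treats every attribute as numeric.

```python
import math
import random
from typing import List, Sequence

Point = Sequence[float]


def smote(minority: List[Point], n_new: int, k: int = 5, seed: int = 0) -> List[List[float]]:
    """Create n_new synthetic minority samples by neighbor interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point, excluding itself.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Made-up minority-class points with two numeric attributes.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0]]
new_points = smote(minority, n_new=4, k=2)
print(len(new_points))  # 4
```

Oversampling of this kind is normally applied to the training split only, so that the validation data keeps its original class balance.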

ANALYSIS: From the Take1 iteration, the baseline performance of the seven algorithms achieved an average ROC score of 0.6965. Two algorithms, Decision Tree and Random Forest, achieved the top two ROC scores after the first round of modeling. After a series of tuning trials, Random Forest yielded the top result using the training data. It achieved a ROC score of 0.7159. After using the optimized tuning parameters, the Random Forest algorithm processed the validation dataset with a ROC score of 0.5285, which was significantly below the result from the training data.

From the current iteration, the baseline performance of the seven algorithms achieved an average ROC score of 0.6965. Two algorithms, Decision Tree and Random Forest, achieved the top two ROC scores after the first round of modeling. After a series of tuning trials, Random Forest yielded the top result using the training data. It achieved a ROC score of 0.9243. After using the optimized tuning parameters, the Random Forest algorithm processed the validation dataset with a ROC score of 0.5746, which was significantly below the result from the training data.
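
To make the ROC scores above easier to interpret, here is a minimal sketch of how such a score can be computed, assuming that "ROC score" refers to the area under the ROC curve (AUC). The function name `roc_auc` and the sample values are illustrative, and the pairwise method shown is a simple way to compute the metric rather than the routine used in the project.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

Under this definition, a value of 0.5 corresponds to ranking positives and negatives at random, so validation scores of 0.5285 and 0.5746 are only slightly better than chance.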

CONCLUSION: For this iteration, the SMOTE technique rebalanced the dataset but did not improve the algorithm’s final performance metric. Overall, the Random Forest algorithm achieved the leading ROC scores using the training dataset, but the model failed to perform adequately using the validation dataset. For this dataset, Random Forest should still be considered for further modeling and testing before making it available for production use.

Dataset Used: Insurance Company Benchmark (COIL 2000) Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Insurance+Company+Benchmark+(COIL+2000)

One potential source of performance benchmark: https://www.kaggle.com/uciml/caravan-insurance-challenge

The HTML formatted report can be found here on GitHub.