Web Scraping of NeurIPS Proceedings Using Python and BeautifulSoup

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The web scraping code was written in Python and leverages the BeautifulSoup module.

INTRODUCTION: The Neural Information Processing Systems (NeurIPS) conference hosts its collection of papers at https://papers.nips.cc/. This web scraping script automatically traverses the listing page and the individual paper pages of the 2015 conference and collects all links to the PDF documents. The script also downloads the PDF documents as part of the scraping process.

Starting URLs: https://papers.nips.cc/book/advances-in-neural-information-processing-systems-28-2015
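
The sketch below illustrates the general approach: fetch the 2015 listing page, follow each paper link, and download every linked PDF. It is a minimal outline under stated assumptions, not the full project code; the "/paper/" URL pattern and the output folder are illustrative guesses.

    # Minimal sketch of the scraping approach using requests and BeautifulSoup.
    # The listing URL comes from this post; the "/paper/" pattern and the
    # output folder are illustrative assumptions.
    import os
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    BASE_URL = "https://papers.nips.cc"
    LISTING_URL = BASE_URL + "/book/advances-in-neural-information-processing-systems-28-2015"
    OUTPUT_DIR = "pdfs"  # assumed output folder

    os.makedirs(OUTPUT_DIR, exist_ok=True)
    listing = BeautifulSoup(requests.get(LISTING_URL).text, "html.parser")

    # Each entry on the listing page links to an individual paper page.
    for link in listing.select("a[href]"):
        if "/paper/" not in link["href"]:  # assumed pattern for paper pages
            continue
        paper_url = urljoin(BASE_URL, link["href"])
        paper = BeautifulSoup(requests.get(paper_url).text, "html.parser")
        # The paper page carries the link to the PDF document itself.
        for pdf in paper.select("a[href$='.pdf']"):
            pdf_url = urljoin(BASE_URL, pdf["href"])
            filename = os.path.join(OUTPUT_DIR, pdf_url.rsplit("/", 1)[-1])
            with open(filename, "wb") as f:
                f.write(requests.get(pdf_url).content)

A short pause between requests (for example, time.sleep) would keep the crawl polite toward the site.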

The source code and JSON output can be found here on GitHub.

See the People in Front of You

(From a writer I like and respect, Seth Godin)

When the people we serve show up, when they offer us their attention and trust, we need to work hard to understand two things:

(1) Who they are right now: What do they fear? What do they believe? What do they need?

(2) Who they can become: Which doors can we open for them? How can we support them? What accomplishments or footprints would they want to leave behind?

Binary Classification Model for Caravan Insurance Marketing Using Python Take 2

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Insurance Company Benchmark dataset is a classic binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: This dataset, used in the CoIL 2000 Challenge, contains information on customers of an insurance company. The data consist of 86 variables and include product usage data and socio-demographic data derived from zip codes.

The data were supplied by the Dutch data mining company Sentient Machine Research and are based on a real-world business problem. The training set contains over 5000 customer records, including whether each customer holds a caravan insurance policy. A test dataset contains another 4000 customers, whose information is used to evaluate the machine learning models.

The insurance organization collected the data to answer the following question: Can we predict who would be interested in buying a caravan insurance policy and give an explanation why?

In the Take 1 iteration, the algorithms achieved high accuracy but exhibited strong bias because of the imbalanced dataset. For this iteration, we examine the feasibility of using the SMOTE technique to balance the training dataset.
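
As a reference point, below is a minimal sketch of how SMOTE resampling looks with the imbalanced-learn library. The synthetic dataset stands in for the prepared CoIL 2000 training split, and the class weights are illustrative.

    # Minimal SMOTE sketch using imbalanced-learn. The synthetic dataset is a
    # stand-in for the prepared training split; the imbalance is illustrative.
    from collections import Counter

    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification

    X_train, y_train = make_classification(
        n_samples=5000, weights=[0.94, 0.06], random_state=888)
    print("Before:", Counter(y_train))    # heavily imbalanced classes

    # SMOTE synthesizes new minority-class samples; apply it to the
    # training data only, never to the validation or test data.
    X_balanced, y_balanced = SMOTE(random_state=888).fit_resample(X_train, y_train)
    print("After:", Counter(y_balanced))  # classes now roughly balanced

Applying SMOTE only to the training split matters because resampling the validation data would leak synthetic samples into the performance estimate.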

ANALYSIS: From the previous Take 1 iteration, the baseline performance of the ten algorithms achieved an average F1_Micro score of 0.9260. Two algorithms, Logistic Regression and Support Vector Machine, achieved the top F1_Micro scores after the first round of modeling. After a series of tuning trials, Support Vector Machine turned in the top result using the training data, with an F1_Micro score of 0.9402. With the optimized tuning parameters, the Support Vector Machine algorithm processed the validation dataset with an F1_Micro score of 0.9405, slightly better than its result on the training data.

From the current iteration, the baseline performance of the eight algorithms achieved an average F1_Micro score of 0.9326. Two algorithms, Random Forest and Extra Trees, achieved the top F1_Micro scores after the first round of modeling. After a series of tuning trials, Random Forest turned in the top result using the training data, with an F1_Micro score of 0.9595. With the optimized tuning parameters, the Random Forest algorithm processed the validation dataset with an F1_Micro score of 0.9165, noticeably worse than its result on the training data and perhaps a sign of overfitting.
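
For readers following along in scikit-learn, the baseline-then-tune pattern described above might look like the sketch below. The grid values and the synthetic data are illustrative assumptions, not the project's actual settings.

    # Sketch of cross-validated baselining and tuning, scored with F1_micro.
    # Data and parameter grid are illustrative assumptions.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, cross_val_score

    X, y = make_classification(n_samples=5000, random_state=888)

    # Baseline: 10-fold cross-validation with an untuned model.
    baseline = cross_val_score(RandomForestClassifier(random_state=888),
                               X, y, cv=10, scoring="f1_micro")
    print("Baseline F1_micro: %.4f" % baseline.mean())

    # Tuning: a small grid search over key Random Forest parameters.
    grid = GridSearchCV(RandomForestClassifier(random_state=888),
                        param_grid={"n_estimators": [100, 300],
                                    "max_depth": [None, 10]},
                        cv=10, scoring="f1_micro")
    grid.fit(X, y)
    print("Tuned F1_micro: %.4f with %s" % (grid.best_score_, grid.best_params_))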

CONCLUSION: For this iteration, the SMOTE technique balanced the training dataset but did not improve the algorithm’s final performance metric. Overall, the Random Forest algorithm achieved the leading F1_Micro scores on the training dataset, but the model failed to perform adequately on the validation dataset. For this dataset, Random Forest should still be considered for further modeling and testing before being made available for production use.

Dataset Used: Insurance Company Benchmark (COIL 2000) Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Insurance+Company+Benchmark+(COIL+2000)

One potential source of performance benchmark: https://www.kaggle.com/uciml/caravan-insurance-challenge

The HTML formatted report can be found here on GitHub.

Drucker on Recording Our Time

In his book, The Essential Drucker: The Best of Sixty Years of Peter Drucker’s Essential Writings on Management, Peter Drucker analyzed the ways that management practices and principles affect the performance of organizations, individuals, and society. The book covers the basic principles of management and gives professionals the tools to perform the tasks that the environment of tomorrow will require of them.

These are my takeaways from reading the book.

In the chapter “Know Your Time,” Drucker discussed the three-step time management process as the foundation of executive effectiveness.

  • Recording time
  • Managing time
  • Consolidating time

When it comes to manual work, both skilled and unskilled, time management practice generally does not matter greatly. In other words, for manual work the difference between time-use and time-waste is primarily a matter of efficiency and cost.

For knowledge work, the use of time matters a great deal. For the knowledge worker, and especially for the executive, the difference between time-use and time-waste is effectiveness and results.

Therefore, the first step toward effectiveness is to record actual time-use. Drucker asserted that the specific method used for the recording is not as critical as getting it done. Just as importantly, we should record our time usage in real time as much as possible, rather than after the fact from memory.

Drucker suggested that many effective people keep a time log continually and review it every month. At a minimum, effective executives should run a log-recording exercise for three to four weeks at a stretch, twice a year or so, on a regular schedule. After each sampling, we should reflect briefly and perhaps rethink and rework our schedule. Drucker observed that many of us invariably find we have “drifted” into wasting our time on trivial matters.

Fortunately, time-use does improve with practice. However, only constant efforts at managing time can prevent drifting and increase effectiveness. Therefore, systematic time management is the next step. That is, we must find the nonproductive, time-wasting activities and get rid of them if we possibly can.

Drucker suggested we ask ourselves the following diagnostic questions.

  • Identify and eliminate the things that need not be done at all, the things that are purely a waste of time and contribute no results. In other words, we ask, “What would happen if this were not done at all?” If the answer is, “Nothing would happen,” the conclusion is obvious.
  • We should ask, “Which of the activities on my time log could be done by somebody else just as well, if not better?” Many people would customarily use the term “delegation,” but that is a misnomer according to Drucker. We should strive to do the tasks we consider important, want to do, and are committed to doing. The only way we can get to the important things is by getting rid of anything that can be done by somebody else; this is not delegation so much as getting to our own work, and that is a major improvement in effectiveness.
  • Can we eliminate the time-waste we impose on others, which is largely under our control? Often, the way an executive does productive work may still be a major waste of somebody else’s time. Effective people must learn to ask systematically and without coyness, “What do I do that wastes your time without contributing to your effectiveness?” To ask such a question, and to ask it without being afraid of the truth, is a mark of the effective executive.

Many of us know all about these unproductive and unnecessary time demands; yet we are afraid to prune them. We are afraid to cut out something important by mistake. But such a mistake, Drucker believed, can be speedily corrected. If we prune too harshly, we usually find out quickly enough.

Web Scraping of Machine Learning Mastery Blog Entries Using Python Take 2

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The web scraping code was written in Python 3 and leverages the Scrapy framework maintained by Scrapinghub.

INTRODUCTION: Dr. Jason Brownlee’s Machine Learning Mastery hosts its tutorial lessons at https://machinelearningmastery.com/blog. The purpose of this exercise is to practice web scraping using Scrapy by gathering the blog entries from Machine Learning Mastery’s RSS feed. This iteration of the script automatically traverses the RSS feed to capture all blog entries and store all captured information in a JSON output file.

Starting URLs: https://machinelearningmastery.com/blog
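
A minimal Scrapy spider for the RSS-feed approach might look like the sketch below. The feed URL (the standard WordPress /feed/ path) and the captured field names are assumptions, not the project's exact code.

    # Minimal Scrapy spider sketch for scraping the RSS feed.
    # The feed URL and field names are illustrative assumptions.
    import scrapy

    class BlogFeedSpider(scrapy.Spider):
        name = "mlm_feed"
        start_urls = ["https://machinelearningmastery.com/feed/"]  # assumed feed URL

        def parse(self, response):
            # RSS is XML; dropping namespaces lets plain tag names work in XPath.
            response.selector.remove_namespaces()
            for item in response.xpath("//item"):
                yield {
                    "title": item.xpath("title/text()").get(),
                    "link": item.xpath("link/text()").get(),
                    "published": item.xpath("pubDate/text()").get(),
                }

Running it with scrapy runspider feed_spider.py -o entries.json would write the captured entries to a JSON file, matching the output described above.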

The source code and JSON output can be found here on GitHub.

Drucker on Knowing Our Time

In his book, The Essential Drucker: The Best of Sixty Years of Peter Drucker’s Essential Writings on Management, Peter Drucker analyzed the ways that management practices and principles affect the performance of organizations, individuals, and society. The book covers the basic principles of management and gives professionals the tools to perform the tasks that the environment of tomorrow will require of them.

These are my takeaways from reading the book.

In the chapter “Know Your Time,” Drucker discussed how effective knowledge workers should manage their time.

Effective people know that time is the limiting factor. We cannot rent, hire, buy, or somehow obtain more time.

Time is a unique resource where the supply is inelastic. No matter how high the demand, the supply will not increase. Time is also perishable and cannot be stored like many other resources.

Within limits, we can substitute one resource for another. We can even substitute capital for human labor sometimes, but similar substitution does not work with time.

Since all work takes place in time and uses up time, Drucker believed that nothing else distinguishes effective executives as much as their tender loving care of time.

Despite its importance, Drucker also believed that we are ill-equipped to manage our own time. We are as likely to grossly underrate the time something takes as to grossly overrate it.

As knowledge workers, we face our own challenges with the use of time. Most knowledge-work tasks require, for minimum effectiveness, a large chunk of time. When we spend time in one stretch that is less than this minimum, we rarely accomplish anything useful and must begin all over again.

Moreover, small dribs and drabs of time will not be enough, even if the aggregated sum adds up to an impressive number of hours. This is particularly true of time spent working with people, which is a central task in the work of the knowledge worker. People are time-consumers, and most people are time-wasters.

Before effective people can manage their time, they must first know where it goes. We start by finding out where our time goes; then we attempt to manage our time and cut back the unproductive demands on it. Finally, we consolidate our “discretionary” time into the largest possible contiguous units for effective use.

This three-step process, as Drucker believed, is the foundation of executive effectiveness:

  • Recording time
  • Managing time
  • Consolidating time

Binary Classification Model for Caravan Insurance Marketing Using R Take 1

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Insurance Company Benchmark dataset is a classic binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: This dataset, used in the CoIL 2000 Challenge, contains information on customers of an insurance company. The data consist of 86 variables and include product usage data and socio-demographic data derived from zip codes.

The data were supplied by the Dutch data mining company Sentient Machine Research and are based on a real-world business problem. The training set contains over 5000 customer records, including whether each customer holds a caravan insurance policy. A test dataset contains another 4000 customers, whose information is used to evaluate the machine learning models.

The insurance organization collected the data to answer the following question: Can we predict who would be interested in buying a caravan insurance policy and give an explanation why?

ANALYSIS: The baseline performance of the seven algorithms achieved an average ROC score of 0.6965. Two algorithms, Decision Tree and Random Forest, achieved the top two ROC scores after the first round of modeling. After a series of tuning trials, Random Forest yielded the top result using the training data, with a ROC score of 0.7159. With the optimized tuning parameters, the Random Forest algorithm processed the validation dataset with a ROC score of 0.5285, significantly below its result on the training data.
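
The project itself was implemented in R; purely as an illustration of the evaluation pattern described above (train on one split, score ROC AUC on a held-out split), a scikit-learn sketch with synthetic stand-in data might look like this.

    # Illustrative Python sketch of the train-versus-validation ROC comparison.
    # The data are synthetic stand-ins; the original project used R.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, weights=[0.94, 0.06], random_state=888)
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=888)

    model = RandomForestClassifier(random_state=888).fit(X_train, y_train)
    # A large gap between these two scores is the overfitting signal
    # discussed above.
    print("Training ROC AUC:   %.4f" %
          roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]))
    print("Validation ROC AUC: %.4f" %
          roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))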

CONCLUSION: For this iteration, the Random Forest algorithm achieved the leading ROC scores on both the training and validation datasets. Even so, the model does not appear to be adequate for production use on this dataset. Further modeling and testing are recommended as the next step.

Dataset Used: Insurance Company Benchmark (COIL 2000) Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Insurance+Company+Benchmark+(COIL+2000)

One potential source of performance benchmark: https://www.kaggle.com/uciml/caravan-insurance-challenge

The HTML formatted report can be found here on GitHub.

Web Scraping of Machine Learning Mastery Blog Entries Using Python Take 1

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The web scraping code was written in Python 3 and leverages the Scrapy framework maintained by Scrapinghub.

INTRODUCTION: Dr. Jason Brownlee’s Machine Learning Mastery hosts its tutorial lessons at https://machinelearningmastery.com/blog. The purpose of this exercise is to practice web scraping using Scrapy by gathering the blog entries from Machine Learning Mastery’s web pages. This iteration of the script automatically traverses the web pages to capture all blog entries and store all captured information in a JSON output file.

Starting URLs: https://machinelearningmastery.com/blog
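
A minimal Scrapy spider for the page-crawling approach might look like the sketch below; the CSS selectors are assumptions based on typical WordPress blog markup rather than the project's exact code.

    # Minimal Scrapy spider sketch for crawling the blog's listing pages.
    # The CSS selectors are illustrative assumptions.
    import scrapy

    class BlogPageSpider(scrapy.Spider):
        name = "mlm_blog"
        start_urls = ["https://machinelearningmastery.com/blog"]

        def parse(self, response):
            # Capture each blog entry on the current listing page.
            for post in response.css("article"):       # assumed post container
                yield {
                    "title": post.css("h2 a::text").get(),
                    "link": post.css("h2 a::attr(href)").get(),
                }
            # Follow the pagination link to traverse every listing page.
            next_page = response.css("a.next::attr(href)").get()  # assumed selector
            if next_page:
                yield response.follow(next_page, callback=self.parse)

As with the feed-based version, running scrapy runspider blog_spider.py -o entries.json would collect all captured entries into a JSON file.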

The source code and JSON output can be found here on GitHub.