Web Scraping of Data.gov Dataset Catalog Using Python and BeautifulSoup

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The Python web scraping code leverages the BeautifulSoup module.

INTRODUCTION: Data.gov is a government data repository website managed and hosted by the U.S. General Services Administration. The purpose of this exercise is to practice web scraping by gathering the dataset entries from Data.gov’s web pages. This iteration of the script automatically traverses the web pages, captures all dataset entries, and stores the captured information in a JSON output file.

Starting URLs: https://catalog.data.gov/dataset
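As a rough sketch of the approach, the snippet below extracts dataset titles and links from one catalog page with BeautifulSoup. The `dataset-heading` CSS class is an assumption about catalog.data.gov's markup and may not match the current page:

```python
# Hypothetical sketch: parse dataset entries from a catalog.data.gov page.
# The "dataset-heading" class is an assumption about the page markup.
import json
from bs4 import BeautifulSoup

def parse_catalog_page(html):
    """Return a list of {"title", "url"} records, one per dataset entry."""
    soup = BeautifulSoup(html, "html.parser")
    return [{"title": a.get_text(strip=True), "url": a.get("href")}
            for a in soup.select("h3.dataset-heading a")]

# A miniature stand-in for one page of the catalog:
sample = """
<h3 class="dataset-heading"><a href="/dataset/example">Example Dataset</a></h3>
"""
print(json.dumps(parse_catalog_page(sample), indent=2))
```

In the full script, the same parser would be applied to each page while following the catalog's pagination links, appending every record to the JSON output file.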

The source code and HTML output can be found here on GitHub.

Binary Classification Deep Learning Model for Cats and Dogs Using Keras Take 6

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The “Cats and Dogs” dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Web services are often protected with a challenge that’s supposed to be easy for people to solve, but difficult for computers. Such a challenge is often called a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) or HIP (Human Interactive Proof). ASIRRA (Animal Species Image Recognition for Restricting Access) is a HIP that works by asking users to identify photographs of cats and dogs. This task is difficult for computers, but studies have shown that people can accomplish it quickly and accurately.

The current literature suggests that machine classifiers can score above 80% accuracy on this task. Therefore, ASIRRA is no longer considered safe from attack. Kaggle created a contest to benchmark the latest computer vision and deep learning approaches to this problem. The training archive contains 25,000 images of dogs and cats. We will need to train our algorithm on these files and predict the correct labels for the test dataset.

In iteration Take1, we constructed a simple VGG convolutional model with 1 VGG block to classify the images. This model serves as the baseline for the future iterations of modeling.
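For reference, a 1-VGG-block baseline along these lines might be defined as below; the 200x200 input size, filter counts, and optimizer are illustrative assumptions rather than the report's exact configuration:

```python
# Sketch of a 1-block VGG baseline for binary cat/dog classification.
# Input size and layer widths are illustrative assumptions.
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

def define_vgg1_model(input_shape=(200, 200, 3)):
    model = Sequential([
        Input(shape=input_shape),
        Conv2D(32, (3, 3), activation="relu", padding="same"),  # one VGG block
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(128, activation="relu"),
        Dense(1, activation="sigmoid"),  # cat vs. dog probability
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

The later takes extend this pattern by stacking additional conv/pool blocks with doubled filter counts.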

In iteration Take2, we constructed a simple VGG convolutional model with 2 VGG blocks to classify the images. The additional modeling enabled us to improve our baseline model.

In iteration Take3, we constructed a simple VGG convolutional model with 3 VGG blocks to classify the images. The additional modeling enabled us to improve our baseline model further.

In iteration Take4, we applied dropout to our VGG-3 model. The addition of the dropout layers improved our model.

In iteration Take5, we applied image data augmentation to our VGG-3 model. The addition of the image data augmentation improved our model.

In this iteration, we will apply both dropout layers and image data augmentation to our VGG-3 model. We hope the addition of both techniques will further improve our model.
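A sketch of this combined architecture appears below. It uses Keras preprocessing layers for augmentation (a newer stand-in for the ImageDataGenerator-style pipeline typically used at the time); the input size, dropout rates, and augmentation ranges are assumptions:

```python
# Sketch: VGG-3 with dropout plus in-model image augmentation.
# Dropout rates and shift/flip ranges are illustrative assumptions.
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, MaxPooling2D, Dropout, Flatten,
                                     Dense, RandomFlip, RandomTranslation,
                                     Rescaling)

def define_vgg3_dropout_aug_model(input_shape=(200, 200, 3)):
    model = Sequential([
        Input(shape=input_shape),
        # augmentation layers are active only during training
        RandomFlip("horizontal"),
        RandomTranslation(0.1, 0.1),
        Rescaling(1.0 / 255),
        Conv2D(32, (3, 3), activation="relu", padding="same"),
        MaxPooling2D((2, 2)),
        Dropout(0.2),
        Conv2D(64, (3, 3), activation="relu", padding="same"),
        MaxPooling2D((2, 2)),
        Dropout(0.2),
        Conv2D(128, (3, 3), activation="relu", padding="same"),
        MaxPooling2D((2, 2)),
        Dropout(0.2),
        Flatten(),
        Dense(128, activation="relu"),
        Dropout(0.5),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

The design choice here is that dropout combats overfitting by randomly zeroing activations, while augmentation combats it by enlarging the effective training set; the two are complementary.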

ANALYSIS: In iteration Take1, the model achieved an accuracy score of 95.55% after training for 20 epochs. The same model, however, processed the test dataset with an accuracy of only 72.99%. Reviewing the plot, we can see that the model began to overfit the training dataset after only ten epochs. We will need to explore other modeling approaches to reduce the overfitting.

In iteration Take2, the model achieved an accuracy score of 97.94% after training for 20 epochs. The same model, however, processed the test dataset with an accuracy of only 75.67%. Reviewing the plot, we can see that the model began to overfit the training dataset after only seven epochs. We will need to explore other modeling approaches to reduce the overfitting.

In iteration Take3, the model achieved an accuracy score of 97.14% after training for 20 epochs. The same model, however, processed the test dataset with an accuracy of only 80.19%. Reviewing the plot, we can see that the model began to overfit the training dataset after only six epochs. We will need to explore other modeling approaches to reduce the overfitting.

In iteration Take4, the model achieved an accuracy score of 86.92% after training for 50 epochs. The same model processed the test dataset with an accuracy of 81.04%. The plot from this iteration indicated that adding dropout layers can be a good tactic for improving the model’s predictive performance.

In iteration Take5, the model achieved an accuracy score of 87.52% after training for 50 epochs. The same model processed the test dataset with an accuracy of 85.12%. The plot from this iteration indicated that adding image data augmentation can be a good tactic for improving the model’s predictive performance.

In this iteration, the Take6 model achieved an accuracy score of 88.60% after training for 200 epochs. The same model processed the test dataset with an accuracy of 87.25%. The plot from this iteration indicated that combining dropout layers with image data augmentation can produce a low-variance model that does not overfit early in the training process.

CONCLUSION: For this dataset, the model built using Keras and TensorFlow did not achieve a result comparable to the Kaggle competition benchmarks. We should explore and consider additional and different modeling approaches.

Dataset Used: Cats and Dogs Dataset

Dataset ML Model: Binary classification with image data

Dataset Reference: https://www.microsoft.com/en-us/download/details.aspx?id=54765

One potential source of performance benchmarks: https://www.kaggle.com/c/dogs-vs-cats/overview

The HTML formatted report can be found here on GitHub.

Time Series Model for Measles Cases in New York Using Python

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a time series prediction model and document the end-to-end steps using a template. The Measles Cases in New York City dataset is a time series situation where we are trying to forecast future outcomes based on past data points.

INTRODUCTION: The problem is to forecast the monthly number of measles cases in New York City. The dataset describes a time series of measles cases over 44 years (1928-1972) and contains 534 observations. We used the first 80% of the observations for training and testing various models while holding back the remaining observations for validating the final model.

ANALYSIS: The baseline prediction (or persistence) for the dataset resulted in an RMSE of 304. After performing a grid search for the optimal ARIMA parameters, the final ARIMA non-seasonal order was (2, 0, 2) with a seasonal order of (2, 0, 1, 12). Furthermore, the chosen model processed the validation data with an RMSE of 325, which was no better than the baseline model.

CONCLUSION: For this dataset, the chosen ARIMA model did not achieve a satisfactory result. We should explore different sets of ARIMA parameters and conduct further modeling activities.

Dataset Used: Monthly reported number of cases of measles, New York City, 1928-1972

Dataset ML Model: Time series forecast with numerical attributes

Dataset Reference: Rob Hyndman and Yangzhuoran Yang (2018). tsdl: Time Series Data Library. v0.1.0. https://pkg.yangzhuoranyang.com/tsdl/.

The HTML formatted report can be found here on GitHub.

Binary Classification Deep Learning Model for Cats and Dogs Using Keras Take 5

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The “Cats and Dogs” dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Web services are often protected with a challenge that’s supposed to be easy for people to solve, but difficult for computers. Such a challenge is often called a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) or HIP (Human Interactive Proof). ASIRRA (Animal Species Image Recognition for Restricting Access) is a HIP that works by asking users to identify photographs of cats and dogs. This task is difficult for computers, but studies have shown that people can accomplish it quickly and accurately.

The current literature suggests that machine classifiers can score above 80% accuracy on this task. Therefore, ASIRRA is no longer considered safe from attack. Kaggle created a contest to benchmark the latest computer vision and deep learning approaches to this problem. The training archive contains 25,000 images of dogs and cats. We will need to train our algorithm on these files and predict the correct labels for the test dataset.

In iteration Take1, we constructed a simple VGG convolutional model with 1 VGG block to classify the images. This model serves as the baseline for the future iterations of modeling.

In iteration Take2, we constructed a simple VGG convolutional model with 2 VGG blocks to classify the images. The additional modeling enabled us to improve our baseline model.

In iteration Take3, we constructed a simple VGG convolutional model with 3 VGG blocks to classify the images. The additional modeling enabled us to improve our baseline model further.

In iteration Take4, we applied dropout to our VGG-3 model. The addition of the dropout layers improved our model.

In this iteration, we will apply image data augmentation to our VGG-3 model. We hope the addition of image data augmentation will improve our model.
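As an illustration of what augmentation does, the snippet below applies random flips and shifts to a batch of images using Keras preprocessing layers (a newer stand-in for the ImageDataGenerator API commonly used at the time); the batch shape and augmentation ranges are assumptions:

```python
# Sketch: apply random horizontal flips and 10% shifts to an image batch.
import numpy as np
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import RandomFlip, RandomTranslation

augment = Sequential([
    Input(shape=(200, 200, 3)),
    RandomFlip("horizontal"),
    RandomTranslation(height_factor=0.1, width_factor=0.1),
])

batch = np.random.rand(4, 200, 200, 3).astype("float32")
augmented = augment(batch, training=True)  # augmentation only runs in training mode
print(augmented.shape)  # shape is unchanged; pixel content is randomized
```

Because each epoch sees a differently perturbed copy of every image, the model effectively trains on a larger dataset, which is why augmentation reduces overfitting.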

ANALYSIS: In iteration Take1, the model achieved an accuracy score of 95.55% after training for 20 epochs. The same model, however, processed the test dataset with an accuracy of only 72.99%. Reviewing the plot, we can see that the model began to overfit the training dataset after only ten epochs. We will need to explore other modeling approaches to reduce the overfitting.

In iteration Take2, the model achieved an accuracy score of 97.94% after training for 20 epochs. The same model, however, processed the test dataset with an accuracy of only 75.67%. Reviewing the plot, we can see that the model began to overfit the training dataset after only seven epochs. We will need to explore other modeling approaches to reduce the overfitting.

In iteration Take3, the model achieved an accuracy score of 97.14% after training for 20 epochs. The same model, however, processed the test dataset with an accuracy of only 80.19%. Reviewing the plot, we can see that the model began to overfit the training dataset after only six epochs. We will need to explore other modeling approaches to reduce the overfitting.

In iteration Take4, the model achieved an accuracy score of 86.92% after training for 50 epochs. The same model processed the test dataset with an accuracy of 81.04%. The plot from this iteration indicated that adding dropout layers can be a good tactic for improving the model’s predictive performance.

In this iteration, the Take5 model achieved an accuracy score of 87.52% after training for 50 epochs. The same model processed the test dataset with an accuracy of 85.12%. The plot from this iteration indicated that adding image data augmentation can be a good tactic for improving the model’s predictive performance.

CONCLUSION: For this dataset, the model built using Keras and TensorFlow did not achieve a result comparable to the Kaggle competition benchmarks. We should explore and consider additional and different modeling approaches.

Dataset Used: Cats and Dogs Dataset

Dataset ML Model: Binary classification with image data

Dataset Reference: https://www.microsoft.com/en-us/download/details.aspx?id=54765

One potential source of performance benchmarks: https://www.kaggle.com/c/dogs-vs-cats/overview

The HTML formatted report can be found here on GitHub.

Binary Classification Deep Learning Model for Cats and Dogs Using Keras Take 4

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The “Cats and Dogs” dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: Web services are often protected with a challenge that’s supposed to be easy for people to solve, but difficult for computers. Such a challenge is often called a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) or HIP (Human Interactive Proof). ASIRRA (Animal Species Image Recognition for Restricting Access) is a HIP that works by asking users to identify photographs of cats and dogs. This task is difficult for computers, but studies have shown that people can accomplish it quickly and accurately.

The current literature suggests that machine classifiers can score above 80% accuracy on this task. Therefore, ASIRRA is no longer considered safe from attack. Kaggle created a contest to benchmark the latest computer vision and deep learning approaches to this problem. The training archive contains 25,000 images of dogs and cats. We will need to train our algorithm on these files and predict the correct labels for the test dataset.

In iteration Take1, we constructed a simple VGG convolutional model with 1 VGG block to classify the images. This model serves as the baseline for the future iterations of modeling.

In iteration Take2, we constructed a simple VGG convolutional model with 2 VGG blocks to classify the images. The additional modeling enabled us to improve our baseline model.

In iteration Take3, we constructed a simple VGG convolutional model with 3 VGG blocks to classify the images. The additional modeling enabled us to improve our baseline model further.

In this iteration, we will apply dropout to our VGG-3 model. We hope the addition of the dropout layers will improve our model.

ANALYSIS: In iteration Take1, the model achieved an accuracy score of 95.55% after training for 20 epochs. The same model, however, processed the test dataset with an accuracy of only 72.99%. Reviewing the plot, we can see that the model began to overfit the training dataset after only ten epochs. We will need to explore other modeling approaches to reduce the overfitting.

In iteration Take2, the model achieved an accuracy score of 97.94% after training for 20 epochs. The same model, however, processed the test dataset with an accuracy of only 75.67%. Reviewing the plot, we can see that the model began to overfit the training dataset after only seven epochs. We will need to explore other modeling approaches to reduce the overfitting.

In iteration Take3, the model achieved an accuracy score of 97.14% after training for 20 epochs. The same model, however, processed the test dataset with an accuracy of only 80.19%. Reviewing the plot, we can see that the model began to overfit the training dataset after only six epochs. We will need to explore other modeling approaches to reduce the overfitting.

In this iteration, the Take4 model achieved an accuracy score of 86.92% after training for 50 epochs. The same model processed the test dataset with an accuracy of 81.04%. The plot from this iteration indicated that adding dropout layers can be a good tactic for improving the model’s predictive performance.

CONCLUSION: For this dataset, the model built using Keras and TensorFlow did not achieve a result comparable to the Kaggle competition benchmarks. We should explore and consider additional and different modeling approaches.

Dataset Used: Cats and Dogs Dataset

Dataset ML Model: Binary classification with image data

Dataset Reference: https://www.microsoft.com/en-us/download/details.aspx?id=54765

One potential source of performance benchmarks: https://www.kaggle.com/c/dogs-vs-cats/overview

The HTML formatted report can be found here on GitHub.

Seth Godin Akimbo: Operating Systems

In his Akimbo podcast, Seth Godin teaches us how to adopt a posture of possibility, change the culture, and choose to make a difference. Here are my takeaways from the episode.

In this podcast, Seth discusses the purposes of an operating system and what we need to do to make it work for us.

An operating system is a series of rules, approaches, and ways that pieces of software can work with one another. If we are going to create an operating system that we want others to adopt and use, we take on a lot of responsibility. If we need to work with an operating system, we need to understand how it works.

Cities are one of the oldest operating systems in the history of humanity. A city is built on a series of rules. Without those rules, the citizens would find it very difficult to live within that city.

Stewart Brand, the founding editor of the Whole Earth Catalog, pointed out that, if we look at maps of Boston from 1750, 1850, and 1950, the buildings all changed. Interestingly, the roads remained largely the same. The roads connect all the buildings to one another and define the operating system of the city.

If we were to create an operating system for something, we would be creating a platform into which others can plug their ideas and make those ideas come to life or become better. Like many other successful ideas, a successful operating system makes a significant profit by defining the rules.

There are three ways an operating system could be defined. It can be closed. In a closed operating system, the maker of the system makes all the rules like Apple’s iOS. It can be completely open, like the Linux operating system, where we can see every line of code or even compile our version of it. The third way is somewhere in between the first two.

For an operating system like a city, many rules are not governed by nature. People need to come together to define how a city works. Many decisions for a city are related to public works. Transportation is a critical decision area for many cities as they try to determine where to build roads for cars and tracks for trains. The citizens of the city then begin to make choices.

For many technologies we work with, we do not have many choices of operating systems. If we do not like a piece of software, we can buy another. But once we have committed to an operating system, it becomes much more difficult to separate from it. Most operating systems are closed by nature.

Systems do not last forever as other newer systems and technologies continue to push forward and impact them. Open systems tend to be more flexible and resilient. Closed systems tend to be less flexible and more prone to be hacked or disrupted by more clever or superior solutions.

What we need to do is take a hard look at the invisible operating systems all around us. The operating systems of our future could very well be creeping up around us. If we do not see them, we cannot define them or push them to be better. Or worse, those operating systems may not have our best interests at heart, and things could be done to us without our being able to make another choice.

Do We Want to Reach the Optimal Point?

(From a writer I respect, Seth Godin)

Once capable people start measuring something, there is pressure to make it better. And once it reaches the highest level, it becomes the optimal state.

But that may not be the real goal.

Should we also consider flexibility?

Perhaps we can place more value on things that are delightful, stress-free, or more reliable.

The optimal point is ultimately colorless. It leaves no room for other feelings, such as joy.

Web Scraping of Machine Learning Mastery Blog Using Python and Selenium

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The Python web scraping code leverages the Selenium module.

INTRODUCTION: Dr. Jason Brownlee’s Machine Learning Mastery hosts its tutorial lessons at https://machinelearningmastery.com/blog. The purpose of this exercise is to practice web scraping by gathering the blog entries from Machine Learning Mastery’s web pages. This iteration of the script automatically traverses the web pages, captures all blog entries, and stores the captured information in a JSON output file.

Starting URLs: https://machinelearningmastery.com/blog
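A sketch of the traversal logic is below. The CSS selectors and helper names are assumptions about the blog's markup, not verified against the live site, and running the scraper requires a local WebDriver (e.g. chromedriver):

```python
# Sketch: paginate through the blog with Selenium and save entries as JSON.
# CSS selectors ("article h2 a", "a.next") are assumptions about the markup.
import json

def to_record(title, url):
    """Normalize one blog entry into the JSON-ready shape."""
    return {"title": title.strip(), "url": url}

def save_entries(entries, path="blog_entries.json"):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(entries, f, indent=2)

def scrape_blog(start_url="https://machinelearningmastery.com/blog", max_pages=3):
    # Imported here so the pure helpers above work without a browser installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # assumes chromedriver is on PATH
    entries = []
    try:
        driver.get(start_url)
        for _ in range(max_pages):
            for link in driver.find_elements(By.CSS_SELECTOR, "article h2 a"):
                entries.append(to_record(link.text, link.get_attribute("href")))
            older = driver.find_elements(By.CSS_SELECTOR, "a.next")
            if not older:
                break  # no further pages to traverse
            older[0].click()
    finally:
        driver.quit()
    return entries
```

Unlike the BeautifulSoup approach, Selenium drives a real browser, which also handles any JavaScript-rendered content on the page.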

The source code and HTML output can be found here on GitHub.