Regression Model for Metro Interstate Traffic Volume Using Python Take 3

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Metro Interstate Traffic Volume dataset is a regression situation where we are trying to predict the value of a continuous variable.

INTRODUCTION: This dataset captures the hourly measurements of Interstate 94 Westbound traffic volume for MN DoT ATR station 301. The station is roughly midway between Minneapolis and St Paul, MN. The dataset also includes hourly weather and holiday attributes for assessing their impact on traffic volume.

In iteration Take1, we established the baseline mean squared error without much feature engineering. That round of modeling also did not include the date-time and weather description attributes.

In iteration Take2, we included the timestamp feature and observed its effect on the prediction accuracy.
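For context, a minimal sketch of this kind of date-time feature extraction with pandas appears below; the date_time column name matches the UCI dataset, but the exact features the project derived are an assumption.

```python
# A minimal sketch of date-time feature extraction with pandas.
# The date_time column matches the UCI dataset; the derived features
# shown here are illustrative, not the project's exact set.
import pandas as pd

df = pd.read_csv("Metro_Interstate_Traffic_Volume.csv",
                 parse_dates=["date_time"])

# Derive clock and calendar features that capture traffic seasonality.
df["hour"] = df["date_time"].dt.hour
df["day_of_week"] = df["date_time"].dt.dayofweek
df["month"] = df["date_time"].dt.month
```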

In this iteration, we will re-engineer (scale and/or discretize) the weather-related features and observe their effect on the prediction accuracy.
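As a rough illustration, here is a minimal sketch, assuming the UCI column names temp, clouds_all, rain_1h, and snow_1h, of one way to scale and discretize the weather features; the project's actual transformations may differ.

```python
# A minimal sketch of weather feature re-engineering; the column names
# follow the UCI dataset, and the specific transformations are one
# plausible choice, not necessarily the project's.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("Metro_Interstate_Traffic_Volume.csv")

# Scale the continuous weather measurements to zero mean, unit variance.
scaled_cols = ["temp", "clouds_all"]
df[scaled_cols] = StandardScaler().fit_transform(df[scaled_cols])

# Discretize the sparse precipitation readings into simple indicators,
# since rain_1h and snow_1h are zero for most hours.
df["rain_flag"] = (df["rain_1h"] > 0).astype(int)
df["snow_flag"] = (df["snow_1h"] > 0).astype(int)
```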

ANALYSIS: From iteration Take1, the baseline performance of the machine learning algorithms achieved an average RMSE of 2646. Two algorithms (K-Nearest Neighbors and Gradient Boosting) achieved the top RMSE metrics after the first round of modeling. After a series of tuning trials, Gradient Boosting turned in the top overall result and achieved an RMSE metric of 1887. By using the optimized parameters, the Gradient Boosting algorithm processed the test dataset with an RMSE of 1878, which was even better than the prediction from the training data.

From iteration Take2, the performance of the machine learning algorithms achieved an average RMSE of 1559. Two algorithms (Random Forest and Extra Trees) achieved the top RMSE metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result and achieved an RMSE metric of 465. By using the optimized parameters, the Random Forest algorithm processed the test dataset with an RMSE of 461, which was slightly better than the prediction from the training data.

By including the date_time information and related attributes, the machine learning models predicted significantly better, achieving a much lower RMSE.

In the current iteration, the performance of the machine learning algorithms achieved an average RMSE of 977. Two algorithms (Random Forest and Extra Trees) achieved the top RMSE metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result and achieved an RMSE metric of 465. By using the optimized parameters, the Random Forest algorithm processed the test dataset with an RMSE of 462, which was slightly better than the prediction from the training data.

By re-engineering the weather-related features, the average performance of all models improved. However, the changes appeared to have no impact on the performance of the ensemble algorithms, including Random Forest.
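A minimal sketch of the kind of cross-validated tuning and hold-out evaluation described above might look like the following; the parameter grid is hypothetical, and the code keeps only numeric columns for simplicity.

```python
# A minimal sketch of Random Forest tuning with RMSE scoring; the
# parameter grid is hypothetical, not the project's actual trials.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_csv("Metro_Interstate_Traffic_Volume.csv")

# Keep only numeric columns here for simplicity; the project also
# encodes the categorical attributes before modeling.
numeric = df.select_dtypes("number")
X = numeric.drop(columns=["traffic_volume"])
y = numeric["traffic_volume"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

grid = GridSearchCV(
    RandomForestRegressor(random_state=7),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 20]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
grid.fit(X_train, y_train)
print("Best CV RMSE:", -grid.best_score_)
print("Test RMSE:", np.sqrt(np.mean((grid.predict(X_test) - y_test) ** 2)))
```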

CONCLUSION: For this iteration, the Random Forest algorithm achieved the best overall results using the training and testing datasets. For this dataset, Random Forest should be considered for further modeling.

Dataset Used: Metro Interstate Traffic Volume Data Set

Dataset ML Model: Regression with numerical and categorical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume

One potential source of performance benchmarks: https://www.kaggle.com/ramyahr/metro-interstate-traffic-volume

The HTML-formatted report can be found here on GitHub.

Drucker on Knowledge Worker Productivity, Part 1

In his book, Management Challenges for the 21st Century, Peter Drucker analyzed and discussed the new paradigms of management.

Although much of the discussion revolves around the perspective of the organization, these are my takeaways on how we can apply his teachings on our journey as knowledge workers.

In this chapter, Drucker discussed worker productivity for both manual work and knowledge work. In the 20th century, businesses focused on manual workers’ productivity and reaped tremendous benefits.

In the 21st century, Drucker believed, the most valuable asset of any institution, whether for-profit or nonprofit, will be its knowledge workers and their productivity.

According to Drucker, six factors can influence the knowledge worker’s productivity.

1. Knowledge workers’ productivity demands that we ask the question: “What is the task?”

While manual work productivity asks the question of “How can we do something faster and more cheaply,” the question for knowledge work should be “Who is this for and why do we want to do it?”

2. Knowledge workers must own responsibility for their productivity. In other words, knowledge workers must manage themselves with autonomy.

For manual work, bosses and managers bear the primary responsibility for keeping their workers productive. The opposite is true for knowledge work.

3. Continuing innovation must be part of the knowledge work.

For manual productivity, the goal is doing things faster and producing each unit more cheaply. For knowledge workers, our work must be innovative: purposeful and aimed at a leadership position.

4. Knowledge work requires continuous learning, and equally continuous teaching on the knowledge worker’s part.

Manual workers’ productivity relies on obedience and compliance.

5. Quantity of output is not the primary productivity concern for knowledge workers.

For manual productivity, it is mostly about the quantity of output. The quality aspect of manual work is to meet a minimum standard; exceeding that standard is welcome but not essential.

On the other hand, the productivity of knowledge work must aim first at obtaining optimum, if not maximum, quality. Only after achieving the quality goal can we ask the question of quantity or volume. This quality-first posture also means that knowledge workers must think through the definition of quality for their work.

6. Finally, Drucker asserted that the productivity of knowledge workers requires that we treat people as an “asset” rather than a “cost.”

Time Series Model for Australian Resident Population Using Python

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a time series prediction model and document the end-to-end steps using a template. The Australian Resident Population dataset is a time series situation where we are trying to forecast future outcomes based on past data points.

INTRODUCTION: The problem is to forecast the quarterly resident population of Australia. The dataset describes the number of Australian residents (in thousands), measured quarterly from March 1971 to March 1994, for a total of 89 observations. We used the first 75% of the observations for training and testing various models, while holding back the last 25% for validating the final model.

ANALYSIS: The baseline (persistence) prediction for the dataset resulted in an RMSE of 60.17. After performing a grid search for the optimal ARIMA parameters, the final ARIMA non-seasonal order was (0, 2, 1) with a seasonal order of (0, 0, 0, 0). The chosen model processed the validation data with an RMSE of 14.74, which, as expected, was better than the naive model.
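A minimal sketch of this workflow with pmdarima is shown below; the 75/25 split and the (0, 2, 1) order come from the text above, while the surrounding helper code is illustrative rather than the project's actual script.

```python
# A minimal sketch of the persistence baseline and the reported
# ARIMA(0, 2, 1) fit using pmdarima's bundled austres dataset.
import numpy as np
import pmdarima as pm
from pmdarima.datasets import load_austres
from sklearn.metrics import mean_squared_error

data = load_austres()                     # 89 quarterly observations
split = int(len(data) * 0.75)
train, validation = data[:split], data[split:]

# Persistence (naive) baseline: each forecast repeats the prior value.
naive = np.concatenate(([train[-1]], validation[:-1]))
print("Naive RMSE:", np.sqrt(mean_squared_error(validation, naive)))

# Fit the non-seasonal ARIMA(0, 2, 1) order reported in the analysis.
model = pm.ARIMA(order=(0, 2, 1)).fit(train)
forecast = model.predict(n_periods=len(validation))
print("ARIMA RMSE:", np.sqrt(mean_squared_error(validation, forecast)))
```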

CONCLUSION: For this dataset, the chosen ARIMA model achieved a satisfactory result and should be considered for further modeling.

Dataset Used: Quarterly Totals of Australian Resident Population

Dataset ML Model: Time series forecast with numerical attributes

Dataset Reference: https://www.alkaline-ml.com/pmdarima/modules/generated/pmdarima.datasets.load_austres.html

The HTML-formatted report can be found here on GitHub.

What’s in the Fridge?

In his podcast, Akimbo, Seth Godin teaches us how to adopt a posture of possibility, change the culture, and choose to make a difference. Here are my takeaways from the episode.

In this episode, using food refrigeration and handling as an example, Seth discusses systems thinking and the opportunities available to us for better design.

Seth began with the following observations. First, the refrigerator wastes a lot of power every day. Second, the marketing industrial complex has pushed us to shop for more food than we need. Third, we waste a lot of food. Fourth, many of us are overweight; one reason is that we eat out a lot, mindlessly consuming food loaded with fat, salt, and sugar. Fifth, we are unaware of what we are consuming. Finally, the refrigerator is a symbol of the miracle of the food supply chain in our era; the convenience and vast selection available any time we want have also created a lot of waste.

But the system is inefficient. It wastes energy. It wastes time. It lets us fall into the many traps set by the marketing industrial complex, which manipulates us into emotional and mindless eating. The whole food supply chain converges on a single point: the refrigerator in our home.

What would happen if we reinvented the fridge? Here are some ideas.

Number one, we can put scanners in the fridge. The scanner can recognize food items either via a barcode or by their appearance. The refrigerator then knows when we put something in, what it is, and how long it has been there. It can also make a good guess at how much of it there is, how much it weighs, and how much we take out.

Number two, if the refrigerator knows what food we have, it should be able to display the inventory without our needing to open the fridge door. The inventory could sort the items by their projected expiration dates. A smart refrigerator would also recognize the connections between foods: it knows when something is out of balance, for example, too much cereal and not enough milk.

Number three, what if the fridge were linked to our phones and knew we were in the supermarket? It could send us a message suggesting items we should consider buying or restocking.

Number four, if the refrigerator knows what we usually buy, why doesn’t it start suggesting what might be for dinner tonight? It can come up with efficient ways to use up items in the fridge that need to be used. It can find new combinations that will surprise and even delight us based on what we eat. It can propose recipes that match our preferences in terms of how much time we have, who is coming over, or how many calories we have already eaten today.

Furthermore, can our food-buying habits be coordinated with local supermarkets’ procurement and sales activities? Instead of guessing what consumers might buy and overstocking, can our refrigerators provide information to supermarkets and local farmers for just-in-time inventory practices? Since we each consume tens of thousands of dollars’ worth of food a year, can the wholesale food process be more efficient?

There are four factors to keep in mind as we think about the new system. The first factor is the connections that exist between each one of us, our devices, and external organizations.

The second factor is status implications. We constantly think about how we compare to others; our status matters to us. The third factor is convenience, because we are now forever hooked on convenience. We want to do the thing that is faster and easier.

The last factor is metrics. What happens when we turn our fridge into a score, where we could win by being more efficient? We could win by cutting our food bill. We could win by spending less time cooking even better things. Keeping score (gamification) could be a key part of both status and convenience.

The reality of the food business is that people eat every day; it is one of the very few industries where we do not have to create demand. We think these systems are invisible and permanent, but they are not. The real wins will happen when we rewire systems that are so ubiquitous.

Regression Model for Metro Interstate Traffic Volume Using R Take 2

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Metro Interstate Traffic Volume dataset is a regression situation where we are trying to predict the value of a continuous variable.

INTRODUCTION: This dataset captures the hourly measurements of Interstate 94 Westbound traffic volume for MN DoT ATR station 301. The station is roughly midway between Minneapolis and St Paul, MN. The dataset also includes hourly weather and holiday attributes for assessing their impact on traffic volume.

In iteration Take1, we established the baseline mean squared error without much feature engineering. That round of modeling also did not include the date-time and weather description attributes.

In this iteration, we will include the timestamp feature and observe its effect on the prediction accuracy.

ANALYSIS: From iteration Take1, the baseline performance of the machine learning algorithms achieved an average RMSE of 2099. Two algorithms (Random Forest and Gradient Boosting) achieved the top RMSE metrics after the first round of modeling. After a series of tuning trials, Gradient Boosting turned in the top overall result and achieved an RMSE metric of 1895. After applying the optimized parameters, the Gradient Boosting algorithm processed the testing dataset with an RMSE of 1899, which was slightly worse than the prediction from the training data.

In the current iteration, the baseline performance of the machine learning algorithms achieved an average RMSE of 972. Two algorithms (Random Forest and Gradient Boosting) achieved the top RMSE metrics after the first round of modeling. After a series of tuning trials, Gradient Boosting turned in the top overall result and achieved an RMSE metric of 480. After applying the optimized parameters, the Gradient Boosting algorithm processed the testing dataset with an RMSE of 479, which was slightly better than the prediction from the training data.

CONCLUSION: For this iteration, the Gradient Boosting algorithm achieved the best overall training and validation results. For this dataset, the Gradient Boosting algorithm should be considered for further modeling.

Dataset Used: Metro Interstate Traffic Volume Data Set

Dataset ML Model: Regression with numerical and categorical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume

One potential source of performance benchmarks: https://www.kaggle.com/ramyahr/metro-interstate-traffic-volume

The HTML-formatted report can be found here on GitHub.

Web Scraping of RSA Conference USA 2019 Using R

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The web scraping R code leverages the rvest package.

INTRODUCTION: The RSA Conference is a go-to resource for exchanging ideas, learning the latest trends, and finding solutions in the field of cybersecurity. This web scraping script automatically traverses the web page and collects all links to PDF and PPTX documents. The script also downloads the documents as part of the scraping process.

Starting URLs: https://www.rsaconference.com/events/us19/presentations?type=presentations
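The project itself uses R's rvest; as a rough illustration of the same link-harvesting idea, here is a minimal Python sketch with requests and BeautifulSoup. The assumption that the documents appear as plain anchor links ending in .pdf or .pptx is mine, not the page's documented structure.

```python
# A minimal sketch of harvesting and downloading PDF/PPTX links from a
# page; illustrative only, since the real page layout may differ.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

START_URL = ("https://www.rsaconference.com/events/us19/presentations"
             "?type=presentations")

response = requests.get(START_URL, timeout=30)
soup = BeautifulSoup(response.text, "html.parser")

# Collect absolute links to PDF and PPTX documents on the page.
doc_links = [
    urljoin(START_URL, a["href"])
    for a in soup.find_all("a", href=True)
    if a["href"].lower().endswith((".pdf", ".pptx"))
]

# Download each document, naming the file after the last URL segment.
for link in doc_links:
    filename = link.rsplit("/", 1)[-1]
    with open(filename, "wb") as f:
        f.write(requests.get(link, timeout=60).content)
```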

The source code and HTML output can be found here on GitHub.

Two Kinds of System Risk

(From a writer I respect, Seth Godin)

When setting up a system, we need to understand the consequences if the system stops working. Based on the cost of that failure, we can build more resilience into the system.

In most cases, a failure is not catastrophic. If your toaster breaks, it is no big deal; you can wait a few days before having toast again and live with soft bread in the meantime. On the other hand, if you are running a mission to Mars, you will probably be glad you packed extra oxygen tanks, even if carrying them was very costly.

We often make two mistakes when organizing systems:

1) We are overly optimistic about the system's reliability and grossly underestimate the cost of living without it. We might put the current state of our Internet infrastructure in this camp.

2) We are overly pessimistic about the likelihood and cost of failure. This leads us to over-engineer, or to make expensive investments in the name of safety. Putting life vests on airplanes is a good example, as is avoiding the very last typo in a piece of writing. It is also one reason our healthcare costs are so high: the last 0.01 percent is also the most expensive part.

A useful skill in executive decision-making is the ability to describe resilience and the cost of failure objectively, especially when that is hard to do.