"Where Is the Hard Part?"

(From a writer I respect, Seth Godin)

It is a simple question, but one that is often ignored, as if ignoring it would make the problem go away.

Everything worth doing has a hard part. If it were not hard, it would have been done already.

We can hope that if we allocate more resources and concentrate our energy, the hard part will become easier. That is the point of our work: to shrink the hard part.

But if we refuse to ask and answer this question, how can we possibly focus on what matters?

Focusing on the parts that are not hard usually feels more fun and relaxed. Or we pretend the hard part is easy.

I believe it is better, when we decide a piece of work is worth doing, to earnestly do well the part that deserves our effort.

Web Scraping of O’Reilly Artificial Intelligence Conference 2019 London Using R

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The web scraping R code leverages the rvest package.

INTRODUCTION: The O’Reilly Artificial Intelligence (AI) Conference covers the full range of topics in leveraging AI technologies for developing software applications and creating innovative solutions. This web scraping script will automatically traverse the entire web page and collect all links to the PDF and PPTX documents. The script will also download the documents as part of the scraping process.
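The link-collection step can be illustrated with a short sketch. The project's actual script uses R's rvest package; the version below is a stand-alone Python illustration of the same idea, using only the standard library and a hard-coded snippet of HTML in place of the live proceedings page (the real script would fetch the page first and then download each matched document).

```python
from html.parser import HTMLParser

class SlideLinkParser(HTMLParser):
    """Collect href attributes that point to PDF or PPTX documents."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith((".pdf", ".pptx")):
                self.links.append(value)

# A tiny stand-in for the proceedings page; a real script would fetch
# the page (e.g., with urllib.request) before parsing it.
page = """
<html><body>
  <a href="/slides/intro-to-ml.pdf">Intro to ML</a>
  <a href="/slides/deep-learning.pptx">Deep Learning</a>
  <a href="/schedule">Schedule</a>
</body></html>
"""

parser = SlideLinkParser()
parser.feed(page)
print(parser.links)  # the two document links, in page order
```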

Starting URLs: https://conferences.oreilly.com/artificial-intelligence/ai-eu/public/schedule/proceedings

The source code and HTML output can be found here on GitHub.

Regression Model for Red vs. White Wine Quality Using Python

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Wine Quality dataset is a regression situation where we are trying to predict the value of a continuous variable.

INTRODUCTION: The dataset is related to the red and white variants of the Portuguese “Vinho Verde” wine. The problem is to predict wine quality using solely the chemical characteristics of the wine. Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g., there is no data about grape types, wine brand, wine selling price, etc.).

For the red wine…

ANALYSIS: The baseline performance of the machine learning algorithms achieved an average RMSE of 0.704. Two algorithms (Extra Trees and Random Forest) achieved the top RMSE metrics after the first round of modeling. After a series of tuning trials, Extra Trees turned in the top overall result and achieved an RMSE metric of 0.574. By using the optimized parameters, the Extra Trees algorithm processed the test dataset with an RMSE of 0.563, which was even better than the prediction from the training data.

CONCLUSION: For this iteration, the Extra Trees algorithm achieved the best overall results using the training and testing datasets. For this dataset, Extra Trees should be considered for further modeling.
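The modeling step above can be sketched with scikit-learn. This is a minimal illustration, not the project's actual notebook: it trains an Extra Trees regressor on synthetic data shaped like the wine dataset (11 numeric inputs, one continuous target) and scores it with RMSE; the real project loads the UCI Wine Quality CSV and tunes the hyperparameters.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 11 physicochemical inputs and the quality score.
rng = np.random.RandomState(42)
X = rng.rand(500, 11)
y = X @ rng.rand(11) + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = ExtraTreesRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# RMSE on the held-out split, as reported in the analysis.
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(f"Test RMSE: {rmse:.3f}")
```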

For the white wine…

ANALYSIS: The baseline performance of the machine learning algorithms achieved an average RMSE of 0.772. Two algorithms (Extra Trees and Random Forest) achieved the top RMSE metrics after the first round of modeling. After a series of tuning trials, Extra Trees turned in the top overall result and achieved an RMSE metric of 0.609. By using the optimized parameters, the Extra Trees algorithm processed the test dataset with an RMSE of 0.586, which was even better than the prediction from the training data.

CONCLUSION: For this iteration, the Extra Trees algorithm achieved the best overall results using the training and testing datasets. For this dataset, Extra Trees should be considered for further modeling.

Dataset Used: Wine Quality Data Set

Dataset ML Model: Regression with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/wine+quality

The HTML formatted report can be found here on GitHub.

Time Series Model for American River Riverflow Using Python

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a time series prediction model and document the end-to-end steps using a template. The American River Riverflow dataset is a time series situation where we are trying to forecast future outcomes based on past data points.

INTRODUCTION: The problem is to forecast the monthly riverflow for the American River at Fair Oaks, California. The dataset describes a time-series of flow volume (in cms) over 55 years (1906-1960), and there are 660 observations. We used the first 80% of the observations for training and testing various models while holding back the remaining observations for validating the final model.

ANALYSIS: The baseline prediction (or persistence) for the dataset resulted in an RMSE of 90.012. After performing a grid search for the most optimal ARIMA parameters, the final ARIMA non-seasonal order was (1, 0, 0) with the seasonal order being (1, 0, 1, 12). Furthermore, the chosen model processed the validation data with an RMSE of 78.413, which was better than the baseline model as expected.

CONCLUSION: For this dataset, the chosen ARIMA model achieved a satisfactory result and should be considered for further modeling.
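The persistence baseline mentioned in the analysis is simple to sketch. The code below uses a synthetic monthly series with a yearly cycle in place of the real riverflow data; the ARIMA fit itself would then be done with a library such as statsmodels (e.g., a SARIMAX model with the non-seasonal and seasonal orders reported above), which is omitted here to keep the sketch self-contained.

```python
import numpy as np

# Synthetic monthly riverflow with a yearly seasonal cycle, standing in
# for the 660-observation American River series.
rng = np.random.RandomState(0)
months = np.arange(240)
series = 100 + 40 * np.sin(2 * np.pi * months / 12) + rng.normal(scale=10, size=240)

# 80/20 split, as in the project: use the first 80 percent for training
# and hold back the rest for validation.
split = int(len(series) * 0.8)
train, valid = series[:split], series[split:]

# Persistence baseline: forecast each month as the previous month's value.
history = np.concatenate([train[-1:], valid[:-1]])
rmse = np.sqrt(np.mean((valid - history) ** 2))
print(f"Persistence RMSE: {rmse:.3f}")
```

A fitted seasonal ARIMA should beat this number, which is exactly the comparison the analysis reports.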

Dataset Used: Monthly riverflow in cms, American River at Fair Oaks, California, October 1906 through September 1960

Dataset ML Model: Time series forecast with numerical attributes

Dataset Reference: Rob Hyndman and Yangzhuoran Yang (2018). tsdl: Time Series Data Library. v0.1.0. https://pkg.yangzhuoranyang./tsdl/.

The HTML formatted report can be found here on GitHub.

Binary Classification Deep Learning Model for MiniBooNE Particle Identification Using Keras

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The MiniBooNE Particle Identification dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: This dataset is taken from the MiniBooNE experiment and is used to distinguish electron neutrinos (signal) from muon neutrinos (background). The data file is set up as follows. In the first line is the number of signal events followed by the number of background events. The records with the signal events come first, followed by the background events. Each line, after the first line, has the 50 particle ID variables for one event.
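The file layout described above is straightforward to parse. The sketch below is illustrative, not the project's code: `parse_miniboone` is a hypothetical helper, and the sample uses 3 variables per event instead of 50 to stay short.

```python
import io

def parse_miniboone(fh):
    """Parse the layout described above: a header line with the signal and
    background counts, then one record of particle ID variables per line,
    signal events first."""
    n_signal, n_background = map(int, fh.readline().split())
    X = [list(map(float, line.split())) for line in fh if line.strip()]
    # Label: 1 for signal (the first n_signal rows), 0 for background.
    y = [1] * n_signal + [0] * n_background
    return X, y

# Tiny stand-in file: 2 signal events, 1 background event.
sample = io.StringIO("2 1\n0.1 0.2 0.3\n0.4 0.5 0.6\n0.7 0.8 0.9\n")
X, y = parse_miniboone(sample)
print(y)  # [1, 1, 0]
```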

ANALYSIS: The baseline performance of the model achieved an average accuracy score of 93.62%. After tuning the hyperparameters, the best model processed the training dataset with an accuracy of 93.70%. Furthermore, the final model processed the test dataset with an accuracy of 93.94%, which was consistent with the accuracy result from the training dataset.

CONCLUSION: For this dataset, the model built using Keras and TensorFlow achieved a satisfactory result and should be considered for future modeling activities.

Dataset Used: MiniBooNE Particle Identification Data Set

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/MiniBooNE+particle+identification

The HTML formatted report can be found here on GitHub.

Binary Classification Model for MiniBooNE Particle Identification Using Python Take 6

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The MiniBooNE Particle Identification dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: This dataset is taken from the MiniBooNE experiment and is used to distinguish electron neutrinos (signal) from muon neutrinos (background). The data file is set up as follows. The first line is the number of signal events followed by the number of background events. The records with the signal events come first, followed by the background events. Each line, after the first line, has the 50 particle ID variables for one event.

For this iteration, we will leverage TPOT, the automated machine learning tool for Python that optimizes machine learning pipelines using genetic programming.

ANALYSIS: The baseline performance of the machine learning algorithms achieved the best accuracy of 91.11% after generation one. After generation 20, Gradient Boosting turned in the top overall result and achieved an accuracy metric of 92.44%. Furthermore, the Gradient Boosting algorithm processed the testing dataset with an accuracy of 92.83%, which was even better than the prediction result from the training data.

CONCLUSION: For this iteration, the Gradient Boosting algorithm achieved the best overall results using the training and test datasets. For this dataset, Gradient Boosting should be considered for further modeling.
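The winning algorithm can be sketched with scikit-learn's Gradient Boosting classifier. This is a hand-written illustration on synthetic data, not the pipeline TPOT actually exported: in the project, TPOT's `TPOTClassifier` runs the genetic search over 20 generations, and the code below only mirrors the final algorithm choice.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 50 particle ID variables; the real project
# feeds the parsed MiniBooNE records to TPOT instead.
rng = np.random.RandomState(7)
X = rng.rand(600, 50)
y = (X[:, :5].sum(axis=1) > 2.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

clf = GradientBoostingClassifier(random_state=7)
clf.fit(X_train, y_train)

# Accuracy on the held-out split, as reported in the analysis.
accuracy = clf.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```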

Dataset Used: MiniBooNE Particle Identification Data Set

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/MiniBooNE+particle+identification

The HTML formatted report can be found here on GitHub.

The Free Market Has an Enemy

In his Akimbo podcast [https://www.akimbo.me/], Seth Godin teaches us how to adopt a posture of possibility, change the culture, and choose to make a difference. Here are my takeaways from the episode.

In this podcast, Seth discusses the concept of a free market and how it differs from capitalism.

A free market is a place where buyers and sellers try to figure out what others want or can provide. We like the opportunities that a free market can provide. In our complex society, the free market is fragile and rarely stable.

Capitalism is the idea that capital (money) can be invested to build systems and to make things that would improve our productivity. Many people use the terms free market and capitalism interchangeably. While capitalism can fuel the free market, it is not the free market.

Capitalism can also lead to a ratchet called progress. But it comes with three challenges.

The first challenge is that capitalism encourages monopolistic behaviors. A monopoly takes away people’s choices. When we do not have a choice, we must do what the capitalist wants us to do.

The second challenge is that capitalism encourages short-term thinking. This comes from the fact that capitalism measures return on investment, and return on investment is time-based. As a result, the short-term thinking of capitalists, combined with the short-term thinking of consumers, produces an environment where nobody is thinking about the long term.

The third challenge for capitalism is corruption. Without boundaries and left to its own devices, bad actors in the market will attempt to use any means necessary to get an advantage. The outcome is Gresham’s law, where “bad money drives out good.”

On the other hand, the free market dislikes monopoly because the free market works when people have choices. The free market also does not work well when we make it very difficult to build things for the long haul. It is already difficult to focus on quality and meaningful things that will last. The free market also does poorly under the weight of corruption. When a capitalist acts like a bully who tries to power his way through the rules and structures of the free market, we all lose.

If we care about choice, investing for the long-term, and making progress without the threat of corruptive influence, we must stand up and defend the free market. Defending the free market is not the same as defending capitalism. Crony capitalism is a selfish act that tries to make the market work for itself and walks away from the very idea of the free market. A free market is about making better things and creating a better future for everyone.

Initiative

(From a writer I respect, Seth Godin)

The only way to get initiative is to take it yourself, because it is not something others give you.

Some people hesitate to take it, perhaps because they worry that initiative will somehow run out.

Initiative does not run out. It is a self-renewing resource.

From a very young age, most of us were taught to avoid it. Finish your homework. Take out the trash. Wait to be picked. Wait for others to reach out to you. Be likable. Fit in. Maybe do things a little differently now and then, but do not stand out too much. Failure will always be far worse than never trying.

The alternative is to take the initiative. To act on behalf of the people you seek to serve.

Go ahead with confidence; there is plenty of initiative to go around.