Web Scraping of O’Reilly Software Architecture Conference New York 2019

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The Python web scraping code leverages the BeautifulSoup module.

INTRODUCTION: Occasionally we need to download a batch of documents from a single web page without clicking the download links one at a time. This web scraping script automatically traverses the entire web page and collects all links to the PDF and PPTX documents. The script also downloads the PDF and PPTX documents as part of the scraping process.

Starting URLs: https://conferences.oreilly.com/software-architecture/sa-ny-2019/public/schedule/proceedings
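A minimal sketch of the two steps above, assuming the BeautifulSoup library named in the summary; the function names and output directory are illustrative, not taken from the original script:

```python
import os
from urllib.parse import urljoin
from urllib.request import urlretrieve

from bs4 import BeautifulSoup

def collect_document_links(html, base_url):
    """Return absolute URLs for every PDF/PPTX link on the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        urljoin(base_url, a["href"])
        for a in soup.find_all("a", href=True)
        if a["href"].lower().endswith((".pdf", ".pptx"))
    ]

def download_documents(links, out_dir="proceedings"):
    """Save each linked document under out_dir, keeping its file name."""
    os.makedirs(out_dir, exist_ok=True)
    for url in links:
        urlretrieve(url, os.path.join(out_dir, url.rsplit("/", 1)[-1]))
```

Relative links on the schedule page are resolved against the starting URL with `urljoin`, so the downloader only ever sees absolute URLs.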

The source code and HTML output can be found here on GitHub.

The Gap Between Learning and Doing

(From a writer I respect, Seth Godin)

Our society keeps these two things separate. At some point, we even decided that one interferes with the other.

Doctors attend eight years of schooling before they can formally practice medicine. For most of that time, you are merely learning to be a doctor, not actually being one.

A copywriter on the job spends most of the time writing, not learning new ways to write.

What we call "learning" is really more about receiving an "education," and that education revolves around compliance, rankings, and "will this be on the test?"

Doing well in school is not the same as truly learning something.

One reason we have not built apprenticeship into education is that it would strip authority from those who advocate lecturing and instruction alone.

There are currently fifty-six million people in the US K-12 (compulsory education) system. Most of them do nothing all day but attend school, and they bring none of the activities, experiments, and interactions of real life into their learning.

More than one hundred million people in the US go to work every day, yet few of them regularly read books, pursue further study, or look into how to do their jobs better. All of these are seen as distractions, or at best as inconvenient or wasted time.

The learning/doing gap is obvious. Professionals often take a decade or longer to accept and learn new methods. It took gastroenterologists years to accept that most ulcers are caused by bacteria and to change how they treat patients. It took our justice system more than thirty years to correct how strictly we scrutinize and sentence.

Perhaps it is because we conflate learning with education. When our education fails, someone else is held responsible, which also hands control over to others.

What would it become if the learning we do were always integrated with our work?

What would happen if we took our own actions seriously and spent the time to truly learn something from them?

When police departments take the time to study their numbers and investigate new methods, they find that efficiency and productivity improve, safety improves, and job satisfaction rises as well.

When science students design and run their own lab tests, their understanding of the work improves dramatically.

Traditional education (a compliance-based system) is undergoing a transformation as large as the one in every other industry that has been rebuilt through connection and the leverage of the internet. Yet too much new work in education merely looks for more efficient ways to deliver lectures and tests.

I see this every day. People show up at the Akimbo website expecting lifetime access and secret videos, but what they actually find is hard yet useful work of engagement.

What is the effective alternative? Real learning. Learning involves actually doing something, making presentations, reviewing and being reviewed, working on relevant projects, and engaging with peers. When we learn together and work on things together at the same time, each reinforces the other.

If you want to learn marketing, go do marketing. If you want to do marketing, go all out to learn marketing.

This same symmetry applies to everything we care about.

To quote the early rockers, "We don't need no... education."

But we could probably benefit from some learning.

As we keep doing, we keep learning, and we learn how to do it better.

Regression Model for Ames Iowa Housing Prices Using Python Take 4

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Ames Iowa Housing Prices dataset is a regression situation where we are trying to predict the value of a continuous variable.

INTRODUCTION: Many factors can influence a home’s purchase price. This Ames Housing dataset contains 79 explanatory variables describing every aspect of residential homes in Ames, Iowa. The goal is to predict the final price of each home.

In iteration Take1, we established the baseline root mean squared error (RMSE) for further takes of modeling.

In iteration Take2, we converted some of the categorical variables from nominal to ordinal and observed the effects of the change.

In iteration Take3, we examined the feature selection technique of attribute importance ranking by using the Gradient Boosting algorithm. By selecting only the most important attributes, we decreased the processing time and maintained a similar level of RMSE compared to the baseline.

In this iteration, we will examine the feature selection technique of recursive feature elimination (RFE) by using the Gradient Boosting algorithm. By selecting up to 50 attributes, we hope to decrease the processing time and maintain a similar level of RMSE compared to the baseline.
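The RFE step described above can be sketched with scikit-learn; synthetic data stands in for the encoded Ames attributes, and the estimator size and elimination step here are illustrative, not the project's actual settings:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFE

# Synthetic stand-in with the same attribute count as the encoded Ames data.
X, y = make_regression(n_samples=200, n_features=258, noise=0.1, random_state=7)

selector = RFE(
    estimator=GradientBoostingRegressor(n_estimators=25, random_state=7),
    n_features_to_select=50,  # keep the 50 strongest attributes, as in this take
    step=50,                  # drop 50 features per elimination round
)
X_reduced = selector.fit_transform(X, y)
```

Each elimination round refits the estimator and discards the lowest-ranked features, so a larger `step` trades some ranking precision for shorter processing time.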

ANALYSIS: The baseline performance of the machine learning algorithms achieved an average RMSE of 31,172. Two algorithms (Ridge Regression and Gradient Boosting) achieved the top RMSE metrics after the first round of modeling. After a series of tuning trials, Gradient Boosting turned in the best overall result with an RMSE of 24,165. Using the optimized parameters, the Gradient Boosting algorithm processed the test dataset with an RMSE of 21,067, which was even better than the RMSE from the training data.
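The tuning trials mentioned above can be sketched as a small cross-validated grid search; this is an illustrative setup on synthetic data, not the project's actual parameter grid:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Ames training data.
X, y = make_regression(n_samples=200, n_features=20, noise=0.1, random_state=7)

# Score each Gradient Boosting setting by cross-validated RMSE.
search = GridSearchCV(
    GradientBoostingRegressor(random_state=7),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X, y)
best_rmse = -search.best_score_  # flip the sign: sklearn maximizes the score
```

`search.best_params_` then holds the winning combination, which can be used to refit on the full training set before scoring the test data.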

From iteration Take2, Gradient Boosting achieved an RMSE metric of 23,612 with the training dataset and processed the test dataset with an RMSE of 21,130. Converting the nominal variables to ordinal did not have a material impact on the prediction accuracy in either direction.

From iteration Take3, Gradient Boosting achieved an RMSE metric of 24,045 with the training dataset and processed the test dataset with an RMSE of 21,994. At the importance level of 99%, the attribute importance technique eliminated 222 of 258 total attributes. The remaining 36 attributes produced a model that achieved a comparable RMSE to the baseline model. The processing time for Take3 was also reduced by 67.90% compared to the Take1 iteration.

From iteration Take4, Gradient Boosting achieved an RMSE metric of 23,825 with the training dataset and processed the test dataset with an RMSE of 21,898. The RFE technique eliminated 208 of 258 total attributes. The remaining 50 attributes produced a model that achieved a comparable RMSE to the baseline model. The processing time for Take4 was also reduced by 1.8% compared to the Take1 iteration.

CONCLUSION: For this iteration, the Gradient Boosting algorithm achieved the best overall results using the training and testing datasets. For this dataset, Gradient Boosting should be considered for further modeling.

Dataset Used: Kaggle Competition – House Prices: Advanced Regression Techniques

Dataset ML Model: Regression with numerical and categorical attributes

Dataset Reference: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

One potential source of performance benchmarks: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

The HTML formatted report can be found here on GitHub.

Pressfield on the Professional Mindset as a Practice

In his book, Turning Pro, Steven Pressfield teaches us how to navigate the passage from the amateur life to professional practice.

These are my takeaways from reading the book.

According to Pressfield, to “have a practice” is to follow a rigorous, prescribed regimen to elevate the mind and the spirit to a higher level. Pressfield also defined the practice as the dedicated, daily exercise of commitment, will, and focused intention aimed, on one level, at the achievement of mastery in a field.

We should consider setting up our practice with the following elements:

A practice has a space

That space is sacred. We want to encourage the qualities of Order, Commitment, Passion, Love, Intensity, Beauty, and Humility when we practice our work of art.

A practice has a time

When we practice our work, we want to approach it with order, commitment, and passionate intention. When we do our work daily in the same space at the same time, a powerful energy of intention, dedication, and commitment builds up around us.

A practice has an intention

Focused practice is the only way to achieve mastery. Our intention as professionals is to get better and to go deeper into our chosen field.

We come to a practice as warriors

Every time the professional enters the practice space, she knows that she will be facing a powerful opponent. That opponent is herself: she will be battling the demon of Resistance all day long.

We come to a practice in humility

We must bring intention and intensity to our practice, but we leave ego and arrogance behind at the entrance of our workspace.

We come to a practice as students

Even after we achieve "mastery" in our field, we always come to the practice field as students, still learning.

A practice is lifelong

For a professional, there is no finish line.

Unlike a project, life is a constant pursuit.

Regression Model for Ames Iowa Housing Prices Using Python Take 3

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Ames Iowa Housing Prices dataset is a regression situation where we are trying to predict the value of a continuous variable.

INTRODUCTION: Many factors can influence a home’s purchase price. This Ames Housing dataset contains 79 explanatory variables describing every aspect of residential homes in Ames, Iowa. The goal is to predict the final price of each home.

In iteration Take1, we established the baseline root mean squared error (RMSE) for further takes of modeling.

In iteration Take2, we converted some of the categorical variables from nominal to ordinal and observed the effects of the change.

In this iteration, we will examine the feature selection technique of attribute importance ranking by using the Gradient Boosting algorithm. By selecting only the most important attributes, we hope to decrease the processing time and maintain a similar level of RMSE compared to the baseline.
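The selection step above can be sketched as keeping the smallest set of attributes whose Gradient Boosting importances sum to 99%; synthetic data stands in for the Ames set, and the estimator size is illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in with the same attribute count as the encoded Ames data.
X, y = make_regression(n_samples=200, n_features=258, noise=0.1, random_state=7)

model = GradientBoostingRegressor(n_estimators=25, random_state=7).fit(X, y)

# Rank attributes by importance, then keep the smallest prefix covering 99%.
order = np.argsort(model.feature_importances_)[::-1]
cumulative = np.cumsum(model.feature_importances_[order])
keep = order[: np.searchsorted(cumulative, 0.99) + 1]
X_reduced = X[:, keep]
```

Because Gradient Boosting importances sum to 1, the cumulative sum makes the 99% cutoff a natural stopping point for the ranked list.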

ANALYSIS: The baseline performance of the machine learning algorithms achieved an average RMSE of 31,172. Two algorithms (Ridge Regression and Gradient Boosting) achieved the top RMSE metrics after the first round of modeling. After a series of tuning trials, Gradient Boosting turned in the best overall result with an RMSE of 24,165. Using the optimized parameters, the Gradient Boosting algorithm processed the test dataset with an RMSE of 21,067, which was even better than the RMSE from the training data.

From iteration Take2, Gradient Boosting achieved an RMSE metric of 23,612 with the training dataset and processed the test dataset with an RMSE of 21,130. Converting the nominal variables to ordinal did not have a material impact on the prediction accuracy in either direction.

From iteration Take3, Gradient Boosting achieved an RMSE metric of 24,045 with the training dataset and processed the test dataset with an RMSE of 21,994. At the importance level of 99%, the attribute importance technique eliminated 222 of 258 total attributes. The remaining 36 attributes produced a model that achieved a comparable RMSE to the baseline model. The processing time for Take3 was also reduced by 67.90% compared to the Take1 iteration.

CONCLUSION: For this iteration, the Gradient Boosting algorithm achieved the best overall results using the training and testing datasets. For this dataset, Gradient Boosting should be considered for further modeling.

Dataset Used: Kaggle Competition – House Prices: Advanced Regression Techniques

Dataset ML Model: Regression with numerical and categorical attributes

Dataset Reference: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

One potential source of performance benchmarks: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

The HTML formatted report can be found here on GitHub.

Who is Banksy?

In his podcast, Akimbo, Seth Godin teaches us how to adopt a posture of possibility, change the culture, and choose to make a difference. Here are my takeaways from the episode.

In this podcast, Seth discusses how the origin of something, especially for culture, can affect how we perceive its legitimacy and impact.

When we talk about gravity and calculus, we often talk about Isaac Newton. While Newton arguably is one of the most important pioneers in science and math, he spent most of his time on predicting the apocalypse and practicing alchemy, which is not science at all.

Einstein’s theory of relativity is considered one of the two pillars of modern physics (alongside quantum mechanics). Yet the great Albert Einstein spent decades of his life disputing quantum mechanics in favor of a unified field theory.

When it comes to science, we generally do not discount the truth of the theories just because of the person who made the discovery or presented the proof. However, the same objectivity cannot be said for art and culture.

When it comes to art and culture, the origin (or who made it) matters a lot to us. Fountain by Marcel Duchamp and The Bachman Books by Stephen King are two examples of how we often view a work of art through the lens of the art’s origin. Our perception of art changes based on who we think the artist is.

One approach for artists who want to produce work with cultural relevance is to adopt a secret identity. Like Bruce Wayne's Batman, a secret identity represents an idea, not a person; Batman is an icon. Banksy, perhaps Robin Gunningham, has been practicing street art anonymously. What is fascinating is the cult-like following the artist has attracted, which shows how much we enjoy speculating about who Banksy might be.

In our culture, we judge the works of art around us all the time. We judge by knowing where a work came from, and sometimes through the gatekeepers. Now that we are much more connected and have less need for gatekeepers, it matters less where a piece of art came from.

Day after day, we are moving away from asking “Where did this come from?” and toward asking “What does this do for me?” Each one of us has the chance to contribute something to the culture, particularly when we are talking to people who are looking for the truth. We should seek out others on the same journey and level up by generously sharing our work with the community.

Regression Model for Ames Iowa Housing Prices Using Python Take 2

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Ames Iowa Housing Prices dataset is a regression situation where we are trying to predict the value of a continuous variable.

INTRODUCTION: Many factors can influence a home’s purchase price. This Ames Housing dataset contains 79 explanatory variables describing every aspect of residential homes in Ames, Iowa. The goal is to predict the final price of each home.

In iteration Take1, we established the baseline root mean squared error (RMSE) for further takes of modeling.

In this iteration, we plan to convert some of the categorical variables from nominal to ordinal and observe the effects of the change.
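The conversion can be sketched with a pandas mapping; `ExterQual` and its quality codes are drawn from the Ames data dictionary, though the exact columns converted in this take may differ:

```python
import pandas as pd

# Ames quality codes map naturally onto an ordered 1-5 scale.
quality_scale = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

df = pd.DataFrame({"ExterQual": ["Gd", "TA", "Ex", "Fa"]})
df["ExterQual"] = df["ExterQual"].map(quality_scale)
```

Unlike one-hot encoding, this keeps a single column per variable and preserves the natural ordering of the quality levels for the model.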

ANALYSIS: The baseline performance of the machine learning algorithms achieved an average RMSE of 31,172. Two algorithms (Ridge Regression and Gradient Boosting) achieved the top RMSE metrics after the first round of modeling. After a series of tuning trials, Gradient Boosting turned in the best overall result with an RMSE of 24,165. Using the optimized parameters, the Gradient Boosting algorithm processed the test dataset with an RMSE of 21,067, which was even better than the RMSE from the training data.

From iteration Take2, Gradient Boosting achieved an RMSE metric of 23,612 with the training dataset and processed the test dataset with an RMSE of 21,130. Converting the nominal variables to ordinal did not have a material impact on the prediction accuracy in either direction.

CONCLUSION: For this iteration, the Gradient Boosting algorithm achieved the best overall results using the training and testing datasets. For this dataset, Gradient Boosting should be considered for further modeling.

Dataset Used: Kaggle Competition – House Prices: Advanced Regression Techniques

Dataset ML Model: Regression with numerical and categorical attributes

Dataset Reference: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

One potential source of performance benchmarks: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

The HTML formatted report can be found here on GitHub.

Web Scraping of Daines Analytics Blog Using R Take 2

SUMMARY: The purpose of this project is to practice web scraping by gathering specific pieces of information from a website. The web scraping code was written in R and leveraged the rvest package.

INTRODUCTION: Daines Analytics hosts its blog at dainesanalytics.blog. The purpose of this exercise is to practice web scraping by gathering the blog entries from Daines Analytics’ RSS feed. The script automatically traverses the RSS feed to capture all blog entries in a JSON document.

For this second iteration, the script will also store the captured information in a remote relational database.

Starting URLs: https://dainesanalytics.blog/feed or https://dainesanalytics.blog/feed/?paged=1
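The project's code is written in R with the rvest package; as a language-neutral illustration of the same traversal, one feed page can be reduced to JSON-ready records with standard-library Python (the element names follow the usual WordPress RSS layout):

```python
import json
import xml.etree.ElementTree as ET

def parse_feed_page(xml_text):
    """Extract title, link, and publication date from each RSS <item>."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
        }
        for item in root.iter("item")
    ]

# Hypothetical one-item feed page, shaped like a WordPress RSS response.
sample = (
    "<rss><channel>"
    "<item><title>Post A</title>"
    "<link>https://dainesanalytics.blog/post-a</link>"
    "<pubDate>Mon, 01 Apr 2019 00:00:00 +0000</pubDate></item>"
    "</channel></rss>"
)
entries_json = json.dumps(parse_feed_page(sample))
```

A full run would fetch `?paged=1`, `?paged=2`, and so on until the feed returns no more items, appending each page's records before writing the JSON document.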

The source code and JSON output can be found here on GitHub.