Regression Model for Online News Popularity Using Python Take 2

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Online News Popularity dataset is a regression situation where we are trying to predict the value of a continuous variable.

INTRODUCTION: This dataset summarizes a heterogeneous set of features about articles published by Mashable in a period of two years. The goal is to predict the article’s popularity level in social networks. The dataset does not contain the original content, but some statistics associated with it. The original content can be publicly accessed and retrieved using the provided URLs.

Many thanks to K. Fernandes, P. Vinagre, and P. Cortez ("A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News," Proceedings of the 17th EPIA 2015 – Portuguese Conference on Artificial Intelligence, September 2015, Coimbra, Portugal) for making the dataset and benchmarking information available.

In iteration Take1, the script focused on evaluating various machine learning algorithms and identifying the algorithm that produces the best accuracy result. Iteration Take1 established a baseline performance regarding accuracy and processing time.

For this iteration, we will examine the feasibility of a dimensionality reduction technique: ranking attribute importance with the Lasso algorithm. Afterward, we will eliminate the features that do not contribute to the top 0.99 (or 99%) of cumulative importance.
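This selection step can be sketched in scikit-learn. The snippet below is a minimal illustration on synthetic data, not the project's actual script; the alpha value, feature counts, and data generation are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic stand-in for the 58-attribute news dataset
X, y = make_regression(n_samples=500, n_features=20, n_informative=8,
                       noise=10.0, random_state=42)

lasso = Lasso(alpha=1.0).fit(X, y)

# Treat the absolute coefficient magnitude as the importance score
importance = np.abs(lasso.coef_)
order = np.argsort(importance)[::-1]
cumulative = np.cumsum(importance[order]) / importance.sum()

# Keep the smallest set of features that reaches 99% cumulative importance
n_keep = int(np.searchsorted(cumulative, 0.99)) + 1
selected = order[:n_keep]
print(f"Kept {n_keep} of {X.shape[1]} features")
```

The remaining features (those outside `selected`) would then be dropped before the modeling rounds.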

ANALYSIS: From the previous iteration Take1, the baseline performance of the machine learning algorithms achieved an average RMSE of 13020. Two algorithms (Linear Regression and ElasticNet) achieved the top RMSE scores after the first round of modeling. After a series of tuning trials, ElasticNet turned in the top result using the training data. It achieved the best RMSE of 11273. Using the optimized tuning parameters, the ElasticNet algorithm processed the validation dataset with an RMSE of 12089, which was slightly worse than the result on the training data.

In the current iteration, the baseline performance of the machine learning algorithms achieved an average RMSE of 13128. Two algorithms (Linear Regression and ElasticNet) achieved the top RMSE scores after the first round of modeling. After a series of tuning trials, ElasticNet turned in the top result using the training data. It achieved the best RMSE of 11358. Using the optimized tuning parameters, the ElasticNet algorithm processed the validation dataset with an RMSE of 12146, which was slightly worse than the result on the training data.
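The tuning trials mentioned above can be approximated with a grid search over ElasticNet's two regularization parameters. This is a sketch on synthetic data; the grid values and cross-validation setup are illustrative assumptions, not the project's actual configuration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=15, noise=20.0, random_state=7)

# Grid over regularization strength (alpha) and L1/L2 mix (l1_ratio);
# the values are illustrative, not the ones used in the project
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0],
              "l1_ratio": [0.1, 0.5, 0.9]}

search = GridSearchCV(ElasticNet(max_iter=5000),
                      param_grid,
                      scoring="neg_root_mean_squared_error",
                      cv=5)
search.fit(X, y)

# scikit-learn maximizes the score, so negate to recover RMSE
best_rmse = -search.best_score_
print("Best params:", search.best_params_)
print("CV RMSE: %.2f" % best_rmse)
```

The best estimator from the search would then be refit and evaluated once on the held-out validation set, mirroring the train/validation comparison reported above.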

From the model-building activities, the number of attributes went from 58 down to 30 after eliminating 28 attributes. The processing time went from 15 minutes 1 second in iteration Take1 up to 17 minutes 37 seconds in iteration Take2, which was due to the additional time required for the feature selection processing.

CONCLUSION: Feature selection helped by cutting down the number of attributes while retaining a comparable level of accuracy. For this dataset, ElasticNet should be considered for further modeling or production use.

Dataset Used: Online News Popularity Dataset

Dataset ML Model: Regression with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity

The HTML formatted report can be found here on GitHub.

Drucker on Knowing Our Strengths

In his book, The Essential Drucker: The Best of Sixty Years of Peter Drucker's Essential Writings on Management, Peter Drucker analyzed the ways that management practices and principles affect the performance of organizations, individuals, and society. The book covers the basic principles of management and gives professionals the tools to perform the tasks that the environment of tomorrow will require of them.

These are my takeaways from reading the book.

In the chapter “Know Your Strengths and Values,” Drucker discussed why knowledge workers will have to learn to manage themselves. Managing ourselves means placing ourselves where we can make the greatest contribution. We will have to learn to develop ourselves and stay mentally productive during a fifty-year working life. We will also have to learn how and when to change, because the environment simply does not stand still.

As knowledge workers, learning to manage ourselves is critical because our careers are likely to outlive our employing organizations. The average working life of a knowledge worker will likely be fifty years, especially as more and more people work well into the so-called “retirement” age. But the average life expectancy of a successful business is only about thirty years, and many do not survive even that long. Increasingly, knowledge workers will outlive any one employer and will have to be prepared for more than one job, more than one assignment, and more than one career.

To prepare ourselves for the new environment, we need to know our strengths and weaknesses. Contrary to some people’s suggestions, we can only build performance on our strengths, not with our weaknesses.

Drucker suggested that there is only one way to discover our strengths: feedback analysis. Whenever we make a key decision or take a key action, we write down what we expect will happen. Some months later, we review the results and evaluate our earlier expectations.

Within a fairly short period, maybe two or three years, this simple procedure will tell us where our strengths lie. This is the most important thing to know about ourselves. The analysis will show us what we did that gave us the full yield from our strengths. It will also show where we are not particularly competent, and where we have no strengths and cannot perform.

The feedback analysis should yield the following conclusions.

The first conclusion, and the most important, is to concentrate on our strengths. We need to place ourselves where our strengths can produce performance and results.

Second, we should work on improving our strengths. The feedback analysis can show where we need to improve our skills or acquire new knowledge. It will show the gaps in our knowledge and the places where our existing skills and knowledge are no longer adequate.

Third, we need to identify where intellectual arrogance can cause disabling ignorance. The feedback analysis can show that the main reason for poor performance is the result of simply not knowing enough, or the result of being contemptuous of knowledge outside our specialty. The feedback analysis can also help us remedy our bad habits. The bad habits are things we do or fail to do that inhibit our effectiveness and performance.

In conclusion, Drucker recommended that we waste as little effort as possible on improving areas of low competence. Instead, we should concentrate on areas of high competence and high skill, assuming they are relevant to our objectives and the environment. It takes far more energy and far more work to improve from incompetence to low mediocrity than it takes to improve from first-rate performance to excellence. Unfortunately, most people concentrate on turning an incompetent area into low mediocrity. The energy, resources, and time should instead go into turning our strengths into star performance.

Regression Model for Online News Popularity Using R Take 1

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Online News Popularity dataset is a regression situation where we are trying to predict the value of a continuous variable.

INTRODUCTION: This dataset summarizes a heterogeneous set of features about articles published by Mashable in a period of two years. The goal is to predict the article’s popularity level in social networks. The dataset does not contain the original content, but some statistics associated with it. The original content can be publicly accessed and retrieved using the provided URLs.

Many thanks to K. Fernandes, P. Vinagre, and P. Cortez ("A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News," Proceedings of the 17th EPIA 2015 – Portuguese Conference on Artificial Intelligence, September 2015, Coimbra, Portugal) for making the dataset and benchmarking information available.

ANALYSIS: The baseline performance of the machine learning algorithms achieved an average RMSE of 10446. Two algorithms (Random Forest and Stochastic Gradient Boosting) achieved the top RMSE scores after the first round of modeling. After a series of tuning trials, Random Forest turned in the top result using the training data. It achieved the best RMSE of 10299. Using the optimized tuning parameters, the Random Forest algorithm processed the validation dataset with an RMSE of 12978, which was worse than the result on the training data and possibly a sign of over-fitting.

CONCLUSION: For this iteration, the Random Forest algorithm achieved the top training and validation results compared to the other machine learning algorithms. For this dataset, Random Forest should be considered for further modeling or production use.

Dataset Used: Online News Popularity Dataset

Dataset ML Model: Regression with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity

The HTML formatted report can be found here on GitHub.

This Is Marketing

In his podcast, Akimbo, Seth Godin teaches us how to adopt a posture of possibility, change the culture, and choose to make a difference. Here are my takeaways from the episode.

Human society has gone through many revolutions. Four notable recent ones are:

  • The industrial revolution: With it, we now can make things with more sophistication and better quality every passing year.
  • The computer revolution: With it, we now have computers that can handle many complex scientific and engineering computations for us.
  • The networked-computer revolution: With it, we can connect all the computers and data to form more data and insights than ever.
  • The marketing revolution: This revolution combines the first three revolutions and changes how we make things, how we bring things to the market, and our perception and belief about the things we make. Marketing is all about changing the culture.

Marketing is now more pervasive than religion was 100 years ago. Marketing changes how we interact with things and people, what we buy, how we vote, and even how we date. Some ideas presented in the marketing revolution are:

Marketers make change happen. Marketers assert that something they bring in will change someone for the better. What we do as a marketer is to make things better by making better things. We use stories to talk about some product, some service, or some vision that will make an actual, positive impact on other human beings.

Often, marketing now is about changing the culture: “People Like Us Do Things Like This.” We need to figure out who we are trying to serve, because not everyone is “people like us.” We also need to figure out what we plan to do (regarding product, service, or vision), because not everything is “things like this.”

By focusing on specific people and things, we can work on identifying the “smallest viable audience.” It is not doing something for the masses. It is also not making average stuff for average people.

If we can find and engage with the smallest viable audience for the change we are hoping to make, we can sustain the movement of change. We can serve the people who want to be served, see the people who want to be seen, and connect with the people who want to be connected. Once we identify the smallest viable audience, we can answer the questions of who it is for and what it is for.

When we go to the market, most people do not want to hear from us. Most people will be skeptical. Instead of trying to please the masses, we need to shun the non-believers. If the people we seek to serve do not get what we are doing or do not want to engage with us, we should treat this as a critical lesson and figure out how to improve. Spending energy on the non-believers is a waste.

Bringing about a change is difficult, even when we bring it to the people who want to hear from us in the first place. Change is risky. It is only recently that people can thrive on being neophiliacs, and they are the first audience we are trying to serve.

Even marketing to neophiliacs can create tension. Tension is something we all try hard to avoid. Traditional mass marketers try to position their product/service by reducing tension. In modern marketing, we need to consider creating tension to gain traction on the change we hope to serve up. The tension sends a message that says, “Where you are now is fine, but the place I am trying to take you will be even better.”

When we are trying to create the tension, we need to see what people fear. Modern marketing is about seeing that everyone has different concerns and fears, and about embracing those concerns and working with them. We can talk to everyone in a way that they want to be talked to. We can talk to them about their fears.

Modern marketing is also about the concept of status. Successful marketing will address everyone’s need for status. Some will want higher status. Many will want their status unchanged, if not improved. Some may even want a lower status, as they believe the lower status will help them hide better. Regardless of people’s status needs, marketing helps to address those needs.

The network effect also plays into the axiom of “People Like Us Do Things Like This.” A successful product or service is also a beneficiary of the network effect, and the idea spreads. Significant cultural changes usually involve products, services, and ideas that have the network effect. In other words, those products, services, and ideas spread in a way that begs to be shared, because they are remarkable and worth talking about by the tribes.

The tribes are natural clumps of people: people who want to be connected and to be seen doing common things together. The tribes do not belong to anyone, but we can rise to lead them. We can lead them by connecting the tribe members, committing to where they hope to go, and being the impresario who weaves the necessary strands into the braid of a movement. We all seek out meaning and purpose. The tribes are representations of those meanings and purposes, so we want to be part of a tribe.

Regression Model for Online News Popularity Using Python Take 1

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Online News Popularity dataset is a regression situation where we are trying to predict the value of a continuous variable.

INTRODUCTION: This dataset summarizes a heterogeneous set of features about articles published by Mashable in a period of two years. The goal is to predict the article’s popularity level in social networks. The dataset does not contain the original content, but some statistics associated with it. The original content can be publicly accessed and retrieved using the provided URLs.

Many thanks to K. Fernandes, P. Vinagre, and P. Cortez ("A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News," Proceedings of the 17th EPIA 2015 – Portuguese Conference on Artificial Intelligence, September 2015, Coimbra, Portugal) for making the dataset and benchmarking information available.

ANALYSIS: The baseline performance of the machine learning algorithms achieved an average RMSE of 13020. Two algorithms (Linear Regression and ElasticNet) achieved the top RMSE scores after the first round of modeling. After a series of tuning trials, ElasticNet turned in the top result using the training data. It achieved the best RMSE of 11273. Using the optimized tuning parameters, the ElasticNet algorithm processed the validation dataset with an RMSE of 12089, which was slightly worse than the result on the training data.

CONCLUSION: For this iteration, the ElasticNet algorithm achieved the top training and validation results compared to the other machine learning algorithms. For this dataset, ElasticNet should be considered for further modeling or production use.

Dataset Used: Online News Popularity Dataset

Dataset ML Model: Regression with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity

The HTML formatted report can be found here on GitHub.

Web Scraping of Quotes from Famous People using R Take 3

SUMMARY: The purpose of this project is to practice web scraping by gathering specific pieces of information from a website. The web scraping code was written in R and leveraged the rvest package.

INTRODUCTION: A demo website, created by Scrapinghub, lists quotes from famous people. It has many endpoints showing the quotes in different ways, and each endpoint presents a different scraping challenge for practicing web scraping. For this Take3 iteration, the R script attempts to scrape the quote information that is displayed via an infinite scrolling page.

Starting URLs: http://quotes.toscrape.com/scroll
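The Take3 script itself uses R and rvest, but the underlying idea is language-agnostic: an infinite-scroll page loads its quotes from a JSON endpoint rather than rendering them in the initial HTML, so the scraper parses JSON pages instead of markup. Below is a minimal Python sketch of that parsing step; the payload shape and field names are assumptions modeled on the demo site's API, not taken from the project's code.

```python
import json

# Shape of one page of the JSON payload the infinite-scroll endpoint returns
# (illustrative sample; field names assumed from the demo site's API)
sample_page = json.loads("""
{
  "has_next": true,
  "page": 1,
  "quotes": [
    {"author": {"name": "Albert Einstein"},
     "tags": ["change", "thinking"],
     "text": "The world as we have created it is a process of our thinking."}
  ]
}
""")

def extract_quotes(page):
    """Flatten one page of the API response into (text, author) pairs."""
    return [(q["text"], q["author"]["name"]) for q in page["quotes"]]

quotes = extract_quotes(sample_page)
print(quotes[0][1])  # Albert Einstein

# To paginate the live endpoint, one would fetch successive pages
# (e.g. http://quotes.toscrape.com/api/quotes?page=N) while has_next is true.
```

In the R version, the same loop would fetch each page with an HTTP client and parse it with a JSON package, accumulating quotes until `has_next` turns false.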

The source code and JSON output can be found here on GitHub.

The Problem Is That They Will Always Outnumber You

(From a writer I like and respect, Seth Godin)

You can spend your whole life proving other people wrong, but that is meaningless; it is a battle you cannot win.

A more effective approach is to build connections, to form alliances, and to do your best work among the people you interact with.

Because you cannot know what it feels like to be someone else, and other people cannot know what it feels like to be you, either.

That is one more reason why we should work hard to look for the good in others.

Binary Classification Model for Online News Popularity Using Python Take 3

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Online News Popularity dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: This dataset summarizes a heterogeneous set of features about articles published by Mashable in a period of two years. The goal is to predict the article’s popularity level in social networks. The dataset does not contain the original content but some statistics associated with it. The original content can be publicly accessed and retrieved using the provided URLs.

Many thanks to K. Fernandes, P. Vinagre, and P. Cortez ("A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News," Proceedings of the 17th EPIA 2015 – Portuguese Conference on Artificial Intelligence, September 2015, Coimbra, Portugal) for making the dataset and benchmarking information available.

In iteration Take1, the script focused on evaluating various machine learning algorithms and identifying the algorithm that produces the best accuracy result. Iteration Take1 established a baseline performance regarding accuracy and processing time.

In iteration Take2, we examined the feasibility of a dimensionality reduction technique: ranking attribute importance with a gradient boosting tree method. Afterward, we eliminated the features that did not contribute to the top 0.99 (or 99%) of cumulative importance.
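Take2's importance ranking can be sketched as follows. Synthetic data and a scikit-learn gradient boosting classifier stand in for the project's actual pipeline; the 0.99 threshold matches the text, and everything else is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=25, n_informative=8,
                           random_state=42)

model = GradientBoostingClassifier(random_state=42).fit(X, y)

# Rank features by the model's built-in impurity-based importance scores
importance = model.feature_importances_
order = np.argsort(importance)[::-1]
cumulative = np.cumsum(importance[order]) / importance.sum()

# Keep only the features that make up the top 99% of cumulative importance
n_keep = int(np.searchsorted(cumulative, 0.99)) + 1
print(f"Kept {n_keep} of {X.shape[1]} features")
```

The discarded tail of the ranking corresponds to the attributes eliminated before the Take2 modeling rounds.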

For this iteration, we will explore the Recursive Feature Elimination (or RFE) technique by recursively removing attributes and building a model on those attributes that remain. To keep the training time manageable, we will limit the number of attributes to 40.
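The RFE step described above can be sketched with scikit-learn. This is a minimal illustration on synthetic data; the ranking estimator (a logistic regression here) and the data are assumptions, while the 40-feature cap matches the text.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the 58-attribute news dataset
X, y = make_classification(n_samples=400, n_features=58, n_informative=10,
                           random_state=42)

# Recursively refit the estimator, dropping the weakest feature each round,
# until only 40 features remain
rfe = RFE(LogisticRegression(max_iter=2000), n_features_to_select=40)
rfe.fit(X, y)

selected = [i for i, keep in enumerate(rfe.support_) if keep]
print(f"Selected {len(selected)} of {X.shape[1]} features")
```

The surviving columns would then feed the first round of modeling with the full suite of algorithms.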

ANALYSIS: From the previous iteration Take1, the baseline performance of the algorithms achieved an average accuracy of 59.95%. Three algorithms (Bagged CART, AdaBoost, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 67.38%. Using the optimized tuning parameter available, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 66.89%, which was just slightly worse than the training data.

From the previous iteration Take2, the baseline performance of the algorithms achieved an average accuracy of 60.60%. Two ensemble algorithms (Bagged CART and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 67.34%. Using the optimized tuning parameter available, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 66.70%, which was just slightly below the accuracy of the training data.

In the current iteration, the baseline performance of the machine learning algorithms achieved an average accuracy of 61.08%. Two algorithms (Support Vector Machine and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 65.25%. Using the optimized tuning parameter available, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 64.17%, which was just slightly below the accuracy of the training data.

From the model-building activities, the number of attributes went from 58 down to 44 after eliminating 14 attributes. The processing time went from 5 hours 56 minutes in iteration Take1 down to 1 hour 34 minutes in iteration Take3, which was a reduction of 73% from Take1. It also was a slight decrease in comparison to Take2, which reduced the processing time down to 1 hour 59 minutes.

CONCLUSION: The feature selection techniques helped by cutting down the attributes and reducing the training time. The modeling took a much shorter time to process yet retained a comparable level of accuracy. For this dataset, the Stochastic Gradient Boosting algorithm and either of the dimensionality reduction techniques should be considered for further modeling or production use.

Dataset Used: Online News Popularity Dataset

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity

The HTML formatted report can be found here on GitHub.