Regression Model for Kaggle Tabular Playground Series 2021 Jan Using Python and XGBoost

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: This project aims to construct a predictive model using various machine learning algorithms and document the end-to-end steps using a template. The Kaggle Tabular Playground Series 2021 Jan dataset is a regression situation where we are trying to predict the value of a continuous variable.

INTRODUCTION: Kaggle wants to provide an approachable environment for people who are relatively new to their data science journey. Since January 2021, they have been hosting playground-style competitions on Kaggle with fun, less complex tabular datasets. These competitions are great for people looking for something between the Titanic Getting Started competition and a Featured competition.

ANALYSIS: The performance of the preliminary XGBoost model achieved an RMSE benchmark of 0.5068. After a series of tuning trials, the refined XGBoost model processed the training dataset with a final RMSE score of 0.4883. When we applied the final model to Kaggle’s test dataset, the model achieved an RMSE score of 0.6996.
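
The write-up does not include code, but the RMSE metric it reports can be sketched in a few lines of plain Python (the function name rmse is mine, not from the project):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the mean squared residual."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy example: residuals of 0.5, 0.0, and -0.5
print(rmse([3.0, 5.0, 2.5], [2.5, 5.0, 3.0]))  # ~0.4082
```

Lower is better; a tuned model reducing the training RMSE from 0.5068 to 0.4883, as above, reflects a modest improvement in average prediction error.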

CONCLUSION: In this iteration, the XGBoost model appeared to be a suitable algorithm for modeling this dataset.

Dataset Used: Kaggle Tabular Playground Series 2021 Jan Data Set

Dataset ML Model: Regression with numerical attributes

Dataset Reference: https://www.kaggle.com/c/tabular-playground-series-jan-2021

One potential source of performance benchmarks: https://www.kaggle.com/c/tabular-playground-series-jan-2021/leaderboard

The HTML formatted report can be found here on GitHub.

Regression Model for Kaggle Tabular Playground Series 2021 Jan Using Python and Scikit-learn

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: This project aims to construct a predictive model using various machine learning algorithms and document the end-to-end steps using a template. The Kaggle Tabular Playground Series 2021 Jan dataset is a regression situation where we are trying to predict the value of a continuous variable.

INTRODUCTION: Kaggle wants to provide an approachable environment for people who are relatively new to their data science journey. Since January 2021, they have been hosting playground-style competitions on Kaggle with fun, less complex tabular datasets. These competitions are great for people looking for something between the Titanic Getting Started competition and a Featured competition.

ANALYSIS: The average performance of the machine learning algorithms achieved an RMSE benchmark of 0.5276 using the training dataset. We selected ElasticNet and Extra Trees to perform the tuning exercises. After a series of tuning trials, the refined Extra Trees model processed the training dataset with a final RMSE score of 0.4949. When we applied the final model to Kaggle’s test dataset, the model achieved an RMSE score of 0.7038.
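
The post does not show the tuning code, but a typical scikit-learn sketch of such an exercise, grid-searching an Extra Trees regressor with cross-validated RMSE, might look like the following. The synthetic data and the parameter grid are illustrative assumptions, not the project's actual settings:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Illustrative synthetic data standing in for the Kaggle training set
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# A small, hypothetical grid; the actual tuning trials are not listed in the post
param_grid = {"n_estimators": [50, 100], "min_samples_leaf": [1, 3]}
search = GridSearchCV(
    ExtraTreesRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # sklearn maximizes, hence "neg"
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```

The cross-validated score on the training data plays the role of the 0.4949 benchmark above; the held-out Kaggle test set then gives the final, usually worse, number.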

CONCLUSION: In this iteration, the Extra Trees model appeared to be a suitable algorithm for modeling this dataset.

Dataset Used: Kaggle Tabular Playground Series 2021 Jan Data Set

Dataset ML Model: Regression with numerical attributes

Dataset Reference: https://www.kaggle.com/c/tabular-playground-series-jan-2021

One potential source of performance benchmarks: https://www.kaggle.com/c/tabular-playground-series-jan-2021/leaderboard

The HTML formatted report can be found here on GitHub.

Seth Godin’s Akimbo: Lying, Lying, Lying with Stats and More

In his Akimbo podcast, Seth Godin teaches us how to adopt a posture of possibility, change the culture, and choose to make a difference. Here are my takeaways from the episode.

In this podcast, Seth discusses the three “malpractices” of presenting graphs and charts. We should understand these pitfalls and the potential for manipulation because it is always a sound idea to know what a graph or chart is trying to say and why.

Showing someone a graph, a chart, or a poll is an intentional act. We choose something to show someone because we want to make a point. Very often, we do this to amplify the intent of a story. At the same time, our presentations often violate three simple rules.

Malpractice number one is manipulating axes or scales when comparing two or more things. We often stretch the scales to emphasize or exaggerate minor differences when, most of the time, the differences are not significant at all.

Malpractice number two is using various graphic elements to emphasize the points. A three-dimensional volume is different from a two-dimensional area. A two-dimensional area is different from a one-dimensional line. When we use a three-dimensional volume object to illustrate the change of a single axis, it can create a false impression of the impact of the change.

The third malpractice has to do with how we communicate poll results. First, many polls are not sufficiently random for the results to be beneficial. Second, poll results often describe people’s feelings at the time of the survey, but the same people may act very differently some time after the poll. We often mistake polls for reality or certainty when they are merely odds.

The takeaway lesson is that when someone presents a graph or chart, we need to understand the fundamental point that person is trying to convey. We need to ask the right questions to know whether the presenters are presenting things fairly or trying to make a point.

When we present graphs and charts, we should strive to create ones that are inherently straightforward, honest, and accurately constructed, yet still illustrate our perspective. One tip for making a chart: state precisely what the chart is trying to convey, strip away all the extraneous information to get to the underlying truth, and present it as clearly as we can.

Investment and Expense

(From a writer I respect, Seth Godin)

One adds value; the other does not. One creates value over time; the other does not.

It can be fun to imagine our expenses as investments, but if that were truly the case, we would simply call them investments.

Our tools can be reused, and our assets hold value for us and for others. A skill can be an investment whose value keeps growing as the skill grows. An expense, on the other hand, only fades in value.

Algorithmic Trading Model for Trend-Following with Bollinger Bands Strategy Using Python Take 2

NOTE: This script is for learning purposes only and does not constitute a recommendation for buying or selling any stock mentioned in this script.

SUMMARY: This project aims to construct and test an algorithmic trading model and document the end-to-end steps using a template.

INTRODUCTION: This algorithmic trading model examines a simple mean-reversion strategy for a stock. The model enters a position when the price reaches either the upper or lower Bollinger Band computed over the last X trading days. The model exits the trade when the stock price crosses the middle Bollinger Band of the same window size.

In iteration Take1, we set up the models using a trend window size for long trades only. The window size varied from 10 to 50 trading days at a 5-day increment, and we fixed the Bollinger Band factor at 2.0.

In this Take2 iteration, we will set up the models using a trend window size for long and short trades. The window size will vary from 10 to 50 trading days at a 5-day increment, and we will fix the Bollinger Band factor at 2.0.
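
The band and entry/exit rules described above can be sketched in plain Python. The helper names and the window default are mine, and this is a simplified illustration of the rules rather than the project's actual backtest code:

```python
import statistics

def bollinger_bands(prices, window=20, k=2.0):
    """Return (middle, upper, lower) band lists; entries are None until
    the rolling window has enough data."""
    middle, upper, lower = [], [], []
    for i in range(len(prices)):
        if i + 1 < window:
            middle.append(None)
            upper.append(None)
            lower.append(None)
            continue
        win = prices[i + 1 - window : i + 1]
        m = statistics.fmean(win)       # rolling mean (middle band)
        s = statistics.pstdev(win)      # rolling standard deviation
        middle.append(m)
        upper.append(m + k * s)
        lower.append(m - k * s)
    return middle, upper, lower

def next_position(price, middle, upper, lower, position):
    """Long-and-short mean-reversion rule: enter long at the lower band,
    enter short at the upper band, exit when price crosses the middle band.
    Positions: 0 = flat, 1 = long, -1 = short."""
    if middle is None:
        return position
    if position == 0:
        if price <= lower:
            return 1
        if price >= upper:
            return -1
    elif position == 1 and price >= middle:
        return 0
    elif position == -1 and price <= middle:
        return 0
    return position
```

The long-only Take1 variant corresponds to dropping the short-entry branch; sweeping window from 10 to 50 in steps of 5 with k fixed at 2.0 reproduces the parameter grid described above.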

ANALYSIS: In iteration Take1, we analyzed the stock prices for Costco Wholesale (COST) between January 1, 2016, and May 7, 2021. The top trading model produced a profit of 101.66 dollars per share. The buy-and-hold approach yielded a gain of 223.02 dollars per share.

In this Take2 iteration, we analyzed the stock prices for Costco Wholesale (COST) between January 1, 2016, and May 7, 2021. The top trading model produced a loss of 1.26 dollars per share. The buy-and-hold approach yielded a gain of 223.02 dollars per share.

CONCLUSION: For the stock of COST during the modeling time frame, the long-and-short trading strategy with fixed Bollinger Band factor did not produce a better return than the buy-and-hold approach. We should consider modeling this stock further by experimenting with more variations of the strategy.

Dataset ML Model: Time series analysis with numerical attributes

Dataset Used: Quandl

The HTML formatted report can be found here on GitHub.

Binary-Class Image Classification Deep Learning Model for PatchCamelyon Grand Challenge Using TensorFlow Take 5

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: This project aims to construct a predictive model using a TensorFlow convolutional neural network (CNN) and document the end-to-end steps using a template. The PatchCamelyon Grand Challenge dataset is a binary-class classification situation where we attempt to predict one of two possible outcomes.

INTRODUCTION: The PatchCamelyon benchmark is a new and challenging image classification dataset. It consists of 327,680 color images (96 × 96 px) extracted from histopathologic scans of lymph node sections. Each image is annotated with a binary label indicating the presence of metastatic tissue. This dataset provides a useful benchmark for machine learning models that are bigger than CIFAR10 but smaller than ImageNet.

In iteration Take1, we constructed a CNN model using a simple three-block VGG architecture and tested the model’s performance using a held-out test dataset.

In iteration Take2, we constructed a CNN model using the InceptionV3 architecture and tested the model’s performance using a held-out test dataset.

In iteration Take3, we constructed a CNN model using the ResNet50 architecture and tested the model’s performance using a held-out test dataset.

In iteration Take4, we constructed a CNN model using the DenseNet121 architecture and tested the model’s performance using a held-out test dataset.

In this Take5 iteration, we will construct a CNN model using the MobileNetV3Small architecture and test the model’s performance using a held-out test dataset.
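
A rough sketch of how such a model might be assembled with the Keras applications API follows; the pooling head, optimizer, and the choice of weights=None are my illustrative assumptions, since the post does not show its training configuration or say whether pretrained weights were used:

```python
import tensorflow as tf

# MobileNetV3Small backbone over the 96x96 RGB patches; weights=None keeps
# the sketch self-contained (no pretrained-weight download).
backbone = tf.keras.applications.MobileNetV3Small(
    input_shape=(96, 96, 3), include_top=False, weights=None)

inputs = tf.keras.Input(shape=(96, 96, 3))
x = backbone(inputs)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # binary label
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
print(model.output_shape)
```

Swapping the backbone constructor for InceptionV3, ResNet50, or DenseNet121 gives the earlier iterations; the single sigmoid unit matches the binary metastatic-tissue label.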

ANALYSIS: In iteration Take1, the baseline model’s performance achieved an accuracy score of 79.83% on the validation dataset after ten epochs. After we applied the final model to the test dataset, it achieved an accuracy score of 79.00%.

In iteration Take2, the InceptionV3 model’s performance achieved an accuracy score of 83.74% on the validation dataset after ten epochs. After we applied the final model to the test dataset, it achieved an accuracy score of 79.00%.

In iteration Take3, the ResNet50 model’s performance achieved an accuracy score of 85.09% on the validation dataset after ten epochs. After we applied the final model to the test dataset, it achieved an accuracy score of 78.05%.

In iteration Take4, the DenseNet121 model’s performance achieved an accuracy score of 85.62% on the validation dataset after ten epochs. After we applied the final model to the test dataset, it achieved an accuracy score of 80.01%.

In this Take5 iteration, the MobileNetV3Small model’s performance achieved an accuracy score of 82.63% on the validation dataset after ten epochs. After we applied the final model to the test dataset, it achieved an accuracy score of 78.34%.

CONCLUSION: In this iteration, the MobileNetV3Small CNN model appeared to be suitable for modeling this dataset. We should consider experimenting with TensorFlow for further modeling.

Dataset Used: PatchCamelyon Grand Challenge

Dataset ML Model: Binary-class image classification with numerical attributes

Dataset Reference: https://patchcamelyon.grand-challenge.org/

A potential source of performance benchmarks: https://patchcamelyon.grand-challenge.org/evaluation/challenge/leaderboard/

The HTML formatted report can be found here on GitHub.

Algorithmic Trading Model for Trend-Following with Bollinger Bands Strategy Using Python Take 1

NOTE: This script is for learning purposes only and does not constitute a recommendation for buying or selling any stock mentioned in this script.

SUMMARY: This project aims to construct and test an algorithmic trading model and document the end-to-end steps using a template.

INTRODUCTION: This algorithmic trading model examines a simple mean-reversion strategy for a stock. The model enters a position when the price reaches either the upper or lower Bollinger Band computed over the last X trading days. The model exits the trade when the stock price crosses the middle Bollinger Band of the same window size.

In this Take1 iteration, we will set up the models using a trend window size for long trades only. The window size will vary from 10 to 50 trading days at a 5-day increment, and we will fix the Bollinger Band factor at 2.0.

ANALYSIS: In this Take1 iteration, we analyzed the stock prices for Costco Wholesale (COST) between January 1, 2016, and May 7, 2021. The top trading model produced a profit of 101.66 dollars per share. The buy-and-hold approach yielded a gain of 223.02 dollars per share.

CONCLUSION: For the stock of COST during the modeling time frame, the long-only trading strategy with fixed Bollinger Band factor did not produce a better return than the buy-and-hold approach. We should consider modeling this stock further by experimenting with more variations of the strategy.

Dataset ML Model: Time series analysis with numerical attributes

Dataset Used: Quandl

The HTML formatted report can be found here on GitHub.

Binary-Class Image Classification Deep Learning Model for PatchCamelyon Grand Challenge Using TensorFlow Take 4

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: This project aims to construct a predictive model using a TensorFlow convolutional neural network (CNN) and document the end-to-end steps using a template. The PatchCamelyon Grand Challenge dataset is a binary-class classification situation where we attempt to predict one of two possible outcomes.

INTRODUCTION: The PatchCamelyon benchmark is a new and challenging image classification dataset. It consists of 327,680 color images (96 × 96 px) extracted from histopathologic scans of lymph node sections. Each image is annotated with a binary label indicating the presence of metastatic tissue. This dataset provides a useful benchmark for machine learning models that are bigger than CIFAR10 but smaller than ImageNet.

In iteration Take1, we constructed a CNN model using a simple three-block VGG architecture and tested the model’s performance using a held-out test dataset.

In iteration Take2, we constructed a CNN model using the InceptionV3 architecture and tested the model’s performance using a held-out test dataset.

In iteration Take3, we constructed a CNN model using the ResNet50 architecture and tested the model’s performance using a held-out test dataset.

In this Take4 iteration, we will construct a CNN model using the DenseNet121 architecture and test the model’s performance using a held-out test dataset.

ANALYSIS: In iteration Take1, the baseline model’s performance achieved an accuracy score of 79.83% on the validation dataset after ten epochs. After we applied the final model to the test dataset, it achieved an accuracy score of 79.00%.

In iteration Take2, the InceptionV3 model’s performance achieved an accuracy score of 83.74% on the validation dataset after ten epochs. After we applied the final model to the test dataset, it achieved an accuracy score of 79.00%.

In iteration Take3, the ResNet50 model’s performance achieved an accuracy score of 85.09% on the validation dataset after ten epochs. After we applied the final model to the test dataset, it achieved an accuracy score of 78.05%.

In this Take4 iteration, the DenseNet121 model’s performance achieved an accuracy score of 85.62% on the validation dataset after ten epochs. After we applied the final model to the test dataset, it achieved an accuracy score of 80.01%.

CONCLUSION: In this iteration, the DenseNet121 CNN model appeared to be suitable for modeling this dataset. We should consider experimenting with TensorFlow for further modeling.

Dataset Used: PatchCamelyon Grand Challenge

Dataset ML Model: Binary-class image classification with numerical attributes

Dataset Reference: https://patchcamelyon.grand-challenge.org/

A potential source of performance benchmarks: https://patchcamelyon.grand-challenge.org/evaluation/challenge/leaderboard/

The HTML formatted report can be found here on GitHub.