SUMMARY: This project aims to construct a predictive model using various machine learning algorithms and document the end-to-end steps using a template. The Kaggle Tabular Playground Series 2021 Jan dataset is a regression situation where we are trying to predict the value of a continuous variable.

INTRODUCTION: Kaggle wants to provide an approachable environment for people relatively new to their data science journey. Since January 2021, they have been hosting playground-style competitions on Kaggle with fun, less complex tabular datasets. These competitions are great for people looking for something between the Titanic Getting Started competition and a Featured competition.

ANALYSIS: The performance of the preliminary XGBoost model achieved an RMSE benchmark of 0.5068. After a series of tuning trials, the refined XGBoost model processed the training dataset with a final RMSE score of 0.4883. When we applied the final model to Kaggle’s test dataset, the model achieved an RMSE score of 0.6996.
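For reference, RMSE is the square root of the mean squared residual. A minimal sketch of the metric in plain Python (the numbers below are illustrative only, not the competition data):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the average squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative values only, not the competition data.
print(rmse([7.2, 7.9, 8.1], [7.0, 8.0, 8.3]))  # → approximately 0.1732
```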

CONCLUSION: In this iteration, the XGBoost model appeared to be a suitable algorithm for modeling this dataset.

Dataset Used: Kaggle Tabular Playground Series 2021 Jan Data Set

Dataset ML Model: Regression with numerical attributes

Dataset Reference: https://www.kaggle.com/c/tabular-playground-series-jan-2021

One potential source of performance benchmarks: https://www.kaggle.com/c/tabular-playground-series-jan-2021/leaderboard

The HTML formatted report can be found here on GitHub.

SUMMARY: This project aims to construct a predictive model using various machine learning algorithms and document the end-to-end steps using a template. The Kaggle Tabular Playground Series 2021 Jan dataset is a regression situation where we are trying to predict the value of a continuous variable.

INTRODUCTION: Kaggle wants to provide an approachable environment for people relatively new to their data science journey. Since January 2021, they have been hosting playground-style competitions on Kaggle with fun, less complex tabular datasets. These competitions are great for people looking for something between the Titanic Getting Started competition and a Featured competition.

ANALYSIS: The average performance of the machine learning algorithms achieved an RMSE benchmark of 0.5276 using the training dataset. We selected ElasticNet and Extra Trees to perform the tuning exercises. After a series of tuning trials, the refined Extra Trees model processed the training dataset with a final RMSE score of 0.4949. When we applied the final model to Kaggle’s test dataset, the model achieved an RMSE score of 0.7038.
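The post’s actual pipeline is not shown here. As a rough illustration of the kind of model involved, this sketch fits scikit-learn’s ExtraTreesRegressor on synthetic data and scores it with cross-validated RMSE; all data and parameter values below are stand-ins, not the competition setup:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the competition's numerical features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 14))
y = 0.5 * X[:, 0] + rng.normal(scale=0.5, size=500)

model = ExtraTreesRegressor(n_estimators=100, random_state=0)
scores = cross_val_score(model, X, y, scoring="neg_root_mean_squared_error", cv=3)
print(f"cross-validated RMSE: {-scores.mean():.4f}")
```

Tuning trials like those described above would typically sweep hyperparameters such as n_estimators and max_depth, for example with scikit-learn’s GridSearchCV.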

CONCLUSION: In this iteration, the Extra Trees model appeared to be a suitable algorithm for modeling this dataset.

Dataset Used: Kaggle Tabular Playground Series 2021 Jan Data Set

Dataset ML Model: Regression with numerical attributes

Dataset Reference: https://www.kaggle.com/c/tabular-playground-series-jan-2021

One potential source of performance benchmarks: https://www.kaggle.com/c/tabular-playground-series-jan-2021/leaderboard

The HTML formatted report can be found here on GitHub.

In this podcast, Seth discusses the three “malpractices” of presenting graphs and charts. We should understand these pitfalls and potential manipulations because it is always a sound idea to know what a graph or chart is trying to say and why.

Showing someone a graph, a chart, or a poll is an intentional act. We choose what to show because we want to make a point, very often to amplify the intent of a story. At the same time, we frequently violate three simple rules in the process.

Malpractice number one is changing axes or scales when comparing two or more things. We often manipulate the scales to emphasize or exaggerate minor differences when, most of the time, the differences are not significant at all.

Malpractice number two is using various graphic elements to emphasize the points. A three-dimensional volume is different from a two-dimensional area, and a two-dimensional area is different from a one-dimensional line. When we use a three-dimensional object to illustrate a change along a single axis, it can create a false impression of the impact of the change.

The third malpractice has to do with how we communicate poll results. First, many polls are not sufficiently random for the results to be meaningful. Second, poll results describe people’s feelings at the time of the survey, but the same people may act very differently some time after the poll. We often mistake polls for reality or certainty when they are merely odds.

The takeaway lesson is that when someone presents a graph or chart, we need to understand the fundamental point that person is trying to convey. We need to ask the right questions to know whether the presenters are presenting things fairly or trying to make a point.

When we present graphs and charts, we should strive to create ones that are inherently straightforward, honest, and accurately constructed, yet still illustrate our perspective. One tip for making a chart: state precisely what the chart is trying to convey, strip away all extraneous information to get to the underlying truth, and present it as clearly as we can.

One adds value, and the other does not. One creates value over time, and the other does not.

It can be amusing to imagine our expenses as a kind of investment, but if that were truly the case, we would simply call them investments.

Our tools can be reused, and our assets hold value for us and for others. A skill can be an investment, appreciating as it grows. An expense, on the other hand, only fades in value over time.

SUMMARY: This project aims to construct and test an algorithmic trading model and document the end-to-end steps using a template.

INTRODUCTION: This algorithmic trading model examines a simple mean-reversion strategy for a stock. The model enters a position when the price reaches either the upper or lower Bollinger Bands for the last X number of days. The model will exit the trade when the stock price crosses the middle Bollinger Band of the same window size.
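The entry and exit rules above can be sketched in a few lines. This toy backtest is an illustration only (the post’s actual implementation is not shown): bands are a rolling mean plus or minus k standard deviations, positions open at the outer bands, and close when price crosses back through the middle band.

```python
import statistics

def bollinger_bands(prices, window, k=2.0):
    """Middle/upper/lower bands over the trailing `window` prices."""
    recent = prices[-window:]
    middle = statistics.fmean(recent)
    spread = k * statistics.pstdev(recent)
    return middle, middle + spread, middle - spread

def run_strategy(prices, window=20, k=2.0):
    """Toy long-and-short mean-reversion backtest: enter at the outer
    bands, exit when price crosses back through the middle band."""
    position, entry, pnl = 0, 0.0, 0.0
    for i in range(window, len(prices)):
        middle, upper, lower = bollinger_bands(prices[: i + 1], window, k)
        price = prices[i]
        if position == 0:
            if price <= lower:
                position, entry = 1, price   # buy weakness at the lower band
            elif price >= upper:
                position, entry = -1, price  # sell strength at the upper band
        elif (position == 1 and price >= middle) or (position == -1 and price <= middle):
            pnl += position * (price - entry)  # close at the middle-band cross
            position = 0
    return pnl
```

Sweeping the window size from 10 to 50 trading days in 5-day increments, as the iterations described here do, is then a simple loop over run_strategy(prices, window=w).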

In iteration Take1, we set up the models using a trend window size for long trades only. The window size varied from 10 to 50 trading days at a 5-day increment, and we fixed the Bollinger Band factor at 2.0.

In this Take2 iteration, we will set up the models using a trend window size for long and short trades. The window size will vary from 10 to 50 trading days at a 5-day increment, and we will fix the Bollinger Band factor at 2.0.

ANALYSIS: In iteration Take1, we analyzed the stock prices for Costco Wholesale (COST) between January 1, 2016, and May 7, 2021. The top trading model produced a profit of 101.66 dollars per share. The buy-and-hold approach yielded a gain of 223.02 dollars per share.

In this Take2 iteration, we analyzed the stock prices for Costco Wholesale (COST) between January 1, 2016, and May 7, 2021. The top trading model produced a profit of -1.26 dollars per share. The buy-and-hold approach yielded a gain of 223.02 dollars per share.

CONCLUSION: For the stock of COST during the modeling time frame, the long-and-short trading strategy with fixed Bollinger Band factor did not produce a better return than the buy-and-hold approach. We should consider modeling this stock further by experimenting with more variations of the strategy.

Dataset ML Model: Time series analysis with numerical attributes

Dataset Used: Quandl

The HTML formatted report can be found here on GitHub.

SUMMARY: This project aims to construct a predictive model using a TensorFlow convolutional neural network (CNN) and document the end-to-end steps using a template. The PatchCamelyon Grand Challenge dataset is a binary-class classification situation where we attempt to predict one of two possible outcomes.

INTRODUCTION: The PatchCamelyon benchmark is a new and challenging image classification dataset. It consists of 327,680 color images (96 × 96 px) extracted from histopathologic scans of lymph node sections. Each image is annotated with a binary label indicating the presence of metastatic tissue. The dataset provides a useful benchmark for machine learning models, as it is bigger than CIFAR10 but smaller than ImageNet.

In iteration Take1, we constructed a CNN model using a simple three-block VGG architecture and tested the model’s performance using a held-out test dataset.

In iteration Take2, we constructed a CNN model using the InceptionV3 architecture and tested the model’s performance using a held-out test dataset.

In iteration Take3, we constructed a CNN model using the ResNet50 architecture and tested the model’s performance using a held-out test dataset.

In iteration Take4, we constructed a CNN model using the DenseNet121 architecture and tested the model’s performance using a held-out test dataset.

In this Take5 iteration, we will construct a CNN model using the MobileNetV3Small architecture and test the model’s performance using a held-out test dataset.
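A sketch of what a Take5-style setup might look like in Keras. The post’s exact classification head, preprocessing, and optimizer are not shown, so this uses a common transfer-learning pattern; weights=None is an assumption that keeps the sketch self-contained (no pretrained-weight download):

```python
import tensorflow as tf

# MobileNetV3Small backbone over 96x96 RGB patches; weights=None is an
# assumption for a self-contained sketch, not the post's actual setting.
backbone = tf.keras.applications.MobileNetV3Small(
    input_shape=(96, 96, 3), include_top=False, weights=None)

model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # metastatic tissue: yes/no
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```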

ANALYSIS: In iteration Take1, the baseline model’s performance achieved an accuracy score of 79.83% on the validation dataset after ten epochs. After we applied the final model to the test dataset, the model achieved an accuracy score of 79.00%.

In iteration Take2, the InceptionV3 model’s performance achieved an accuracy score of 83.74% on the validation dataset after ten epochs. After we applied the final model to the test dataset, the model achieved an accuracy score of 79.00%.

In iteration Take3, the ResNet50 model’s performance achieved an accuracy score of 85.09% on the validation dataset after ten epochs. After we applied the final model to the test dataset, the model achieved an accuracy score of 78.05%.

In iteration Take4, the DenseNet121 model’s performance achieved an accuracy score of 85.62% on the validation dataset after ten epochs. After we applied the final model to the test dataset, the model achieved an accuracy score of 80.01%.

In this Take5 iteration, the MobileNetV3Small model’s performance achieved an accuracy score of 82.63% on the validation dataset after ten epochs. After we applied the final model to the test dataset, the model achieved an accuracy score of 78.34%.

CONCLUSION: In this iteration, the MobileNetV3Small CNN model appeared to be suitable for modeling this dataset. We should consider experimenting with TensorFlow for further modeling.

Dataset Used: PatchCamelyon Grand Challenge

Dataset ML Model: Binary-class image classification with numerical attributes

Dataset Reference: https://patchcamelyon.grand-challenge.org/

A potential source of performance benchmarks: https://patchcamelyon.grand-challenge.org/evaluation/challenge/leaderboard/

The HTML formatted report can be found here on GitHub.

SUMMARY: This project aims to construct and test an algorithmic trading model and document the end-to-end steps using a template.

INTRODUCTION: This algorithmic trading model examines a simple mean-reversion strategy for a stock. The model enters a position when the price reaches either the upper or lower Bollinger Bands for the last X number of days. The model will exit the trade when the stock price crosses the middle Bollinger Band of the same window size.

In this Take1 iteration, we will set up the models using a trend window size for long trades only. The window size will vary from 10 to 50 trading days at a 5-day increment, and we will fix the Bollinger Band factor at 2.0.

ANALYSIS: In this Take1 iteration, we analyzed the stock prices for Costco Wholesale (COST) between January 1, 2016, and May 7, 2021. The top trading model produced a profit of 101.66 dollars per share. The buy-and-hold approach yielded a gain of 223.02 dollars per share.

CONCLUSION: For the stock of COST during the modeling time frame, the long-only trading strategy with fixed Bollinger Band factor did not produce a better return than the buy-and-hold approach. We should consider modeling this stock further by experimenting with more variations of the strategy.

Dataset ML Model: Time series analysis with numerical attributes

Dataset Used: Quandl

The HTML formatted report can be found here on GitHub.

SUMMARY: This project aims to construct a predictive model using a TensorFlow convolutional neural network (CNN) and document the end-to-end steps using a template. The PatchCamelyon Grand Challenge dataset is a binary-class classification situation where we attempt to predict one of two possible outcomes.

INTRODUCTION: The PatchCamelyon benchmark is a new and challenging image classification dataset. It consists of 327,680 color images (96 × 96 px) extracted from histopathologic scans of lymph node sections. Each image is annotated with a binary label indicating the presence of metastatic tissue. The dataset provides a useful benchmark for machine learning models, as it is bigger than CIFAR10 but smaller than ImageNet.

In iteration Take1, we constructed a CNN model using a simple three-block VGG architecture and tested the model’s performance using a held-out test dataset.

In iteration Take2, we constructed a CNN model using the InceptionV3 architecture and tested the model’s performance using a held-out test dataset.

In iteration Take3, we constructed a CNN model using the ResNet50 architecture and tested the model’s performance using a held-out test dataset.

In this Take4 iteration, we will construct a CNN model using the DenseNet121 architecture and test the model’s performance using a held-out test dataset.

ANALYSIS: In iteration Take1, the baseline model’s performance achieved an accuracy score of 79.83% on the validation dataset after ten epochs. After we applied the final model to the test dataset, the model achieved an accuracy score of 79.00%.

In iteration Take2, the InceptionV3 model’s performance achieved an accuracy score of 83.74% on the validation dataset after ten epochs. After we applied the final model to the test dataset, the model achieved an accuracy score of 79.00%.

In iteration Take3, the ResNet50 model’s performance achieved an accuracy score of 85.09% on the validation dataset after ten epochs. After we applied the final model to the test dataset, the model achieved an accuracy score of 78.05%.

In this Take4 iteration, the DenseNet121 model’s performance achieved an accuracy score of 85.62% on the validation dataset after ten epochs. After we applied the final model to the test dataset, the model achieved an accuracy score of 80.01%.

CONCLUSION: In this iteration, the DenseNet121 CNN model appeared to be suitable for modeling this dataset. We should consider experimenting with TensorFlow for further modeling.

Dataset Used: PatchCamelyon Grand Challenge

Dataset ML Model: Binary-class image classification with numerical attributes

Dataset Reference: https://patchcamelyon.grand-challenge.org/

A potential source of performance benchmarks: https://patchcamelyon.grand-challenge.org/evaluation/challenge/leaderboard/

The HTML formatted report can be found here on GitHub.

SUMMARY: This project aims to construct a predictive model using a TensorFlow convolutional neural network (CNN) and document the end-to-end steps using a template. The PatchCamelyon Grand Challenge dataset is a binary-class classification situation where we attempt to predict one of two possible outcomes.

INTRODUCTION: The PatchCamelyon benchmark is a new and challenging image classification dataset. It consists of 327,680 color images (96 × 96 px) extracted from histopathologic scans of lymph node sections. Each image is annotated with a binary label indicating the presence of metastatic tissue. The dataset provides a useful benchmark for machine learning models, as it is bigger than CIFAR10 but smaller than ImageNet.

In iteration Take1, we constructed a CNN model using a simple three-block VGG architecture and tested the model’s performance using a held-out test dataset.

In iteration Take2, we constructed a CNN model using the InceptionV3 architecture and tested the model’s performance using a held-out test dataset.

In this Take3 iteration, we will construct a CNN model using the ResNet50 architecture and test the model’s performance using a held-out test dataset.

ANALYSIS: In iteration Take1, the baseline model’s performance achieved an accuracy score of 79.83% on the validation dataset after ten epochs. After we applied the final model to the test dataset, the model achieved an accuracy score of 79.00%.

In iteration Take2, the InceptionV3 model’s performance achieved an accuracy score of 83.74% on the validation dataset after ten epochs. After we applied the final model to the test dataset, the model achieved an accuracy score of 79.00%.

In this Take3 iteration, the ResNet50 model’s performance achieved an accuracy score of 85.09% on the validation dataset after ten epochs. After we applied the final model to the test dataset, the model achieved an accuracy score of 78.05%.

CONCLUSION: In this iteration, the ResNet50 CNN model appeared to be suitable for modeling this dataset. We should consider experimenting with TensorFlow for further modeling.

Dataset Used: PatchCamelyon Grand Challenge

Dataset ML Model: Binary-class image classification with numerical attributes

Dataset Reference: https://patchcamelyon.grand-challenge.org/

A potential source of performance benchmarks: https://patchcamelyon.grand-challenge.org/evaluation/challenge/leaderboard/

The HTML formatted report can be found here on GitHub.

In this podcast, Seth discusses the function of the aperture on a camera lens and uses it as an example of how various gatekeepers have shaped or defined our culture.

Camera lenses are round, but pictures are square, because light enters the camera through a tiny pinhole in the lens. Through that small hole, plenty of photons can work their way to the other side and land on a square piece of film. The pinhole acts as a gatekeeper for the photons.

Many aspects of our culture also have corresponding gatekeepers. The music industry used to have many gatekeepers that worked together in the music ecosystem. The supply chain was made up of listeners, radio program directors, record producers, media executives, and many specialized support personnel and teams.

Together, these supply chains and gatekeepers act as an aperture, a tiny hole between the people who create things and the market that is open to consuming them.

Over time, as the music ecosystem evolved, we got rid of the gatekeepers in many ways. The scarcity of radio time slots and album shelf space is no longer the constraint on music publishing. While a portion of the population still wants to listen to what the gatekeepers pick out for them, acquiring an audience in the age of iTunes and YouTube has clearly illustrated the concept of the “Long Tail” coined by Chris Anderson.

More importantly, the gatekeepers who used to define or shape our culture in many industries are no longer driving it. These gatekeepers existed because we needed them to manage the scarcity of time slots and shelf space. Scarcity comes with opportunity costs: if we play one song on the radio during a time slot, we cannot play another at the same time.

Also, the traditional gatekeepers were conservative, primarily because they were trying to appeal to the largest possible segment of the audience. They did not want to risk alienating any group. Today, the dynamic in the media has shifted from the conservative end toward the edges. Whether something is valid often does not matter; if it bleeds, it leads.

Whether to be conservative or edgy is a critical consideration for anyone trying to do work that will have an impact on the culture. More likely, somewhere in the middle, there is a sweet spot for us as change agents.

As creators of culture, each of us has the chance to hone our voice, practice shipping the work, and figure out who our smallest viable audience is. For that audience, we need to learn to see them, understand them, cater to them, and give them something they want to share. If we can earn permission to do the work for that audience, we can become our own gatekeepers.

Each of us needs to be responsible for what we put our name on. Each of us is going to have a following, small or big, and we can no longer use that as an excuse. We need to stand up for what is right and bring things we are proud of to the world.

The mega-hits will become rarer as the audience fragments into many long-tail segments. It is more likely we will end up somewhere closer to the middle where some people will be able to find their true fans and make the work they are proud of. Doing the work that makes us proud and not hiding behind a badge or a label is the only way to make things better.
