"But how will you know?"

(From a writer I respect, Seth Godin)

It is worth knowing what something is for. Knowing its purpose helps us figure out how to do it better, how to allocate resources, and how to tell when the work is done.

Many of the things we build or invest in are complicated. They serve multiple purposes, must please many participants, and carry competing priorities.

So the question arises: "How will we know whether it is working?" That is a powerful question.

It also opens the door to a useful conversation about the true destination.

Binary Classification Model for Kaggle Tabular Playground Series 2021 Apr Using Python and XGBoost

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: This project aims to construct a predictive model using various machine learning algorithms and document the end-to-end steps using a template. The Kaggle Tabular Playground Apr 2021 dataset is a binary classification situation where we attempt to predict one of the two possible outcomes.

INTRODUCTION: Kaggle wants to provide an approachable environment for people who are early in their data science journey. Since January 2021, Kaggle has hosted playground-style competitions with fun, less complex tabular datasets. The dataset used for this competition is synthetic but based on the real Titanic dataset and generated using a CTGAN. Its statistical properties are very similar to those of the original Titanic dataset, but there is no way to cheat by using the public Titanic labels for predictions.

ANALYSIS: The preliminary XGBoost model achieved an accuracy benchmark of 0.7707. After a series of tuning trials, the refined XGBoost model processed the training dataset with a final accuracy score of 0.7725. When we applied the final model to Kaggle’s test dataset, the model achieved a ROC score of 0.7832.
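
As a rough illustration of the approach described above (not the project's exact notebook), the sketch below fits a baseline XGBoost classifier on the competition's training file and reports cross-validated accuracy; the file name and the Titanic-style column names are assumptions for illustration.

```python
# A rough sketch, not the project's exact notebook: baseline XGBoost classifier
# on the synthetic Titanic training file (file and column names are assumed).
import pandas as pd
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

train_df = pd.read_csv("train.csv")

# Keep a few numeric/categorical columns, one-hot encode, and impute medians.
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
X = pd.get_dummies(train_df[features], columns=["Sex", "Embarked"], dtype=float)
X = X.fillna(X.median())
y = train_df["Survived"]

# Baseline model evaluated with 5-fold cross-validated accuracy.
model = XGBClassifier(n_estimators=300, max_depth=5, learning_rate=0.1)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("CV accuracy: %.4f (+/- %.4f)" % (scores.mean(), scores.std()))
```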

CONCLUSION: In this iteration, the XGBoost model appeared to be a suitable algorithm for modeling this dataset.

Dataset Used: Kaggle Tabular Playground Series 2021 Apr Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://www.kaggle.com/c/tabular-playground-series-apr-2021

One potential source of performance benchmarks: https://www.kaggle.com/c/tabular-playground-series-apr-2021/leaderboard

The HTML-formatted report can be found on GitHub.

Binary Classification Model for Kaggle Tabular Playground Series 2021 Apr Using Python and Scikit-learn

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: This project aims to construct a predictive model using various machine learning algorithms and document the end-to-end steps using a template. The Kaggle Tabular Playground Apr 2021 dataset is a binary classification situation where we attempt to predict one of the two possible outcomes.

INTRODUCTION: Kaggle wants to provide an approachable environment for people who are early in their data science journey. Since January 2021, Kaggle has hosted playground-style competitions with fun, less complex tabular datasets. The dataset used for this competition is synthetic but based on the real Titanic dataset and generated using a CTGAN. Its statistical properties are very similar to those of the original Titanic dataset, but there is no way to cheat by using the public Titanic labels for predictions.

ANALYSIS: The machine learning algorithms achieved an average accuracy benchmark of 0.7253 on the training dataset. We selected k-Nearest Neighbors and Random Forest to perform the tuning exercises. After a series of tuning trials, the refined k-Nearest Neighbors model processed the training dataset with a final accuracy score of 0.7699. When we processed Kaggle’s test dataset with the final model, the model achieved an accuracy score of 0.7780.
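
The tuning exercise might look roughly like the sketch below, which grid-searches a scaled k-Nearest Neighbors pipeline with scikit-learn; the file name, feature list, and parameter grid are illustrative assumptions rather than the exact settings used in the project.

```python
# A rough sketch of the k-Nearest Neighbors tuning step (file, feature, and
# grid choices are assumptions, not the exact settings used in the project).
import pandas as pd
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

train_df = pd.read_csv("train.csv")
features = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
X = train_df[features].fillna(train_df[features].median())
y = train_df["Survived"]

# Distance-based models need scaled inputs, so scale inside the pipeline.
pipeline = Pipeline([("scaler", StandardScaler()),
                     ("knn", KNeighborsClassifier())])
param_grid = {"knn__n_neighbors": [5, 11, 15, 21],
              "knn__weights": ["uniform", "distance"]}
search = GridSearchCV(pipeline, param_grid, scoring="accuracy",
                      cv=StratifiedKFold(n_splits=5), n_jobs=-1)
search.fit(X, y)
print("Best CV accuracy: %.4f with %s" % (search.best_score_, search.best_params_))
```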

CONCLUSION: In this iteration, the k-Nearest Neighbors model appeared to be a suitable algorithm for modeling this dataset.

Dataset Used: Kaggle Tabular Playground Series 2021 Apr Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://www.kaggle.com/c/tabular-playground-series-apr-2021

One potential source of performance benchmarks: https://www.kaggle.com/c/tabular-playground-series-apr-2021/leaderboard

The HTML-formatted report can be found on GitHub.

Data Validation for Kaggle Tabular Playground Series Apr 2021 Using Python and TensorFlow Data Validation

SUMMARY: The project aims to construct a data validation flow using TensorFlow Data Validation (TFDV) and document the end-to-end steps using a template. The Kaggle Tabular Playground Series Apr 2021 dataset is a binary classification situation where we attempt to predict one of the two possible outcomes.

INTRODUCTION: Kaggle wants to provide an approachable environment for people who are early in their data science journey. Since January 2021, Kaggle has hosted playground-style competitions with fun, less complex tabular datasets. The dataset used for this competition is synthetic but based on the real Titanic dataset and generated using a CTGAN. Its statistical properties are very similar to those of the original Titanic dataset, but there is no way to cheat by using the public Titanic labels for predictions.

Additional Notes: I adapted this workflow from the TensorFlow Data Validation tutorial on TensorFlow.org (https://www.tensorflow.org/tfx/tutorials/data_validation/tfdv_basic). I also plan to build a TFDV script for validating future datasets and building machine learning models.
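
A minimal sketch of such a TFDV script, closely following the tutorial referenced above, might look like this; the CSV file names are placeholders for illustration.

```python
# A minimal TFDV validation script adapted from the TFDV basic tutorial;
# the CSV file names are placeholders for illustration.
import tensorflow_data_validation as tfdv

# Compute descriptive statistics over the training split and infer a schema.
train_stats = tfdv.generate_statistics_from_csv(data_location="train.csv")
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema)

# Validate the test split against the schema inferred from the training data.
test_stats = tfdv.generate_statistics_from_csv(data_location="test.csv")
anomalies = tfdv.validate_statistics(statistics=test_stats, schema=schema)
tfdv.display_anomalies(anomalies)
```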

CONCLUSION: In this iteration, the data validation workflow helped to validate the features and structures of the training, validation, and test datasets. The workflow also generated statistics over different slices of data, which can help track model and anomaly metrics.

Dataset Used: Kaggle Tabular Playground Series 2021 Apr Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://www.kaggle.com/c/tabular-playground-series-apr-2021

The HTML-formatted report can be found on GitHub.

Data Validation for Kaggle Tabular Playground Series Mar 2021 Using Python and TensorFlow Data Validation

SUMMARY: The project aims to construct a data validation flow using TensorFlow Data Validation (TFDV) and document the end-to-end steps using a template. The Kaggle Tabular Playground Series Mar 2021 dataset is a binary classification situation where we attempt to predict one of the two possible outcomes.

INTRODUCTION: Kaggle wants to provide an approachable environment for people who are early in their data science journey. Since January 2021, Kaggle has hosted playground-style competitions with fun, less complex tabular datasets. The dataset used for this competition is synthetic but based on a real dataset and generated using a CTGAN. The original task is to predict the amount of an insurance claim. Although the features are anonymized, they have properties relating to real-world features.

Additional Notes: I adapted this workflow from the TensorFlow Data Validation tutorial on TensorFlow.org (https://www.tensorflow.org/tfx/tutorials/data_validation/tfdv_basic). I also plan to build a TFDV script for validating future datasets and building machine learning models.

CONCLUSION: In this iteration, the data validation workflow helped to validate the features and structures of the training, validation, and test datasets. The workflow also generated statistics over different slices of data, which can help track model and anomaly metrics.
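
Because the competition's test split omits the target column, one way the workflow can avoid flagging that as an anomaly is TFDV's schema environments, sketched below; the file names and the 'target' column name are assumptions for illustration.

```python
# A sketch of TFDV schema environments so the missing 'target' column in the
# test split is not reported as an anomaly (file and column names are assumed).
import tensorflow_data_validation as tfdv

train_stats = tfdv.generate_statistics_from_csv(data_location="train.csv")
schema = tfdv.infer_schema(statistics=train_stats)

# All features belong to both environments by default; drop the label from TEST.
schema.default_environment.append("TRAINING")
schema.default_environment.append("TEST")
tfdv.get_feature(schema, "target").not_in_environment.append("TEST")

# Validate the test split only against the features expected in the TEST environment.
test_stats = tfdv.generate_statistics_from_csv(data_location="test.csv")
anomalies = tfdv.validate_statistics(test_stats, schema, environment="TEST")
tfdv.display_anomalies(anomalies)
```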

Dataset Used: Kaggle Tabular Playground Series 2021 Mar Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://www.kaggle.com/c/tabular-playground-series-mar-2021

The HTML-formatted report can be found on GitHub.

Data Validation for Chicago Taxi Trips Using Python and TensorFlow Data Validation

SUMMARY: The project aims to construct a data validation flow using TensorFlow Data Validation (TFDV) and document the end-to-end steps using a template. The Chicago Taxi Trips dataset is a binary classification situation where we attempt to predict one of the two possible outcomes.

INTRODUCTION: The City of Chicago collects taxi trip data in its role as a regulatory agency. This example notebook illustrates how we can use TensorFlow Data Validation (TFDV) to investigate and visualize datasets. The data validation process includes examining descriptive statistics, inferring a schema, checking for and fixing anomalies, and detecting drift and skew in the dataset.
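
The drift and skew checks mentioned above can be expressed with TFDV's comparators, roughly as in the sketch below; the data locations, feature names ('payment_type', 'company'), and thresholds follow the tutorial's taxi example and are illustrative rather than definitive.

```python
# A sketch of the skew and drift checks, following the TFDV tutorial; the data
# locations, feature names, and thresholds are taken from the tutorial's taxi
# example and are illustrative only.
import tensorflow_data_validation as tfdv

train_stats = tfdv.generate_statistics_from_csv(data_location="data/train/data.csv")
eval_stats = tfdv.generate_statistics_from_csv(data_location="data/eval/data.csv")
serving_stats = tfdv.generate_statistics_from_csv(data_location="data/serving/data.csv")
schema = tfdv.infer_schema(statistics=train_stats)

# Skew: compare the training distribution against serving data for a feature.
tfdv.get_feature(schema, "payment_type").skew_comparator.infinity_norm.threshold = 0.01
# Drift: compare the current training span against a previous span for a feature.
tfdv.get_feature(schema, "company").drift_comparator.infinity_norm.threshold = 0.001

anomalies = tfdv.validate_statistics(train_stats, schema,
                                     previous_statistics=eval_stats,
                                     serving_statistics=serving_stats)
tfdv.display_anomalies(anomalies)
```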

Additional Notes: I adapted this workflow from the TensorFlow Data Validation tutorial on TensorFlow.org (https://www.tensorflow.org/tfx/tutorials/data_validation/tfdv_basic). I also plan to build a TFDV script for validating future datasets and building machine learning models.

CONCLUSION: In this iteration, the data validation workflow helped to validate the features and structures of the training, validation, and test datasets. The workflow also generated statistics over different slices of data, which can help track model and anomaly metrics.

Dataset Used: Chicago Taxi Trips Dataset, with modifications by TensorFlow.org

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://storage.googleapis.com/artifacts.tfx-oss-public.appspot.com/datasets/chicago_data.zip

The HTML-formatted report can be found on GitHub.

Charlie Gilkey on Start Finishing, Part 1

In his book, Start Finishing: How to Go from Idea to Done, Charlie Gilkey discusses how we can follow a nine-step method to convert an idea into a project and get the project done via a reality-based schedule.

These are some of my favorite concepts and takeaways from reading the book.

Chapter 1, “Someday” Can Be Today

In this chapter, Charlie sets up his methodology by discussing why converting an idea into a project is the crucial first step. He offers the following views for us to think about:

  • We often know we are not working on what matters the most. “We don’t do ideas – we do projects. A project is anything that requires time, energy, and attention to complete.”
  • “We thrive by doing our best work.” When we are doing our best work, we will always be on the edge of our capabilities and comfort levels. The primary consideration is not how our best work will support our livelihood but how our best work fits into a meaningful life.
  • We can think of our projects as mirrors and bridges. Projects are mirrors because the things we choose to work on reflect what is going on in our inner and outer worlds.
  • Projects are also bridges because we can create our souls’ paths when we are doing the work.
  • Some people are blessed to have a narrow set of interests that propel them to take a specific path. Most of us seem to have a set of “scattered” interests as we have difficulty fitting ourselves into one easy label. We should embrace the reality that there is not just one domain for our best work. At the same time, every aspect of our interests we choose to dip into will require upkeep in the form of projects.
  • “We can create new realities for ourselves, but only when we let go of the idea that we’re uniquely defective.”

Confusing identity with strategy

(From a writer I respect, Seth Godin)

Who we are and what we do are not exactly the same thing.

But sometimes, what we do can change who we are.

Our identity describes the person we see in the mirror, the group we identify with, and the version of ourselves (and of reality) that we return to again and again. "I'm not a writer," "I'm not an entrepreneur," or "I'm not a leader" are the definitive statements we give ourselves.

But as the world keeps changing, circumstances change with it.

When our identity does not match the reality of the world around us, all of us struggle.

When faced with that mismatch, it is easy to give up on a possibility, to stop looking for an opportunity, simply because it does not resonate with who we are at this moment. But it is only by doing new things that we begin to form a new identity.