Binary Classification Model for Truck APS Failure Using Scikit-Learn Take 1

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Truck APS Failure dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure System (APS), which generates pressurized air that supports functions such as braking and gear changes. The dataset's positive class consists of component failures for a specific component of the APS. The negative class consists of trucks with failures in components not related to the APS. The training set contains 60,000 examples in total, of which 59,000 belong to the negative class and 1,000 to the positive class. The test set contains 16,000 examples.

The challenge is to minimize the total cost of a prediction model, defined as the sum of "Cost_1" multiplied by the number of instances with a Type 1 failure and "Cost_2" multiplied by the number of instances with a Type 2 failure. The "Cost_1" variable refers to the cost resulting from a redundant check by a mechanic at the workshop, while the "Cost_2" variable refers to the cost of not catching a faulty truck. The cost of a Type I error (Cost_1) is 10, while the cost of a Type II error (Cost_2) is 500.
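The scoring formula above can be sketched in a few lines of Python. This is only an illustration of the cost metric as described, not code from the project itself; the label convention (1 = APS failure, 0 = non-APS failure) is an assumption.

```python
# Sketch of the challenge's total-cost metric (assumes 1 = APS failure, 0 = not).
COST_FP = 10    # Cost_1: redundant mechanic check (Type I error)
COST_FN = 500   # Cost_2: missed faulty truck (Type II error)

def aps_total_cost(y_true, y_pred):
    """Total cost = Cost_1 * false positives + Cost_2 * false negatives."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return COST_FP * fp + COST_FN * fn

# One redundant check plus one missed failure cost 10 + 500 = 510.
print(aps_total_cost([0, 0, 1, 1], [0, 1, 0, 1]))  # 510
```

The asymmetry of the costs (500 vs. 10) is why the iterations below optimize for sensitivity/recall rather than plain accuracy.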

In this Take1 iteration, we will construct and tune machine learning models for this dataset using the Scikit-Learn library. We will observe the best sensitivity/recall score that we can obtain using the tuned models with the training and test datasets.
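A minimal sketch of the kind of recall-oriented tuning described here, using scikit-learn's GridSearchCV. The synthetic data and the parameter grid are hypothetical stand-ins, not the project's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical stand-in for the imbalanced APS data.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=7)

# Tune for recall (sensitivity), the metric tracked in this iteration.
grid = GridSearchCV(
    RandomForestClassifier(random_state=7),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 10]},
    scoring="recall",
    cv=3,
)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.best_score_, 4))
```

Setting `scoring="recall"` makes the grid search rank parameter combinations by cross-validated sensitivity instead of the default accuracy.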

ANALYSIS: From this Take1 iteration, the performance of the machine learning algorithms achieved an average recall metric of 59.26%. Two algorithms (Extra Trees and Random Forest) produced the top results after the first round of modeling. After a series of tuning trials, the Random Forest model completed the training phase and achieved a score of 68.53%. When configured with the optimized learning parameters, the Random Forest model processed the validation dataset with a score of 66.40%. Furthermore, the optimized model processed the test dataset with a score of 71.73% with a high Type II error rate.

CONCLUSION: For this iteration, the Random Forest model achieved the best overall results using the training and test datasets. For this dataset, we should consider using the Random Forest algorithm for further modeling and testing activities.

Dataset Used: APS Failure at Scania Trucks Data Set

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks

One potential source of performance benchmark: https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks

The HTML formatted report can be found here on GitHub.

Time Series Model for Yearly Copper Prices Using Python and ARIMA

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a time series prediction model and document the end-to-end steps using a template. The Yearly Copper Prices dataset is a time series situation where we are trying to forecast future outcomes based on past data points.

INTRODUCTION: The problem is to forecast the annual price of copper using the value of dollars in 1997 as the basis. The dataset describes a time series of copper prices per ton (in dollars) over 197 years (1800-1996), comprising 197 observations. We used the first 80% of the observations for training various models while holding back the remaining observations for validating the final model.

ANALYSIS: The baseline prediction (or persistence) for the dataset resulted in an RMSE of 22.057. After performing a grid search for the optimal ARIMA parameters, the final non-seasonal ARIMA order was (4, 1, 4). Furthermore, the chosen model processed the validation data with an RMSE of 21.456, which was better than the baseline model as expected.
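The persistence baseline mentioned above simply predicts each year's price as the previous year's observation. A minimal sketch of how that baseline RMSE is computed, using made-up prices rather than the actual copper series:

```python
import math

def persistence_rmse(series):
    """RMSE of the persistence (naive) forecast: each value is
    predicted as the immediately preceding observation."""
    preds, actuals = series[:-1], series[1:]
    sq_err = [(a - p) ** 2 for a, p in zip(actuals, preds)]
    return math.sqrt(sum(sq_err) / len(sq_err))

# Illustrative prices only, not the real dataset.
prices = [100.0, 110.0, 105.0, 120.0]
print(round(persistence_rmse(prices), 3))  # 10.801
```

Any candidate ARIMA configuration must beat this naive score to justify its added complexity, which the (4, 1, 4) model did.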

CONCLUSION: For this dataset, the chosen ARIMA model achieved a satisfactory result, and we should consider using the algorithm for further modeling.

Dataset Used: Yearly Copper Prices 1800 through 1996

Dataset ML Model: Time series forecast with numerical attributes

Dataset Reference: Rob Hyndman and Yangzhuoran Yang (2018). tsdl: Time Series Data Library. v0.1.0. https://pkg.yangzhuoranyang.com/tsdl/

The HTML formatted report can be found here on GitHub.

Multi-Class Classification Model for Crop Mapping in Canada Using TensorFlow Take 3

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Crop Mapping in Canada dataset is a multi-class classification situation where we are trying to predict one of several (more than two) possible outcomes.

INTRODUCTION: This dataset contains fused bi-temporal optical-radar data for cropland classification. The images were collected by RapidEye satellites (optical) and the Unmanned Aerial Vehicle Synthetic Aperture Radar (UAVSAR) system (radar) over an agricultural region near Winnipeg, Manitoba, Canada in 2012. There are 2 x 49 radar features and 2 x 38 optical features for two dates, 05 and 14 July 2012. Seven crop type classes exist for this dataset: 1-Corn; 2-Peas; 3-Canola; 4-Soybeans; 5-Oats; 6-Wheat; and 7-Broadleaf.

In the previous Scikit-Learn iterations, we constructed and tuned machine learning models for this dataset using the Scikit-Learn and XGBoost libraries. We also observed the best accuracy result that we could obtain using the tuned models with the training, validation, and test datasets.

In iteration Take1, we constructed and tuned machine learning models for this dataset using TensorFlow with three layers. We also observed the best accuracy result that we could obtain using the tuned models with the validation and test datasets.

In iteration Take2, we constructed and tuned machine learning models for this dataset using TensorFlow with four layers. We also observed the best accuracy result that we could obtain using the tuned models with the validation and test datasets.

In this Take3 iteration, we will construct and tune machine learning models for this dataset using TensorFlow with five layers. We will observe the best accuracy result that we can obtain using the tuned models with the validation and test datasets.
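A Keras sketch of a five-layer dense network of the kind described. The layer widths and activations are assumptions for illustration, not the tuned architecture; the 174-feature input size follows from the dataset description (2 x 49 radar + 2 x 38 optical features).

```python
import tensorflow as tf

# Hypothetical five-layer dense network for the 7-class crop problem.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(174,)),
    tf.keras.layers.Dense(96, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(7, activation="softmax"),  # one unit per crop class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

With integer class labels (1-7 remapped to 0-6), `sparse_categorical_crossentropy` avoids the need to one-hot encode the targets.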

ANALYSIS: From the previous Scikit-Learn iterations, the optimized Extra Trees model processed the testing dataset with an accuracy of 99.74%. The optimized XGBoost model processed the testing dataset with an accuracy of 99.84%.

From iteration Take1, the performance of the three-layer TensorFlow model achieved an accuracy score of 99.82% with the training dataset. After a series of tuning trials, the TensorFlow model processed the validation dataset with an accuracy score of 99.83%, which was consistent with the prediction from the training result. When configured with the optimized parameters, the TensorFlow model processed the test dataset with an accuracy score of 99.84%, which was consistent with the prediction results from the training and validation phases.

From iteration Take2, the performance of the four-layer TensorFlow model achieved an accuracy score of 99.72% with the training dataset. After a series of tuning trials, the TensorFlow model processed the validation dataset with an accuracy score of 99.82%, which was consistent with the prediction from the training result. When configured with the optimized parameters, the TensorFlow model processed the test dataset with an accuracy score of 99.85%, which was consistent with the prediction results from the training and validation phases.

From this Take3 iteration, the performance of the five-layer TensorFlow model achieved an accuracy score of 99.83% with the training dataset. After a series of tuning trials, the TensorFlow model processed the validation dataset with an accuracy score of 99.82%, which was consistent with the prediction from the training result. When configured with the optimized parameters, the TensorFlow model processed the test dataset with an accuracy score of 99.83%, which was consistent with the prediction results from the training and validation phases.

CONCLUSION: For this dataset, the five-layer TensorFlow model achieved a satisfactory result, and we should consider using it for future modeling activities.

Dataset Used: Crop Mapping in Canada Data Set

Dataset ML Model: Multi-Class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Crop+mapping+using+fused+optical-radar+data+set

The HTML formatted report can be found here on GitHub.

Multi-Class Classification Model for Crop Mapping in Canada Using TensorFlow Take 2

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Crop Mapping in Canada dataset is a multi-class classification situation where we are trying to predict one of several (more than two) possible outcomes.

INTRODUCTION: This dataset contains fused bi-temporal optical-radar data for cropland classification. The images were collected by RapidEye satellites (optical) and the Unmanned Aerial Vehicle Synthetic Aperture Radar (UAVSAR) system (radar) over an agricultural region near Winnipeg, Manitoba, Canada in 2012. There are 2 x 49 radar features and 2 x 38 optical features for two dates, 05 and 14 July 2012. Seven crop type classes exist for this dataset: 1-Corn; 2-Peas; 3-Canola; 4-Soybeans; 5-Oats; 6-Wheat; and 7-Broadleaf.

In the previous Scikit-Learn iterations, we constructed and tuned machine learning models for this dataset using the Scikit-Learn and XGBoost libraries. We also observed the best accuracy result that we could obtain using the tuned models with the training, validation, and test datasets.

In iteration Take1, we constructed and tuned machine learning models for this dataset using TensorFlow with three layers. We also observed the best accuracy result that we could obtain using the tuned models with the validation and test datasets.

In this Take2 iteration, we will construct and tune machine learning models for this dataset using TensorFlow with four layers. We will observe the best accuracy result that we can obtain using the tuned models with the validation and test datasets.

ANALYSIS: From the previous Scikit-Learn iterations, the optimized Extra Trees model processed the testing dataset with an accuracy of 99.74%. The optimized XGBoost model processed the testing dataset with an accuracy of 99.84%.

From iteration Take1, the performance of the three-layer TensorFlow model achieved an accuracy score of 99.82% with the training dataset. After a series of tuning trials, the TensorFlow model processed the validation dataset with an accuracy score of 99.83%, which was consistent with the prediction from the training result. When configured with the optimized parameters, the TensorFlow model processed the test dataset with an accuracy score of 99.84%, which was consistent with the prediction results from the training and validation phases.

From this Take2 iteration, the performance of the four-layer TensorFlow model achieved an accuracy score of 99.72% with the training dataset. After a series of tuning trials, the TensorFlow model processed the validation dataset with an accuracy score of 99.82%, which was consistent with the prediction from the training result. When configured with the optimized parameters, the TensorFlow model processed the test dataset with an accuracy score of 99.85%, which was consistent with the prediction results from the training and validation phases.

CONCLUSION: For this dataset, the four-layer TensorFlow model achieved a satisfactory result, and we should consider using it for future modeling activities.

Dataset Used: Crop Mapping in Canada Data Set

Dataset ML Model: Multi-Class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Crop+mapping+using+fused+optical-radar+data+set

The HTML formatted report can be found here on GitHub.

Seth Godin’s Akimbo: Levi Strauss and the Gold Rush

In his Akimbo podcast, Seth Godin teaches us how to adopt a posture of possibility, change the culture, and choose to make a difference. Here are my takeaways from the episode.

In this podcast, Seth discusses culture shifts and why we should pay proper attention to those shifts so we can respond to them effectively.

Seth used Levi Strauss as an example of an organization that responded to trend and culture shifts. The fashion of jeans, and Levi Strauss with it, benefited from several significant cultural and societal changes.

One of those changes was the shift from a distributed, agrarian economy to one based on manufacturing industrialism. Another was World War II, during which the government deemed jeans an essential wartime item.

After the war, the growth of industrialism and popular culture made jeans part of the everyday uniform. Subsequently, the revolution in retail trends (e.g., The Gap stores) further fueled the growth of jeans and of Levi Strauss as the leader of that fashion segment.

When there is a shift in culture or economics, the change can force, or open the door for, a company to change its behavior as it grows. Those cultural or technological shifts can often be profound. When a shift is happening around us, we do not have to be at its epicenter for it to change how we do our job or how we spend our day.

The shift brought about by the Internet is having the same profound effect on what we do and how we do things. The change is primarily technological, but it is also a significant rewiring of our culture. This change in network connectivity has enabled so many other changes in our culture.

One lesson we can learn from Levi Strauss's growth is this. Levi Strauss did not cause the gold rush or World War II. The company did not create the 1960s or even the spread of the Gap stores. When we add it all up, what we see is that every single time this company has grown and become more important, it has done so because it responded to the way the world was changing. It did not merely react but responded by working with that shift in the culture and doing something meaningful and vital with it.

With the arrival of the Internet, the same opportunity or threat is available to each of us. We went from not knowing what the Internet was to living in a world completely different from the culture and commerce of only a couple of decades before.

If we are going to build an entity today, we will need to build it on the idea of working with the plasticity of culture. That idea tells us that we can respond to a world that is being enabled by a technology that we do not even need to understand.

But what we must do is figure out how we are going to take these shifts and do something with them that we will be proud of. Responding effectively to the changes, or the "New Normal," means our response and actions will create value.

"That's a good idea."

(From a writer I respect, Seth Godin:)

"Then what happens next?"

Repeat that second question 100 times, because after every good idea there are still at least 100 more steps of iteration, learning, adjustment, innovation, and hard work.

Starting from the wrong idea only wastes energy and time.

But failing to execute those next 100 steps wastes a good idea.

We put enormous pressure on ourselves to have the perfect idea, both because it distracts us from reality and because the hundreds of steps that follow the idea are what make all the difference. Almost every organization you can point to was built on an idea that was neither original nor perfect.

It is the effort, commitment, and continuous development that bring about change.

Multi-Class Classification Model for Crop Mapping in Canada Using TensorFlow Take 1

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Crop Mapping in Canada dataset is a multi-class classification situation where we are trying to predict one of several (more than two) possible outcomes.

INTRODUCTION: This dataset contains fused bi-temporal optical-radar data for cropland classification. The images were collected by RapidEye satellites (optical) and the Unmanned Aerial Vehicle Synthetic Aperture Radar (UAVSAR) system (radar) over an agricultural region near Winnipeg, Manitoba, Canada in 2012. There are 2 x 49 radar features and 2 x 38 optical features for two dates, 05 and 14 July 2012. Seven crop type classes exist for this dataset: 1-Corn; 2-Peas; 3-Canola; 4-Soybeans; 5-Oats; 6-Wheat; and 7-Broadleaf.

In the previous Scikit-Learn iterations, we constructed and tuned machine learning models for this dataset using the Scikit-Learn and the XGBoost libraries. We also observed the best accuracy result that we could obtain using the tuned models with the training, validation, and test datasets.

In this Take1 iteration, we will construct and tune machine learning models for this dataset using TensorFlow with three layers. We will observe the best accuracy result that we can obtain using the tuned models with the validation and test datasets.

ANALYSIS: From the previous Scikit-Learn iterations, the optimized Extra Trees model processed the testing dataset with an accuracy of 99.74%. The optimized XGBoost model processed the testing dataset with an accuracy of 99.84%.

From this Take1 iteration, the performance of the three-layer TensorFlow model achieved an accuracy score of 99.82% with the training dataset. After a series of tuning trials, the TensorFlow model processed the validation dataset with an accuracy score of 99.83%, which was consistent with the prediction from the training result. When configured with the optimized parameters, the TensorFlow model processed the test dataset with an accuracy score of 99.84%, which was consistent with the prediction results from the training and validation phases.

CONCLUSION: For this dataset, the three-layer TensorFlow model achieved a satisfactory result, and we should consider using it for future modeling activities.

Dataset Used: Crop Mapping in Canada Data Set

Dataset ML Model: Multi-Class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Crop+mapping+using+fused+optical-radar+data+set

The HTML formatted report can be found here on GitHub.

Multi-Class Classification Model for Crop Mapping in Canada Using Scikit-Learn Take 4

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Crop Mapping in Canada dataset is a multi-class classification situation where we are trying to predict one of several (more than two) possible outcomes.

INTRODUCTION: This dataset contains fused bi-temporal optical-radar data for cropland classification. The images were collected by RapidEye satellites (optical) and the Unmanned Aerial Vehicle Synthetic Aperture Radar (UAVSAR) system (radar) over an agricultural region near Winnipeg, Manitoba, Canada in 2012. There are 2 * 49 radar features and 2 * 38 optical features for two dates, 05 and 14 July 2012. Seven crop type classes exist for this dataset: 1-Corn; 2-Peas; 3-Canola; 4-Soybeans; 5-Oats; 6-Wheat; and 7-Broadleaf.

In iteration Take1, we constructed and tuned machine learning models for this dataset using the Scikit-Learn library. We also observed the best accuracy result that we could obtain using the tuned model with the training and test datasets.

In iteration Take2, we made a slight modification to the Take1 experiment by creating an intermediate dataset for validating the models after training. Furthermore, we observed the best accuracy result that we could obtain using the tuned model with the set-aside test dataset.

In iteration Take3, we constructed and tuned machine learning models for this dataset using the XGBoost library. We also observed the best accuracy result that we could obtain using the tuned model with the training and test datasets.

In this Take4 iteration, we will make a slight modification to the Take3 experiment by creating an intermediate dataset for validating the models after training. Furthermore, we will observe the best accuracy result that we can obtain using the tuned model with the set-aside test dataset.
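The "intermediate dataset for validating the models" can be carved out with two successive splits. A minimal scikit-learn sketch with hypothetical data and proportions, not the project's actual split ratios:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the crop-mapping features and labels.
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           random_state=7)

# First split off the set-aside test data, then carve a validation
# set out of the remainder (an illustrative 60/20/20 split).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=7)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

The validation set is used for tuning decisions, so the test set stays untouched until the final evaluation.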

ANALYSIS: From iteration Take1, the performance of the machine learning algorithms achieved a baseline average accuracy of 99.24%. Two algorithms (Extra Trees and k-Nearest Neighbors) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, the Extra Trees model processed the training dataset with an accuracy score of 99.72%. When configured with the optimized parameters, the Extra Trees model processed the test dataset with an accuracy score of 99.74%, which was consistent with the prediction from the training dataset.

From iteration Take2, the performance of the machine learning algorithms achieved a baseline average accuracy of 99.18%. Two algorithms (Extra Trees and k-Nearest Neighbors) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, the Extra Trees model processed the validation dataset with an accuracy score of 99.72%. When configured with the optimized parameters, the Extra Trees model processed the test dataset with an accuracy score of 99.74%, which was consistent with the prediction results from the training and validation phases.

From iteration Take3, the performance of the XGBoost model achieved an accuracy score of 99.83%. After a series of tuning trials, the XGBoost model processed the test dataset with an accuracy score of 99.83%, which was consistent with the prediction results from the training phase.

From this Take4 iteration, the performance of the XGBoost model achieved an accuracy score of 99.79% with the training dataset. After a series of tuning trials, the XGBoost model processed the validation dataset with an accuracy score of 99.80%, which was consistent with the prediction from the training result. When configured with the optimized parameters, the XGBoost model processed the test dataset with an accuracy score of 99.84%, which was consistent with the prediction results from the training and validation phases.

CONCLUSION: For this iteration, the XGBoost model achieved the best overall results using the training and test datasets. For this dataset, we should consider using the XGBoost algorithm for further modeling.

Dataset Used: Crop Mapping in Canada Data Set

Dataset ML Model: Multi-Class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Crop+mapping+using+fused+optical-radar+data+set

The HTML formatted report can be found here on GitHub.