Binary Classification Model for Coronary Artery Disease Using R Take 1

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Z-Alizadeh Sani CAD dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: The researchers collected this data file for coronary artery disease (CAD) diagnosis. Each patient falls into one of two possible categories: CAD or Normal. A patient is categorized as CAD if his/her diameter narrowing is greater than or equal to 50%, and as Normal otherwise. The Z-Alizadeh Sani dataset contains the records of 303 patients, each with 59 features. The features belong to one of four groups: demographic, symptom and examination, ECG, and laboratory and echo features. In this extension, the researchers added three features for the LAD, LCX, and RCA arteries. CAD becomes true when at least one of these three arteries is stenotic. To use this dataset properly for CAD classification, only one of LAD, LCX, RCA, or Cath (the result of angiography) can be present in the dataset. This dataset can be used not only for CAD detection but also for stenosis diagnosis of each of the LAD, LCX, and RCA arteries.
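To make the leakage rule concrete, here is a minimal sketch (shown in Python/pandas for brevity, even though the modeling in this project is done in R) of deriving the binary label and dropping the per-artery outcomes. The file name, column names, and value coding are assumptions about the dataset layout, not code from this project.

# Minimal sketch, not project code: keep only the derived CAD label and drop the
# angiography outcome columns so they cannot leak the answer into the features.
import pandas as pd

# Assumed file name and column names (Cath, LAD, LCX, RCA)
df = pd.read_excel("extention of Z-Alizadeh sani dataset.xlsx")

# Assumed coding: Cath is "Cad" or "Normal"
df["target"] = (df["Cath"] == "Cad").astype(int)

# Remove Cath and the per-artery stenosis labels before modeling
df = df.drop(columns=["Cath", "LAD", "LCX", "RCA"])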

In this iteration, we plan to establish the baseline prediction accuracy for further takes of modeling.

ANALYSIS: The baseline performance of the machine learning algorithms achieved an average accuracy of 83.07%. Two algorithms (Random Forest and Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Gradient Boosting turned in the top overall result and achieved an accuracy metric of 89.19%. By using the optimized parameters, the Gradient Boosting algorithm processed the testing dataset with an accuracy of 77.78%, which was significantly below the prediction accuracy gained from the training data, possibly due to over-fitting.

CONCLUSION: For this iteration, the Gradient Boosting algorithm achieved the best overall training and validation results. For this dataset, the Gradient Boosting algorithm could be considered for further modeling.

Dataset Used: Z-Alizadeh Sani Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/extention+of+Z-Alizadeh+sani+dataset

The HTML formatted report can be found here on GitHub.

Pressfield on the Amateur Qualities, Part 3

In his book, Turning Pro, Steven Pressfield teaches us how to navigate the passage from amateur life to professional practice.

These are my takeaways from reading the book.

The amateur lives for the future.

We place much emphasis on getting what we want, as soon as possible and as cheaply as possible. We take on debt to finance our material needs and mistake that debt for an investment. We look to get what we want today without doing the hard work or asking the hard questions about why we want something. The amateur loves to get what he wants today without paying anything right now.

The amateur lives in the past.

The amateur either looks forward to a hopeful future or spends much time looking backward. The amateur likes to relive past glories and hopes things will go back to the way they were. The past is gone, but the amateur still carries its baggage, which is no longer relevant today. By living in the past or the future, the amateur avoids doing the hard work that is required in the present.

The amateur will be ready tomorrow.

The amateur has a million plans, and they all start tomorrow. The professional may have only one plan, but she is busy working that plan right now.

The amateur gives his power away to others.

The amateur follows a guru or a mentor. He considers himself a disciple of the master and acts only with the master’s permission and blessing. When we wait for the master to tell us what to do, we give away the power to act on our own behalf. When we give away our power and wait to be told, we become a compliant cog, and we give ourselves the excuse we need to hide from the real, hard work.

The amateur is asleep.

The force that can save the amateur is awareness, particularly self-awareness. But to act upon this self-awareness means we must define ourselves and how we differ from others. When we take a stand to define ourselves, we open ourselves up to the judgment, criticism, and rejection of others. The amateur avoids self-definition and the responsibilities that come with it, choosing instead to hide as an undifferentiated individual in the herd.

Using Docker to Build Data Science Environments with RStudio

I have been using Docker to create environments for data science work. With Docker, I was able to create the environments painlessly and with a degree of accuracy and consistency. After getting exposed to using Docker for environment creation, it is hard to imagine doing it any other way.

For more information, I encourage you to check out the Rocker images and this blog post about using RStudio and Docker.

Step 1: Create VM and update OS as necessary

I created virtual machines on VMware using CentOS 7 and made them accessible through bridged networking. I also used the CentOS minimal installation, as the VM just needs the basic components to run Docker. We will need to access the VM via SSH, with port 8787 opened for the RStudio Server instance.

Step 2: Access the VM via SSH (through a non-root user called docker_admin) and install Git with the sudo command.

Step 3: Install Docker for the non-root docker_admin user. Verify the installation with the command “docker image ls.”

More information on installing Docker CE can be found here and here.

It boils down to:

sudo yum install -y yum-utils device-mapper-persistent-data lvm2
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
sudo yum -y install docker-ce
sudo systemctl start docker && sudo systemctl enable docker
sudo usermod -aG docker docker_admin

Step 4: For my environments, I need to clone some R template scripts. You can skip this step if you do not need the templates.

git clone https://github.com/daines-analytics/template-latest.git examples

For my environments, I also need to make some environment variables accessible to the scripts. Again, this step may not be mandatory for your installation.

scp .Renviron cloud_user@<IP_Address>:/home/cloud_user

Step 5: Create the Dockerfile or use the one from the template directory

FROM rocker/verse
LABEL com.dainesanalytics.rstudio.version=v1.0
RUN Rscript -e "install.packages(c('knitr', 'tidyverse', 'caret', 'corrplot', 'mailR', 'DMwR', 'ROCR', 'Hmisc', 'randomForest', 'e1071', 'elasticnet', 'gbm', 'xgboost'))"
COPY --chown=rstudio:rstudio .Renviron /home/rstudio
COPY --chown=rstudio:rstudio examples/ /home/rstudio

My environments require many of the machine learning packages, but these packages may not be mandatory for your installation.

Step 6: Build the Docker image with the command:

docker image build -t rstudio/nonroot:v1 .

Step 7: Run the Docker container with the command:

docker container run --rm -e PASSWORD=rserver -p 8787:8787 \
  --name rstudio-server rstudio/nonroot:v1

The password can be any string, and the RStudio Server just requires one.

Step 8: After we are done with the container and/or the virtual machine, we can shut down the container with the command:

docker container stop [container ID]

The templates (R and Docker) can be found here on GitHub.

Breathe

In his podcast, Akimbo [https://www.akimbo.me/], Seth Godin teaches us how to adopt a posture of possibility, change the culture, and choose to make a difference. Here are my takeaways from the episode.

In this podcast, Seth used the atmosphere as an example to discuss how human beings think about the future and about making changes.

Scientists have been measuring the changes happening to the atmosphere, and those changes are undisputed. The observed increase is believed to be the continuation of a trend that began in the nineteenth century with the Industrial Revolution. Fossil fuel combustion and the clearing of virgin forests are believed to be the primary contributors.

Along with those changes, we can also easily infer the negative impacts of having too much carbon dioxide in the atmosphere. The question is why it is so difficult for humans to do something about this atmospheric cancer.

Part of the reason change is so difficult is that the issue is portrayed as a political one rather than a scientific one. The leaders of major industrial companies, the major contributors of CO2, have decided to turn the issue into a political discussion because they do not wish to upset their investors and want only to maintain the status quo. When the issue becomes political, the industrial companies are no longer on the hook to take the initiative to solve the problem. They have an escape from responsibility and can use lobbying to influence the outcome in their favor.

Humans are not good at thinking about the future. We are, however, capable of putting an enormous amount of effort into last-minute emergencies. Human nature also does not like percentages, and most of us do not want to understand probability. Those things make it difficult to talk about how the world is going to be in twenty years. We further complicate the issue by making it political. That turns a scientific fact into a contest about who is going to win and who is going to lose.

Let us not forget the fact that people in developing countries often live near the ocean. Those people have the smallest voice in this conversation and are going to be the most impacted. What we have on hand is a clash of science, industry, and politics in a way that makes it very difficult for our culture, our humanity, to react and respond appropriately.

We organize our culture around the easiest thing to sell, the easiest thing to talk about, and how to keep things the way they are. We want to avoid the uncertain leap into the future at all costs.

Another easy thing to do in our culture is to create division. We divide ourselves by arguing with each other, pushing the other away, and saying “You are not on my team so go away.” We happen to have a media system that profits from such division.

One approach to addressing the issue is to portray it as a chronic degenerative disease. What we have here is atmospheric cancer. It is a disease because it is scientific, it can be measured, and it is easy to test. We cannot deny it exists, but, if we hurry, we have an opportunity to do something about it.

We need to realize that change is the fuel of capitalism. Disruptive changes often lead to the next breakthrough and the next opportunity, which gives us a chance to make things better by making better things.

We need to resist our innate desire to put off the inevitable death at the end of the road and instead say “people like us do things like this.” What it means to be “people like us” is that we are going to be thoughtful about what is obvious and clear. What it means to “do things like this” is that we must be in sync about what something like this is. We have the opportunity to see what is happening and to open the door for the kind of innovation that can help make things better for everyone.

Using Docker to Build Data Science Environments with Anaconda

I have been using Docker to create environments for data science work. With Docker, I was able to create the environments painlessly and with a degree of accuracy and consistency. After getting exposed to using Docker for environment creation, it is hard to imagine doing it any other way.

For more information, I encourage you to check out the Anaconda images and this blog post about using Anaconda and Docker.

Step 1: Create VM and update OS as necessary

I created virtual machines on VMware using CentOS 7 and made them accessible through bridged networking. I also used the CentOS minimal installation, as the VM just needs the basic components to run Docker. We will need to access the VM via SSH, with a port (8080 in my case) opened for the Jupyter Notebook instance.

Step 2: Access the VM via SSH (through a non-root user called os_admin) and install Git with the sudo command.

Step 3: Install Docker for the non-root os_admin user. Verify the installation with the command “docker image ls.”

More information on installing Docker CE can be found here and here.

It boils down to:

sudo yum install -y yum-utils device-mapper-persistent-data lvm2
sudo yum-config-manager --add-repo \
  https://download.docker.com/linux/centos/docker-ce.repo
sudo yum -y install docker-ce
sudo systemctl start docker && sudo systemctl enable docker
sudo usermod -aG docker os_admin

Step 4: For my environments, I need to clone some Python template scripts. You can skip this step if you do not need the templates.

git clone https://github.com/daines-analytics/template-latest.git examples

For my environments, I also need to make some environment variables accessible to the scripts. Again, this step may not be mandatory for your installation.

scp docker_env.txt cloud_user@<IP_Address>:/home/cloud_user

Step 5: Create the Dockerfile or use the one from the template directory

FROM continuumio/anaconda3
LABEL com.dainesanalytics.anaconda.version=v1.0
EXPOSE 8080
RUN conda install -c conda-forge -y --freeze-installed imbalanced-learn xgboost
RUN useradd -ms /bin/bash dev_user
USER dev_user
WORKDIR /home/dev_user
COPY --chown=dev_user:dev_user examples/ /home/dev_user
CMD /opt/conda/bin/jupyter notebook --ip=0.0.0.0 --port=8080 --no-browser --notebook-dir=/home/dev_user

Step 6: Build the Docker image with the command:

docker image build -t anaconda3/nonroot:v1 .

Step 7: Run the Docker container with the command:

docker container run --rm --env-file docker_env.txt -p 8080:8080 \
  --name jupyter-server anaconda3/nonroot:v1

Step 8: After we are done with the container and/or the virtual machine, we can shut down the container with the command:

docker container stop [container ID]

The templates (Python and Docker) can be found here on GitHub.

Web Scraping of Daines Analytics Blog Using BeautifulSoup

SUMMARY: The purpose of this project is to practice web scraping by gathering specific pieces of information from a website. The web scraping code was written in Python and leveraged the BeautifulSoup module.

INTRODUCTION: Daines Analytics hosts its blog at dainesanalytics.blog. The purpose of this exercise is to practice web scraping by gathering the blog entries from Daines Analytics’ RSS feed. This iteration of the script automatically traverses the RSS feed to capture all blog entries.

Starting URLs: https://dainesanalytics.blog/feed or https://dainesanalytics.blog/feed/?paged=1
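As a rough sketch of the traversal logic (my assumptions, not necessarily the exact code in the repository): request the feed page by page with the paged parameter, parse each page with BeautifulSoup, collect the item entries, and stop when a page returns no items or a non-200 status.

# Minimal sketch, assuming a standard WordPress RSS feed that paginates with
# ?paged=N and returns a non-200 status past the last page; the field names
# and output file are illustrative assumptions.
import json
import requests
from bs4 import BeautifulSoup

def scrape_feed(feed_url="https://dainesanalytics.blog/feed/"):
    entries = []
    page = 1
    while True:
        response = requests.get(feed_url, params={"paged": page})
        if response.status_code != 200:
            break
        soup = BeautifulSoup(response.content, "xml")  # "xml" parser requires lxml
        items = soup.find_all("item")
        if not items:
            break
        for item in items:
            entries.append({
                "title": item.title.get_text(strip=True),
                "link": item.link.get_text(strip=True),
                "published": item.pubDate.get_text(strip=True),
            })
        page += 1
    return entries

if __name__ == "__main__":
    with open("blog_entries.json", "w") as outfile:
        json.dump(scrape_feed(), outfile, indent=2)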

The source code and JSON output can be found here on GitHub.

Gladly Accept Your Incompetence

(From a writer I like and respect, Seth Godin)

You cannot be the best at everything. None of us can.

The question is: how will you deal with it? What do you do in an area where you have no commitment, time, or skill?

One approach is to never talk about it. It is off limits. Do a poor job, but pretend that you are not doing a poor job.

Another approach is to discuss it with enthusiasm. Work hard to find a resource that can help you avoid doing the work poorly. Find a group that will challenge you to become better. Look for a newer and better method to improve.

Beyond that, it is hard to imagine that avoiding the problem will make things better on their own.

Binary Classification Model for Heart Disease Study Using Python Take 5

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Heart Disease dataset is a binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: The original database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by machine learning researchers to this date. The “num” field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).
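As a concrete illustration of this target conversion, here is a minimal pandas sketch; it assumes the processed Cleveland file from the UCI repository with the standard 14-column layout and "?" as the missing-value marker, and the names used are assumptions rather than the project's actual code.

# Minimal sketch: collapse the 0-4 "num" field into a binary target.
# File name, column names, and missing-value handling are assumptions.
import pandas as pd

cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach",
        "exang", "oldpeak", "slope", "ca", "thal", "num"]
df = pd.read_csv("processed.cleveland.data", header=None, names=cols,
                 na_values="?")

# 1, 2, 3, 4 (presence of disease) become 1; 0 (absence) stays 0
df["target"] = (df["num"] > 0).astype(int)
df = df.drop(columns="num")
print(df["target"].value_counts())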

In iteration Take1, we examined the Cleveland dataset and created a Logistic Regression model to fit the data.

In iteration Take2, we examined the Hungarian dataset and created a Logistic Regression model to fit the data.

In iteration Take3, we examined the Switzerland dataset and created an Extra Trees model to fit the data.

In iteration Take4, we examined the Long Beach VA dataset and created an Extra Trees model to fit the data.

In this iteration, we will combine all four datasets and look for a suitable machine learning model to fit the data.

ANALYSIS: The baseline performance of the machine learning algorithms achieved an average accuracy of 76.33%. Two algorithms (Logistic Regression and Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Gradient Boosting turned in the top overall result and achieved an accuracy metric of 80.43%. By using the optimized parameters, the Gradient Boosting algorithm processed the testing dataset with an accuracy of 80.79%, which was slightly better than the prediction accuracy gained from the training data.
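For readers who want to reproduce the general shape of the tuning step, here is a minimal scikit-learn sketch. The file name, parameter grid, split ratio, and random seed are illustrative assumptions and not the project's actual settings; the project also combines the four site files rather than using the Cleveland file alone.

# Minimal sketch of tuning a Gradient Boosting model with cross-validation.
# Data loading mirrors the earlier snippet; all settings shown are assumptions.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach",
        "exang", "oldpeak", "slope", "ca", "thal", "num"]
df = pd.read_csv("processed.cleveland.data", header=None, names=cols,
                 na_values="?").dropna()

X = df.drop(columns="num")
y = (df["num"] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7)

param_grid = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [2, 3, 4],
}

search = GridSearchCV(GradientBoostingClassifier(random_state=7),
                      param_grid, scoring="accuracy", cv=10, n_jobs=-1)
search.fit(X_train, y_train)

print("Best CV accuracy:", search.best_score_)
print("Best parameters:", search.best_params_)
print("Hold-out accuracy:", search.score(X_test, y_test))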

CONCLUSION: For the combined dataset, the Gradient Boosting algorithm achieved the best overall results using the training and testing datasets. For this dataset, Gradient Boosting should be considered for further modeling or production use.

Dataset Used: Heart Disease Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Heart+Disease

One potential source of performance benchmark: https://www.kaggle.com/ronitf/heart-disease-uci

The HTML formatted report can be found here on GitHub.