Machine Learning Prediction of Heart Diseases
Source codes:- Heart diseases from GitHub
Dataset :- Heart disease csv file
Objective
This dataset dates from 1988 and include some several databases of different country. On here I use this dataset from Kaggle website for predict the statement of heart disease using Machine Learning Algorithms.
The main object is to predict whether the given patient is having heart disease or not. So it helped for that some several variables such as, age, sex, cholesterol level, blood pressure, types of chest pains etc.
Here I used Machine Learning algorithms which are Decision Tree & Random Forest.
Let's begin by loading the required, needed libraries and importing the Heart_Disease.csv datasheet.
library(readr)
library(ggplot2)
library(Hmisc)
library(forcats)
library(dplyr)
library(psych)
library(C50)
library(caret)
library(randomForest)
df <- read_csv("newdataset/Heart_Disease.csv")
head(df)
## # A tibble: 6 x 14
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope
## <dbl> <chr> <chr> <dbl> <dbl> <lgl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 63 Male asympt~ 145 233 TRUE 0 150 no 2.3 upslo~
## 2 37 Male non-an~ 130 250 FALSE 1 187 no 3.5 upslo~
## 3 41 Female atypic~ 130 204 FALSE 0 172 no 1.4 downs~
## 4 56 Male atypic~ 120 236 FALSE 1 178 no 0.8 downs~
## 5 57 Female typica~ 120 354 FALSE 1 163 yes 0.6 downs~
## 6 57 Male typica~ 140 192 FALSE 1 148 no 0.4 flat
## # ... with 3 more variables: ca <dbl>, thal <chr>, target <chr>
Here head function gives us the top 6 examples of this data for much understanding.
The independent variables are related to the patients details and based on those variables we select the dependent variable as statement of disease which have or not.
age - age in years
sex - sex of the patients
cp - chest pain type
trestbps - resting blood pressure (in mm Hg on admission to the hospital)
chol - serum cholesterols in mh/dl
fbs - (fasting blood sugar > 120 mg/dl)
restecg - resting electrocardiographic result (0- normal; 1- having ST-T wave abnormality; 2- showing probable)
thalach - maximum heart rate achieved
exang - exercise include angina
oldpeak - ST depression induced by exercise relative to rest
slope - the slope of the peak exercise ST segment
ca - number of major vessels (0-3) colored by fluoroscopy
thal - thalassemia
target - disease status
According to nature of the features we implementing the changes.
df$sex <- as.factor(df$sex)
df$cp <- as.factor(df$cp)
df$fbs <- as.factor(df$fbs)
df$restecg <- as.factor(df$restecg)
df$exang <- as.factor(df$exang)
df$slope <- as.factor(df$slope)
df$thal <- as.factor(df$thal)
df$target <- as.factor(df$target)
str(df)
## spec_tbl_df [303 x 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ age : num [1:303] 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 2 1 2 2 2 ...
## $ cp : Factor w/ 4 levels "asymptomatic",..: 1 3 2 2 4 4 2 2 3 3 ...
## $ trestbps: num [1:303] 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : num [1:303] 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : Factor w/ 2 levels "FALSE","TRUE": 2 1 1 1 1 1 1 1 2 1 ...
## $ restecg : Factor w/ 3 levels "0","1","2": 1 2 1 2 2 2 1 2 2 2 ...
## $ thalach : num [1:303] 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 1 1 ...
## $ oldpeak : num [1:303] 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : Factor w/ 3 levels "downsloping",..: 3 3 1 1 1 2 2 1 1 1 ...
## $ ca : num [1:303] 0 0 0 0 0 0 0 0 0 0 ...
## $ thal : Factor w/ 3 levels "fixed defect",..: 2 1 1 1 1 2 1 3 3 1 ...
## $ target : Factor w/ 2 levels "NO","YES": 2 2 2 2 2 2 2 2 2 2 ...
## - attr(*, "spec")=
## .. cols(
## .. age = col_double(),
## .. sex = col_character(),
## .. cp = col_character(),
## .. trestbps = col_double(),
## .. chol = col_double(),
## .. fbs = col_logical(),
## .. restecg = col_double(),
## .. thalach = col_double(),
## .. exang = col_character(),
## .. oldpeak = col_double(),
## .. slope = col_character(),
## .. ca = col_double(),
## .. thal = col_character(),
## .. target = col_character()
## .. )
From the srt() function we can get an idea about structure of the dataset to fine which are the characters and numeric.
Hmisc::describe(df)
## df
##
## 14 Variables 303 Observations
## --------------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10
## 303 0 41 0.999 54.37 10.36 39.1 42.0
## .25 .50 .75 .90 .95
## 47.5 55.0 61.0 66.0 68.0
##
## lowest : 29 34 35 37 38, highest: 70 71 74 76 77
## --------------------------------------------------------------------------------
## sex
## n missing distinct
## 303 0 2
##
## Value Female Male
## Frequency 96 207
## Proportion 0.317 0.683
## --------------------------------------------------------------------------------
## cp
## n missing distinct
## 303 0 4
##
## Value asymptomatic atypical angina non-anginal pain typical angina
## Frequency 23 50 87 143
## Proportion 0.076 0.165 0.287 0.472
## --------------------------------------------------------------------------------
## trestbps
## n missing distinct Info Mean Gmd .05 .10
## 303 0 49 0.995 131.6 19.32 108 110
## .25 .50 .75 .90 .95
## 120 130 140 152 160
##
## lowest : 94 100 101 102 104, highest: 174 178 180 192 200
## --------------------------------------------------------------------------------
## chol
## n missing distinct Info Mean Gmd .05 .10
## 303 0 152 1 246.3 55.95 175.0 188.0
## .25 .50 .75 .90 .95
## 211.0 240.0 274.5 308.8 326.9
##
## lowest : 126 131 141 149 157, highest: 394 407 409 417 564
## --------------------------------------------------------------------------------
## fbs
## n missing distinct
## 303 0 2
##
## Value FALSE TRUE
## Frequency 258 45
## Proportion 0.851 0.149
## --------------------------------------------------------------------------------
## restecg
## n missing distinct
## 303 0 3
##
## Value 0 1 2
## Frequency 147 152 4
## Proportion 0.485 0.502 0.013
## --------------------------------------------------------------------------------
## thalach
## n missing distinct Info Mean Gmd .05 .10
## 303 0 91 1 149.6 25.77 108.1 116.0
## .25 .50 .75 .90 .95
## 133.5 153.0 166.0 176.6 181.9
##
## lowest : 71 88 90 95 96, highest: 190 192 194 195 202
## --------------------------------------------------------------------------------
## exang
## n missing distinct
## 303 0 2
##
## Value no yes
## Frequency 204 99
## Proportion 0.673 0.327
## --------------------------------------------------------------------------------
## oldpeak
## n missing distinct Info Mean Gmd .05 .10
## 303 0 40 0.964 1.04 1.225 0.0 0.0
## .25 .50 .75 .90 .95
## 0.0 0.8 1.6 2.8 3.4
##
## lowest : 0.0 0.1 0.2 0.3 0.4, highest: 4.0 4.2 4.4 5.6 6.2
## --------------------------------------------------------------------------------
## slope
## n missing distinct
## 303 0 3
##
## Value downsloping flat upsloping
## Frequency 142 140 21
## Proportion 0.469 0.462 0.069
## --------------------------------------------------------------------------------
## ca
## n missing distinct Info Mean Gmd
## 303 0 5 0.795 0.7294 1.005
##
## lowest : 0 1 2 3 4, highest: 0 1 2 3 4
##
## Value 0 1 2 3 4
## Frequency 175 65 38 20 5
## Proportion 0.578 0.215 0.125 0.066 0.017
## --------------------------------------------------------------------------------
## thal
## n missing distinct
## 303 0 3
##
## Value fixed defect normal reversible defect
## Frequency 166 20 117
## Proportion 0.548 0.066 0.386
## --------------------------------------------------------------------------------
## target
## n missing distinct
## 303 0 2
##
## Value NO YES
## Frequency 138 165
## Proportion 0.455 0.545
## --------------------------------------------------------------------------------
From describe function we can explore the content of the variable more than summary function.
According to these things we can extract some of the features of the data as graphically. Then we can observe the things more than just appearing.
Correlations
pairs.panels(df[c("age", "trestbps", "chol", "thalach", "oldpeak")])
From above output the diagonal, the scatterplots has been replaced with a correlation matrix. On the diagonal, a histogram depicting the distribution of values for each feature is shown. Finally, the scatterplots below the diagonal now are presented with additional visual information.
So we have highest correlation between age & maximum heart. That is a negative (-0.4); lowest correlation between maximum heart & serum cholesterols (-0.01)
(The oval-shaped object on each scatterplot is a correlation ellipse. It provides a visualization of how strongly correlated the variables are. The dot at the center of the ellipse indicates the point of the mean value for the x axis variable and y axis variable. The correlation between the two variables is indicated by the shape of the ellipse; the more it is stretched, the stronger the correlation.)
Visualization of the gender
Hmisc::describe(df$sex)
## df$sex
## n missing distinct
## 303 0 2
##
## Value Female Male
## Frequency 96 207
## Proportion 0.317 0.683
ggplot(data = df) +
geom_bar(mapping = aes(sex), fill = c("lightyellow3", "lavenderblush3")) +
labs(
title = paste("Gender"),
y = "Number of patients",
x = "Gender" ) +
theme_dark()
There are much more male patients than female. It's like half of the male patients.
Visualization of the chest pain type
Hmisc::describe(df$cp)
## df$cp
## n missing distinct
## 303 0 4
##
## Value asymptomatic atypical angina non-anginal pain typical angina
## Frequency 23 50 87 143
## Proportion 0.076 0.165 0.287 0.472
ggplot(data = df,
mapping = aes(cp))+
geom_bar(fill = c(rep(c("aquamarine4", "aquamarine3",
"aquamarine2", "aquamarine1"),
times = 2))) +
facet_wrap(~sex, nrow = 2) +
scale_y_continuous(breaks = seq(0, 100, by = 10))+
labs(
title = "Chest pain types by Gender",
x = "Pain Type", y = "Number of patients") +
coord_flip() +
theme_minimal()
Here the typical angina is the most pain type of the chest of both sex. In addition to that there are male patients over 100 of that pain type.
Visualization of the resting blood pressure Vs serum cholesterols (highest values)
Hmisc::describe(df[4:5])
## df[4:5]
##
## 2 Variables 303 Observations
## --------------------------------------------------------------------------------
## trestbps
## n missing distinct Info Mean Gmd .05 .10
## 303 0 49 0.995 131.6 19.32 108 110
## .25 .50 .75 .90 .95
## 120 130 140 152 160
##
## lowest : 94 100 101 102 104, highest: 174 178 180 192 200
## --------------------------------------------------------------------------------
## chol
## n missing distinct Info Mean Gmd .05 .10
## 303 0 152 1 246.3 55.95 175.0 188.0
## .25 .50 .75 .90 .95
## 211.0 240.0 274.5 308.8 326.9
##
## lowest : 126 131 141 149 157, highest: 394 407 409 417 564
## --------------------------------------------------------------------------------
(bp <- filter(df, trestbps >= 135, chol >= 250) %>%
select(sex, chol, trestbps,fbs) %>%
arrange(trestbps))
## # A tibble: 52 x 4
## sex chol trestbps fbs
## <fct> <dbl> <dbl> <fct>
## 1 Female 304 135 TRUE
## 2 Female 252 135 FALSE
## 3 Female 250 135 FALSE
## 4 Male 254 135 FALSE
## 5 Female 319 136 TRUE
## 6 Male 315 136 FALSE
## 7 Male 257 138 FALSE
## 8 Male 271 138 FALSE
## 9 Male 282 138 TRUE
## 10 Female 294 138 TRUE
## # ... with 42 more rows
ggplot(bp, mapping = aes(x = chol, y =trestbps)) +
geom_point(col = "chartreuse")+
facet_wrap(fbs~sex) +
labs(
title = "Resting blood pressure Vs Serum cholesterols",
subtitle = "In order to Gender & Fasting blood sugar",
caption = "-Highest values (trestbps >= 135, chol >= 250)",
x = "Serum cholesterols", y = "Resting blood pressure"
) +
scale_y_continuous(breaks = seq(140, 200, by = 10)) +
scale_x_continuous(breaks = seq(250, 500, by = 25)) +
theme_dark()
fbs(fasting blood sugar > 120 mg/dl) is divided as True & False logistic variable with Gender. Exploring these graphs we can observe that only 4 male patients' classify as high fbs and most of the patients are lower fbs who are male.
Visualization of heart rate Vs age
Hmisc::describe(df[7:9])
## df[7:9]
##
## 3 Variables 303 Observations
## --------------------------------------------------------------------------------
## restecg
## n missing distinct
## 303 0 3
##
## Value 0 1 2
## Frequency 147 152 4
## Proportion 0.485 0.502 0.013
## --------------------------------------------------------------------------------
## thalach
## n missing distinct Info Mean Gmd .05 .10
## 303 0 91 1 149.6 25.77 108.1 116.0
## .25 .50 .75 .90 .95
## 133.5 153.0 166.0 176.6 181.9
##
## lowest : 71 88 90 95 96, highest: 190 192 194 195 202
## --------------------------------------------------------------------------------
## exang
## n missing distinct
## 303 0 2
##
## Value no yes
## Frequency 204 99
## Proportion 0.673 0.327
## --------------------------------------------------------------------------------
ggplot(data = df, mapping = aes(x = age, y = thalach)) +
geom_boxplot(fill = "grey25", col = "white") +
geom_point(color = "darkgoldenrod1") +
facet_wrap(restecg~exang~sex, nrow = 1) +
scale_y_continuous(breaks = seq(80,200, by = 10)) +
labs(
title = "Maximum heart rate Vs Age",
subtitle = "-In order to Electrocardiographic result, angina & Gender",
x = "Age in year", y = "Maximum heart rate") +
theme_dark()
Maximum heart rate achieved has been taken here for explore the variation with age in years. Boxplot and scatter plot have carried out here by them we can see how the data point are varying with in order to Electrocardiographic result, angina & Gender.
The resting electrocardiographic result are taken as; 0- normal, 1- having ST-T wave abnormality & 2- showing probable. And we can observe here only few patients have that problem.
The exercise include angina are taken as Yes or No category.
In order to these things, we are able to explore the data as much easier.
Visualization of thalassemia
Hmisc::describe(df[12:13])
## df[12:13]
##
## 2 Variables 303 Observations
## --------------------------------------------------------------------------------
## ca
## n missing distinct Info Mean Gmd
## 303 0 5 0.795 0.7294 1.005
##
## lowest : 0 1 2 3 4, highest: 0 1 2 3 4
##
## Value 0 1 2 3 4
## Frequency 175 65 38 20 5
## Proportion 0.578 0.215 0.125 0.066 0.017
## --------------------------------------------------------------------------------
## thal
## n missing distinct
## 303 0 3
##
## Value fixed defect normal reversible defect
## Frequency 166 20 117
## Proportion 0.548 0.066 0.386
## --------------------------------------------------------------------------------
ggplot(df, mapping = aes(x = ca)) +
geom_bar(fill = "seagreen1") +
facet_wrap(~thal ) +
labs(
title = "Thalassemia disease case variation",
subtitle = "-In order to vessels colored by fluoroscop",
x = "vessels colored by fluoroscop", y = "Number of patients") +
theme_dark()
We can observed most of patients include in fixed defect thalassemia disease and few patients have normal case of thalassemia.
Performing Decision Tree Machine Learning Algorithm
Preparation the data
Firstly, we are dividing our dataset into training and test data as 80% and 20% of the initial. (243 records for training & 60 records for test)
So creating that datasets we should consider about the random order of the data.
We'll solve this problem randomly ordering our heart disease data set prior splitting.
The order() function is used to rearrange the list. For random number generation, we'll use runif() function.
set.seed(12345)
df_rand <- df[order(runif(303)), ]
df_train <- df_rand[1:243, ]
df_test <- df_rand[244:303, ]
Just look what we have prepared,
prop.table(table(df_train$target))
##
## NO YES
## 0.4485597 0.5514403
prop.table(table(df_test$target))
##
## NO YES
## 0.4833333 0.5166667
This appears to be a fairly equal split, so we can now build our decision tree.
Training a model on the data
We will use the C5.0 algorithm for training our decision tree model.
heart_model <- C5.0(df_train[-14], df_train$target)
We can see some basic data about the tree by just typing its' name,
heart_model
##
## Call:
## C5.0.default(x = df_train[-14], y = df_train$target)
##
## Classification Tree
## Number of samples: 243
## Number of predictors: 13
##
## Tree size: 22
##
## Non-standard options: attempt to group attributes
The tree is listed size of 22, which indicates that the tree is 22 decisions deep.
To the decisions just type the summary() function.
summary(heart_model)
##
## Call:
## C5.0.default(x = df_train[-14], y = df_train$target)
##
##
## C5.0 [Release 2.07 GPL Edition] Mon Sep 27 04:06:43 2021
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 243 cases (14 attributes) from undefined.data
##
## Decision tree:
##
## thal = fixed defect:
## :...ca <= 0:
## : :...oldpeak <= 1.6: YES (82/5)
## : : oldpeak > 1.6:
## : : :...age <= 61: NO (3)
## : : age > 61: YES (2)
## : ca > 0:
## : :...cp = non-anginal pain: YES (14/1)
## : cp in {asymptomatic,atypical angina,typical angina}:
## : :...sex = Male: NO (18/4)
## : sex = Female:
## : :...slope = downsloping: YES (7/1)
## : slope in {flat,upsloping}: NO (4/1)
## thal in {normal,reversible defect}:
## :...cp in {asymptomatic,atypical angina,non-anginal pain}:
## :...slope = upsloping: YES (4/1)
## : slope = downsloping:
## : :...restecg in {1,2}: YES (10/1)
## : : restecg = 0:
## : : :...trestbps <= 130: YES (2)
## : : trestbps > 130: NO (2)
## : slope = flat:
## : :...ca > 0: NO (9/1)
## : ca <= 0:
## : :...thal = normal: YES (1)
## : thal = reversible defect:
## : :...restecg in {1,2}: NO (4/1)
## : restecg = 0:
## : :...trestbps <= 136: YES (4)
## : trestbps > 136: NO (4/1)
## cp = typical angina:
## :...oldpeak > 0.5: NO (55/1)
## oldpeak <= 0.5:
## :...thal = normal: YES (1)
## thal = reversible defect:
## :...restecg in {0,2}: NO (5/1)
## restecg = 1:
## :...trestbps > 136: NO (4)
## trestbps <= 136:
## :...chol <= 244: YES (5)
## chol > 244: NO (3/1)
##
##
## Evaluation on training data (243 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 22 20( 8.2%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 100 9 (a): class NO
## 11 123 (b): class YES
##
##
## Attribute usage:
##
## 100.00% thal
## 65.84% oldpeak
## 64.20% cp
## 62.55% ca
## 20.99% slope
## 17.70% restecg
## 11.93% sex
## 9.88% trestbps
## 3.29% chol
## 2.06% age
##
##
## Time: 0.0 secs
The Errors field notes that the model correctly classified all, but 20 out of 243 training instances for an error rate of 8.2%.
Improving model performance (Boosting the accuracy of decision trees)
The C5.0() function makes it easy to add boosting to our C5.0 decision tree. We simply need to add an additional trials parameter indicating the number of separate decision trees to use in the boosted team.
heart_model <- C5.0(df_train[-14], df_train$target, trials = 11)
By boosting we can perform the model as 0.4% model error given (we can simply see that just typing summary() function. So we can improve our model performance by reducing error rate 8.2% to 0.4%.
Evaluating model performance
heart_predict <- predict(heart_model, df_test)
confusionMatrix(heart_predict, df_test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction NO YES
## NO 26 5
## YES 3 26
##
## Accuracy : 0.8667
## 95% CI : (0.7541, 0.9406)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 1.105e-08
##
## Kappa : 0.7336
##
## Mcnemar's Test P-Value : 0.7237
##
## Sensitivity : 0.8966
## Specificity : 0.8387
## Pos Pred Value : 0.8387
## Neg Pred Value : 0.8966
## Prevalence : 0.4833
## Detection Rate : 0.4333
## Detection Prevalence : 0.5167
## Balanced Accuracy : 0.8676
##
## 'Positive' Class : NO
##
In this case we can get; Model Accuracy = 86.67%, Kappa = 73.36%, Sensitivity = 89.66%, Specificity = 83.87%.
Plotting
In the final tree we have 11 separation decisions trees, by plot() function with trial parameter we can just plot each tree.
Just exampling, it's has been plotted 4, 6 & 9 trials.
plot(heart_model, trial = 4)
plot(heart_model, trial = 6)
plot(heart_model, trial = 9)
Performing Random Forest Machine Learning Algorithm
We'd like to test wether that our model performance is more by trying using Random Forest Algorithm
set.seed(300)
heart_rf <- randomForest(target ~ . , data = df)
heart_rf
##
## Call:
## randomForest(formula = target ~ ., data = df)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 16.5%
## Confusion matrix:
## NO YES class.error
## NO 109 29 0.2101449
## YES 21 144 0.1272727
To look at a summary of the model's performance, we can simply type the resulting object's name, as expected, the output notes that the random forest included 500 trees and tried 3 variables at each split.
It's seemingly poor re-substitution error according to the display confusion matrix, the error (OOB) rate of 16.5%.
Evaluating random forest performance
For the most accurate comparison of model performance, we'll use repeated 10-fold cross-validation: 5 times 10-fold CV.
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
grid_rf <- expand.grid(.mtry = c(2, 3, 4, 5))
set.seed(300)
(rf_model_heart <- train(target ~ . , data = df, method = "rf",
metric = "Kappa", trControl = ctrl,
tuneGrid = grid_rf))
## Random Forest
##
## 303 samples
## 13 predictor
## 2 classes: 'NO', 'YES'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 272, 273, 272, 273, 273, 272, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8182240 0.6313864
## 3 0.8228046 0.6408301
## 4 0.8216433 0.6383448
## 5 0.8222684 0.6394190
##
## Kappa was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
plot(rf_model_heart)
According to that result we'd better to take mtry as 3. Then we can predict more accuracy rate from our random forest model.
(heart_rf1 <- randomForest(df_train[-14], df_train$target,
ntree = 500, mtry = 3))
##
## Call:
## randomForest(x = df_train[-14], y = df_train$target, ntree = 500, mtry = 3)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 20.58%
## Confusion matrix:
## NO YES class.error
## NO 83 26 0.2385321
## YES 24 110 0.1791045
plot(varImp(rf_model_heart), main = "Variable Impotance for model")
plot(heart_rf1, main = "Trees Vs Class Error")
From the graph the red line represent not having heart diseases and green represent having the diseases.
The black line represent overall OOB error rate.
Conclusions
After performing two algorithm which are Decision Tree and Random Forest we can conclude both had good accuracy rate and out of which decision Tree gave a better accuracy of 86.7%. We can also say that it was misclassification 13.3%.
0 Comments