Machine Learning Prediction of Heart Diseases

Source codes:- Heart diseases from GitHub

Objective

This dataset dates from 1988 and include some several databases of different country. On here I use this dataset from Kaggle website for predict the statement of heart disease using Machine Learning Algorithms.
The main object is to predict whether the given patient is having heart disease or not. So it helped for that some several variables such as, age, sex, cholesterol level, blood pressure, types of chest pains etc.
Here I used Machine Learning algorithms which are Decision Tree & Random Forest.

Let's begin by loading the required, needed libraries and importing the Heart_Disease.csv datasheet.

library(readr)
library(ggplot2)
library(Hmisc)
library(forcats)
library(dplyr)
library(psych)
library(C50)
library(caret)
library(randomForest)

df <- read_csv("newdataset/Heart_Disease.csv")

head(df)

## # A tibble: 6 x 14
##     age sex    cp      trestbps  chol fbs   restecg thalach exang oldpeak slope 
##   <dbl> <chr>  <chr>      <dbl> <dbl> <lgl>   <dbl>   <dbl> <chr>   <dbl> <chr> 
## 1    63 Male   asympt~      145   233 TRUE        0     150 no        2.3 upslo~
## 2    37 Male   non-an~      130   250 FALSE       1     187 no        3.5 upslo~
## 3    41 Female atypic~      130   204 FALSE       0     172 no        1.4 downs~
## 4    56 Male   atypic~      120   236 FALSE       1     178 no        0.8 downs~
## 5    57 Female typica~      120   354 FALSE       1     163 yes       0.6 downs~
## 6    57 Male   typica~      140   192 FALSE       1     148 no        0.4 flat  
## # ... with 3 more variables: ca <dbl>, thal <chr>, target <chr>

Here head function gives us the top 6 examples of this data for much understanding.
The independent variables are related to the patients details and based on those variables we select the dependent variable as statement of disease which have or not.

age - age in years
sex - sex of the patients
cp - chest pain type
trestbps - resting blood pressure (in mm Hg on admission to the hospital)
chol - serum cholesterols in mh/dl
fbs - (fasting blood sugar > 120 mg/dl)
restecg - resting electrocardiographic result (0- normal; 1- having ST-T wave abnormality; 2- showing probable)
thalach - maximum heart rate achieved
exang - exercise include angina
oldpeak - ST depression induced by exercise relative to rest
slope - the slope of the peak exercise ST segment
ca - number of major vessels (0-3) colored by fluoroscopy
thal - thalassemia
target - disease status

According to nature of the features we implementing the changes.

df$sex <- as.factor(df$sex)
df$cp <- as.factor(df$cp)
df$fbs <- as.factor(df$fbs)
df$restecg <- as.factor(df$restecg)
df$exang <- as.factor(df$exang)
df$slope <- as.factor(df$slope)
df$thal <- as.factor(df$thal)
df$target <- as.factor(df$target)

str(df)

## spec_tbl_df [303 x 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ age     : num [1:303] 63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 2 1 2 2 2 ...
##  $ cp      : Factor w/ 4 levels "asymptomatic",..: 1 3 2 2 4 4 2 2 3 3 ...
##  $ trestbps: num [1:303] 145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : num [1:303] 233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : Factor w/ 2 levels "FALSE","TRUE": 2 1 1 1 1 1 1 1 2 1 ...
##  $ restecg : Factor w/ 3 levels "0","1","2": 1 2 1 2 2 2 1 2 2 2 ...
##  $ thalach : num [1:303] 150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 1 1 ...
##  $ oldpeak : num [1:303] 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : Factor w/ 3 levels "downsloping",..: 3 3 1 1 1 2 2 1 1 1 ...
##  $ ca      : num [1:303] 0 0 0 0 0 0 0 0 0 0 ...
##  $ thal    : Factor w/ 3 levels "fixed defect",..: 2 1 1 1 1 2 1 3 3 1 ...
##  $ target  : Factor w/ 2 levels "NO","YES": 2 2 2 2 2 2 2 2 2 2 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   age = col_double(),
##   ..   sex = col_character(),
##   ..   cp = col_character(),
##   ..   trestbps = col_double(),
##   ..   chol = col_double(),
##   ..   fbs = col_logical(),
##   ..   restecg = col_double(),
##   ..   thalach = col_double(),
##   ..   exang = col_character(),
##   ..   oldpeak = col_double(),
##   ..   slope = col_character(),
##   ..   ca = col_double(),
##   ..   thal = col_character(),
##   ..   target = col_character()
##   .. )

From the srt() function we can get an idea about structure of the dataset to fine which are the characters and numeric.

Hmisc::describe(df)

## df 
## 
##  14  Variables      303  Observations
## --------------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      303        0       41    0.999    54.37    10.36     39.1     42.0 
##      .25      .50      .75      .90      .95 
##     47.5     55.0     61.0     66.0     68.0 
## 
## lowest : 29 34 35 37 38, highest: 70 71 74 76 77
## --------------------------------------------------------------------------------
## sex 
##        n  missing distinct 
##      303        0        2 
##                         
## Value      Female   Male
## Frequency      96    207
## Proportion  0.317  0.683
## --------------------------------------------------------------------------------
## cp 
##        n  missing distinct 
##      303        0        4 
##                                                                               
## Value          asymptomatic  atypical angina non-anginal pain   typical angina
## Frequency                23               50               87              143
## Proportion            0.076            0.165            0.287            0.472
## --------------------------------------------------------------------------------
## trestbps 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      303        0       49    0.995    131.6    19.32      108      110 
##      .25      .50      .75      .90      .95 
##      120      130      140      152      160 
## 
## lowest :  94 100 101 102 104, highest: 174 178 180 192 200
## --------------------------------------------------------------------------------
## chol 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      303        0      152        1    246.3    55.95    175.0    188.0 
##      .25      .50      .75      .90      .95 
##    211.0    240.0    274.5    308.8    326.9 
## 
## lowest : 126 131 141 149 157, highest: 394 407 409 417 564
## --------------------------------------------------------------------------------
## fbs 
##        n  missing distinct 
##      303        0        2 
##                       
## Value      FALSE  TRUE
## Frequency    258    45
## Proportion 0.851 0.149
## --------------------------------------------------------------------------------
## restecg 
##        n  missing distinct 
##      303        0        3 
##                             
## Value          0     1     2
## Frequency    147   152     4
## Proportion 0.485 0.502 0.013
## --------------------------------------------------------------------------------
## thalach 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      303        0       91        1    149.6    25.77    108.1    116.0 
##      .25      .50      .75      .90      .95 
##    133.5    153.0    166.0    176.6    181.9 
## 
## lowest :  71  88  90  95  96, highest: 190 192 194 195 202
## --------------------------------------------------------------------------------
## exang 
##        n  missing distinct 
##      303        0        2 
##                       
## Value         no   yes
## Frequency    204    99
## Proportion 0.673 0.327
## --------------------------------------------------------------------------------
## oldpeak 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      303        0       40    0.964     1.04    1.225      0.0      0.0 
##      .25      .50      .75      .90      .95 
##      0.0      0.8      1.6      2.8      3.4 
## 
## lowest : 0.0 0.1 0.2 0.3 0.4, highest: 4.0 4.2 4.4 5.6 6.2
## --------------------------------------------------------------------------------
## slope 
##        n  missing distinct 
##      303        0        3 
##                                               
## Value      downsloping        flat   upsloping
## Frequency          142         140          21
## Proportion       0.469       0.462       0.069
## --------------------------------------------------------------------------------
## ca 
##        n  missing distinct     Info     Mean      Gmd 
##      303        0        5    0.795   0.7294    1.005 
## 
## lowest : 0 1 2 3 4, highest: 0 1 2 3 4
##                                         
## Value          0     1     2     3     4
## Frequency    175    65    38    20     5
## Proportion 0.578 0.215 0.125 0.066 0.017
## --------------------------------------------------------------------------------
## thal 
##        n  missing distinct 
##      303        0        3 
##                                                                 
## Value           fixed defect            normal reversible defect
## Frequency                166                20               117
## Proportion             0.548             0.066             0.386
## --------------------------------------------------------------------------------
## target 
##        n  missing distinct 
##      303        0        2 
##                       
## Value         NO   YES
## Frequency    138   165
## Proportion 0.455 0.545
## --------------------------------------------------------------------------------

From describe function we can explore the content of the variable more than summary function.
According to these things we can extract some of the features of the data as graphically. Then we can observe the things more than just appearing.

Correlations

pairs.panels(df[c("age", "trestbps", "chol", "thalach", "oldpeak")])

From above output the diagonal, the scatterplots has been replaced with a correlation matrix. On the diagonal, a histogram depicting the distribution of values for each feature is shown. Finally, the scatterplots below the diagonal now are presented with additional visual information.
So we have highest correlation between age & maximum heart. That is a negative (-0.4); lowest correlation between maximum heart & serum cholesterols (-0.01)
(The oval-shaped object on each scatterplot is a correlation ellipse. It provides a visualization of how strongly correlated the variables are. The dot at the center of the ellipse indicates the point of the mean value for the x axis variable and y axis variable. The correlation between the two variables is indicated by the shape of the ellipse; the more it is stretched, the stronger the correlation.)

Visualization of the gender

Hmisc::describe(df$sex)

## df$sex 
##        n  missing distinct 
##      303        0        2 
##                         
## Value      Female   Male
## Frequency      96    207
## Proportion  0.317  0.683

ggplot(data = df) + 
  geom_bar(mapping = aes(sex), fill = c("lightyellow3", "lavenderblush3")) +
  labs(
    title = paste("Gender"),
    y = "Number of patients", 
    x = "Gender" ) +
  theme_dark()

There are much more male patients than female. It's like half of the male patients.

Visualization of the chest pain type

Hmisc::describe(df$cp)

## df$cp 
##        n  missing distinct 
##      303        0        4 
##                                                                               
## Value          asymptomatic  atypical angina non-anginal pain   typical angina
## Frequency                23               50               87              143
## Proportion            0.076            0.165            0.287            0.472

ggplot(data = df, 
       mapping = aes(cp))+
  geom_bar(fill = c(rep(c("aquamarine4", "aquamarine3",
                          "aquamarine2", "aquamarine1"),
                        times = 2))) +
  facet_wrap(~sex, nrow = 2) +
  scale_y_continuous(breaks = seq(0, 100, by = 10))+
  labs(
    title = "Chest pain types by Gender",
    x = "Pain Type", y = "Number of patients") +
  coord_flip() +
  theme_minimal()

Here the typical angina is the most pain type of the chest of both sex. In addition to that there are male patients over 100 of that pain type.

Visualization of the resting blood pressure Vs serum cholesterols (highest values)

Hmisc::describe(df[4:5])

## df[4:5] 
## 
##  2  Variables      303  Observations
## --------------------------------------------------------------------------------
## trestbps 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      303        0       49    0.995    131.6    19.32      108      110 
##      .25      .50      .75      .90      .95 
##      120      130      140      152      160 
## 
## lowest :  94 100 101 102 104, highest: 174 178 180 192 200
## --------------------------------------------------------------------------------
## chol 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      303        0      152        1    246.3    55.95    175.0    188.0 
##      .25      .50      .75      .90      .95 
##    211.0    240.0    274.5    308.8    326.9 
## 
## lowest : 126 131 141 149 157, highest: 394 407 409 417 564
## --------------------------------------------------------------------------------

(bp <- filter(df, trestbps >= 135, chol >= 250) %>%
  select(sex, chol, trestbps,fbs) %>%
    arrange(trestbps))

## # A tibble: 52 x 4
##    sex     chol trestbps fbs  
##    <fct>  <dbl>    <dbl> <fct>
##  1 Female   304      135 TRUE 
##  2 Female   252      135 FALSE
##  3 Female   250      135 FALSE
##  4 Male     254      135 FALSE
##  5 Female   319      136 TRUE 
##  6 Male     315      136 FALSE
##  7 Male     257      138 FALSE
##  8 Male     271      138 FALSE
##  9 Male     282      138 TRUE 
## 10 Female   294      138 TRUE 
## # ... with 42 more rows

ggplot(bp, mapping = aes(x = chol, y =trestbps)) + 
  geom_point(col = "chartreuse")+ 
  facet_wrap(fbs~sex) +
  labs(
    title = "Resting blood pressure Vs Serum cholesterols",
    subtitle = "In order to Gender & Fasting blood sugar",
    caption = "-Highest values (trestbps >= 135, chol >= 250)",
    x = "Serum cholesterols", y = "Resting blood pressure"
  ) +
  scale_y_continuous(breaks = seq(140, 200, by = 10)) +
  scale_x_continuous(breaks = seq(250, 500, by = 25)) +
  theme_dark()

fbs(fasting blood sugar > 120 mg/dl) is divided as True & False logistic variable with Gender. Exploring these graphs we can observe that only 4 male patients' classify as high fbs and most of the patients are lower fbs who are male.

Visualization of heart rate Vs age

Hmisc::describe(df[7:9])

## df[7:9] 
## 
##  3  Variables      303  Observations
## --------------------------------------------------------------------------------
## restecg 
##        n  missing distinct 
##      303        0        3 
##                             
## Value          0     1     2
## Frequency    147   152     4
## Proportion 0.485 0.502 0.013
## --------------------------------------------------------------------------------
## thalach 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      303        0       91        1    149.6    25.77    108.1    116.0 
##      .25      .50      .75      .90      .95 
##    133.5    153.0    166.0    176.6    181.9 
## 
## lowest :  71  88  90  95  96, highest: 190 192 194 195 202
## --------------------------------------------------------------------------------
## exang 
##        n  missing distinct 
##      303        0        2 
##                       
## Value         no   yes
## Frequency    204    99
## Proportion 0.673 0.327
## --------------------------------------------------------------------------------

ggplot(data = df, mapping = aes(x = age, y = thalach)) +
  geom_boxplot(fill = "grey25", col = "white") +
  geom_point(color = "darkgoldenrod1") +
  facet_wrap(restecg~exang~sex, nrow = 1) +
  scale_y_continuous(breaks = seq(80,200, by = 10)) +
  labs(
    title = "Maximum heart rate Vs Age",
    subtitle = "-In order to Electrocardiographic result, angina & Gender",
    x = "Age in year", y = "Maximum heart rate") +
  theme_dark()

Maximum heart rate achieved has been taken here for explore the variation with age in years. Boxplot and scatter plot have carried out here by them we can see how the data point are varying with in order to Electrocardiographic result, angina & Gender.
The resting electrocardiographic result are taken as; 0- normal, 1- having ST-T wave abnormality & 2- showing probable. And we can observe here only few patients have that problem.
The exercise include angina are taken as Yes or No category.
In order to these things, we are able to explore the data as much easier.

Visualization of thalassemia

Hmisc::describe(df[12:13])

## df[12:13] 
## 
##  2  Variables      303  Observations
## --------------------------------------------------------------------------------
## ca 
##        n  missing distinct     Info     Mean      Gmd 
##      303        0        5    0.795   0.7294    1.005 
## 
## lowest : 0 1 2 3 4, highest: 0 1 2 3 4
##                                         
## Value          0     1     2     3     4
## Frequency    175    65    38    20     5
## Proportion 0.578 0.215 0.125 0.066 0.017
## --------------------------------------------------------------------------------
## thal 
##        n  missing distinct 
##      303        0        3 
##                                                                 
## Value           fixed defect            normal reversible defect
## Frequency                166                20               117
## Proportion             0.548             0.066             0.386
## --------------------------------------------------------------------------------

ggplot(df, mapping = aes(x = ca)) +
  geom_bar(fill = "seagreen1") +
  facet_wrap(~thal ) +
  labs(
    title = "Thalassemia disease case variation",
    subtitle = "-In order to vessels colored by fluoroscop",
    x = "vessels colored by fluoroscop", y = "Number of patients") +
  theme_dark()

We can observed most of patients include in fixed defect thalassemia disease and few patients have normal case of thalassemia.

Performing Decision Tree Machine Learning Algorithm

Preparation the data

Firstly, we are dividing our dataset into training and test data as 80% and 20% of the initial. (243 records for training & 60 records for test)
So creating that datasets we should consider about the random order of the data.
We'll solve this problem randomly ordering our heart disease data set prior splitting.
The order() function is used to rearrange the list. For random number generation, we'll use runif() function.

set.seed(12345)
df_rand <- df[order(runif(303)), ]
df_train <- df_rand[1:243, ]
df_test <- df_rand[244:303, ]

Just look what we have prepared,

prop.table(table(df_train$target))

## 
##        NO       YES 
## 0.4485597 0.5514403

prop.table(table(df_test$target))

## 
##        NO       YES 
## 0.4833333 0.5166667

This appears to be a fairly equal split, so we can now build our decision tree.

Training a model on the data

We will use the C5.0 algorithm for training our decision tree model.

heart_model <- C5.0(df_train[-14], df_train$target)

We can see some basic data about the tree by just typing its' name,

heart_model

## 
## Call:
## C5.0.default(x = df_train[-14], y = df_train$target)
## 
## Classification Tree
## Number of samples: 243 
## Number of predictors: 13 
## 
## Tree size: 22 
## 
## Non-standard options: attempt to group attributes

The tree is listed size of 22, which indicates that the tree is 22 decisions deep.
To the decisions just type the summary() function.

summary(heart_model)

## 
## Call:
## C5.0.default(x = df_train[-14], y = df_train$target)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Mon Sep 27 04:06:43 2021
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 243 cases (14 attributes) from undefined.data
## 
## Decision tree:
## 
## thal = fixed defect:
## :...ca <= 0:
## :   :...oldpeak <= 1.6: YES (82/5)
## :   :   oldpeak > 1.6:
## :   :   :...age <= 61: NO (3)
## :   :       age > 61: YES (2)
## :   ca > 0:
## :   :...cp = non-anginal pain: YES (14/1)
## :       cp in {asymptomatic,atypical angina,typical angina}:
## :       :...sex = Male: NO (18/4)
## :           sex = Female:
## :           :...slope = downsloping: YES (7/1)
## :               slope in {flat,upsloping}: NO (4/1)
## thal in {normal,reversible defect}:
## :...cp in {asymptomatic,atypical angina,non-anginal pain}:
##     :...slope = upsloping: YES (4/1)
##     :   slope = downsloping:
##     :   :...restecg in {1,2}: YES (10/1)
##     :   :   restecg = 0:
##     :   :   :...trestbps <= 130: YES (2)
##     :   :       trestbps > 130: NO (2)
##     :   slope = flat:
##     :   :...ca > 0: NO (9/1)
##     :       ca <= 0:
##     :       :...thal = normal: YES (1)
##     :           thal = reversible defect:
##     :           :...restecg in {1,2}: NO (4/1)
##     :               restecg = 0:
##     :               :...trestbps <= 136: YES (4)
##     :                   trestbps > 136: NO (4/1)
##     cp = typical angina:
##     :...oldpeak > 0.5: NO (55/1)
##         oldpeak <= 0.5:
##         :...thal = normal: YES (1)
##             thal = reversible defect:
##             :...restecg in {0,2}: NO (5/1)
##                 restecg = 1:
##                 :...trestbps > 136: NO (4)
##                     trestbps <= 136:
##                     :...chol <= 244: YES (5)
##                         chol > 244: NO (3/1)
## 
## 
## Evaluation on training data (243 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      22   20( 8.2%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     100     9    (a): class NO
##      11   123    (b): class YES
## 
## 
##  Attribute usage:
## 
##  100.00% thal
##   65.84% oldpeak
##   64.20% cp
##   62.55% ca
##   20.99% slope
##   17.70% restecg
##   11.93% sex
##    9.88% trestbps
##    3.29% chol
##    2.06% age
## 
## 
## Time: 0.0 secs

The Errors field notes that the model correctly classified all, but 20 out of 243 training instances for an error rate of 8.2%.

Improving model performance (Boosting the accuracy of decision trees)

The C5.0() function makes it easy to add boosting to our C5.0 decision tree. We simply need to add an additional trials parameter indicating the number of separate decision trees to use in the boosted team.

heart_model <- C5.0(df_train[-14], df_train$target, trials = 11)

By boosting we can perform the model as 0.4% model error given (we can simply see that just typing summary() function. So we can improve our model performance by reducing error rate 8.2% to 0.4%.

Evaluating model performance

heart_predict <- predict(heart_model, df_test)
confusionMatrix(heart_predict, df_test$target)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction NO YES
##        NO  26   5
##        YES  3  26
##                                           
##                Accuracy : 0.8667          
##                  95% CI : (0.7541, 0.9406)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 1.105e-08       
##                                           
##                   Kappa : 0.7336          
##                                           
##  Mcnemar's Test P-Value : 0.7237          
##                                           
##             Sensitivity : 0.8966          
##             Specificity : 0.8387          
##          Pos Pred Value : 0.8387          
##          Neg Pred Value : 0.8966          
##              Prevalence : 0.4833          
##          Detection Rate : 0.4333          
##    Detection Prevalence : 0.5167          
##       Balanced Accuracy : 0.8676          
##                                           
##        'Positive' Class : NO              
##

In this case we can get; Model Accuracy = 86.67%, Kappa = 73.36%, Sensitivity = 89.66%, Specificity = 83.87%.

Plotting

In the final tree we have 11 separation decisions trees, by plot() function with trial parameter we can just plot each tree.
Just exampling, it's has been plotted 4, 6 & 9 trials.

plot(heart_model, trial = 4)

plot(heart_model, trial = 6)

plot(heart_model, trial = 9)

Performing Random Forest Machine Learning Algorithm

We'd like to test wether that our model performance is more by trying using Random Forest Algorithm

set.seed(300)
heart_rf <- randomForest(target ~ . , data = df)
heart_rf

## 
## Call:
##  randomForest(formula = target ~ ., data = df) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 16.5%
## Confusion matrix:
##      NO YES class.error
## NO  109  29   0.2101449
## YES  21 144   0.1272727

To look at a summary of the model's performance, we can simply type the resulting object's name, as expected, the output notes that the random forest included 500 trees and tried 3 variables at each split.
It's seemingly poor re-substitution error according to the display confusion matrix, the error (OOB) rate of 16.5%.

Evaluating random forest performance

For the most accurate comparison of model performance, we'll use repeated 10-fold cross-validation: 5 times 10-fold CV.

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
grid_rf <- expand.grid(.mtry = c(2, 3, 4, 5))
set.seed(300)
(rf_model_heart <- train(target ~ . , data = df, method = "rf",
                metric = "Kappa", trControl = ctrl,
                tuneGrid = grid_rf))

## Random Forest 
## 
## 303 samples
##  13 predictor
##   2 classes: 'NO', 'YES' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 272, 273, 272, 273, 273, 272, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8182240  0.6313864
##   3     0.8228046  0.6408301
##   4     0.8216433  0.6383448
##   5     0.8222684  0.6394190
## 
## Kappa was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.

plot(rf_model_heart)

According to that result we'd better to take mtry as 3. Then we can predict more accuracy rate from our random forest model.

(heart_rf1 <- randomForest(df_train[-14], df_train$target,
                            ntree = 500, mtry = 3))

## 
## Call:
##  randomForest(x = df_train[-14], y = df_train$target, ntree = 500,      mtry = 3) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 20.58%
## Confusion matrix:
##     NO YES class.error
## NO  83  26   0.2385321
## YES 24 110   0.1791045

plot(varImp(rf_model_heart), main = "Variable Impotance for model")

plot(heart_rf1, main = "Trees Vs Class Error")

From the graph the red line represent not having heart diseases and green represent having the diseases.
The black line represent overall OOB error rate.

Conclusions

After performing two algorithm which are Decision Tree and Random Forest we can conclude both had good accuracy rate and out of which decision Tree gave a better accuracy of 86.7%. We can also say that it was misclassification 13.3%.

STATISTICS RESEARCHES REPORTS

Heart Diseases Prediction

Machine Learning Prediction of Heart Diseases

Dataset :- Heart disease csv file

Objective

Correlations

Visualization of the gender

Visualization of the chest pain type

Visualization of the resting blood pressure Vs serum cholesterols (highest values)

Visualization of heart rate Vs age

Visualization of thalassemia

Performing Decision Tree Machine Learning Algorithm

Preparation the data

Training a model on the data

Improving model performance (Boosting the accuracy of decision trees)

Evaluating model performance

Plotting

Performing Random Forest Machine Learning Algorithm

Evaluating random forest performance

Conclusions

Post a Comment

0 Comments

Followers

INTRODUCTION

CONECTIONS

Contact Form

Menu Footer Widget